Experienced ML practitioners know a truth that surprises beginners: the quality of your features matters more than the sophistication of your model. A linear model with excellent features will often outperform a deep neural network fed raw, untransformed data. Feature engineering -- the process of transforming raw data into representations that ML algorithms can learn from effectively -- is where most of the real leverage in applied ML comes from.
This is also where most ML project time actually goes. A common rule of thumb puts the split at roughly 80% of your time on data and features and 20% on model selection and tuning. This ratio surprises new practitioners who expect the model training to be the hard part.
Why Raw Data Rarely Works Out of the Box
Machine learning models make mathematical assumptions about their inputs. Linear models assume features have a roughly linear relationship with the target. Tree models work best with features that have meaningful split points. Neural networks struggle with features on wildly different scales without normalization.
Raw data violates these assumptions constantly. Income data has a long right tail -- a few billionaires skew the distribution massively. Timestamps are stored as plain integers, which hides cyclical patterns such as hour of day and day of week (numbering the days 1 to 7 does not make day 7 "7x" day 1). Categorical variables like country names cannot be fed as strings. Many relationships are non-linear, multiplicative, or interaction-based.
Feature engineering is the process of fixing these problems.
Log Transforms: Handling Skewed Distributions
Many real-world numeric features are right-skewed: most values cluster near the low end with a long tail of large values. Income, transaction amounts, page view counts, company revenue -- all follow roughly this pattern.
Log-transforming skewed features has two benefits. It compresses large values (reducing the influence of outliers) and expands small values (increasing their discriminative power). After a log transform, a distribution that looked like a hockey stick often looks approximately normal -- much more suitable for most ML algorithms.
import numpy as np
df['log_income'] = np.log1p(df['income']) # log1p handles zero values
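Use np.log1p (which computes log(1 + x)) rather than np.log to handle zero values without producing negative infinity. It is meant for non-negative features: np.log1p returns negative infinity at exactly -1 and NaN for anything below -1.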
When to apply: whenever a numeric feature has a right-skewed distribution. Check histograms of your features before deciding.
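As a quick check alongside the histogram, you can compare skewness before and after the transform. The sketch below uses synthetic log-normal data as a stand-in for income; the data and the variable names skew_before and skew_after are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values (hypothetical data, log-normal by construction).
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10.0, sigma=1.0, size=5_000)})

df["log_income"] = np.log1p(df["income"])

# pandas' skew() is close to 0 for a symmetric distribution and
# strongly positive for a long right tail.
skew_before = df["income"].skew()
skew_after = df["log_income"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness that drops from well above 1 to near 0 suggests the transform did its job; if it barely moves, the feature was probably not log-shaped to begin with.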
One-Hot Encoding: Handling Categorical Variables
Tree models can sometimes handle raw categorical variables, but most ML algorithms require numeric inputs. The standard approach for nominal categories (categories with no meaningful order) is one-hot encoding: create a binary column for each category value.
"Country: [USA, UK, Canada]" becomes three binary columns: is_usa, is_uk, is_canada. Each row has exactly one "1" and the rest are "0".
import pandas as pd
df_encoded = pd.get_dummies(df, columns=['country'], drop_first=True)
The drop_first=True parameter drops one category column to avoid perfect multicollinearity (since the dropped category is implied when all others are 0). This matters for linear models; tree models are unaffected.
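To make the effect of drop_first concrete, here is a small sketch on a toy country column. The data is made up for illustration. Note that pandas names the columns country_Canada, country_UK, country_USA rather than the is_usa style used in the prose, and it drops the first category in sorted order.

```python
import pandas as pd

# Toy data, assumed purely for illustration.
df = pd.DataFrame({"country": ["USA", "UK", "Canada", "UK"]})

# One column per category: every row has exactly one active column.
full = pd.get_dummies(df, columns=["country"])

# drop_first=True removes the first category in sorted order (Canada here);
# a row with all remaining columns inactive now means "Canada".
reduced = pd.get_dummies(df, columns=["country"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```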
Pitfall: high-cardinality categoricals (user IDs, zip codes, product SKUs with thousands of unique values) produce enormous feature matrices and often do not encode the right information anyway. For high-cardinality categoricals, prefer target encoding (replace each category with the mean target value for that category, computed on training data only to avoid leakage) or embedding layers in neural networks.
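A minimal sketch of target encoding, assuming a hypothetical zip_code feature and a binary churned target. The category means are computed on the training frame only and then mapped onto the test frame; categories never seen in training fall back to the global training mean, which is one reasonable choice among several.

```python
import pandas as pd

# Hypothetical train/test frames for illustration.
train = pd.DataFrame({
    "zip_code": ["10001", "10001", "94105", "94105", "94105", "60601"],
    "churned": [1, 0, 1, 1, 0, 0],
})
test = pd.DataFrame({"zip_code": ["10001", "94105", "73301"]})  # 73301 is unseen

# Statistics come from the training data only.
global_mean = train["churned"].mean()
category_means = train.groupby("zip_code")["churned"].mean()

train["zip_te"] = train["zip_code"].map(category_means)
# Unseen categories map to NaN, so fall back to the global training mean.
test["zip_te"] = test["zip_code"].map(category_means).fillna(global_mean)

print(test)
```

This simple version still lets each training row see its own target value, which can overfit on rare categories. Smoothing toward the global mean or computing the encoding out-of-fold are common ways to reduce that.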
Cyclical Encoding: Handling Time-Based Features
Hour of day, day of week, month of year, and angle measurements are cyclical: hour 23 is adjacent to hour 0, not far from it. One-hot encoding avoids imposing a false ordering (it treats each value as a separate category) but discards the cyclical structure entirely. Raw numeric encoding (hour as a number 0-23) keeps an ordering but tells the model that hour 23 is far from hour 0.
Cyclical encoding using sine and cosine preserves the cyclical structure:
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
Now the model receives two features that together encode both the time and its cyclical relationship with other times. Hour 23 and hour 0 will be close in this representation.
Apply cyclical encoding to: hour of day (period 24), day of week (period 7), day of year (period 365), month (period 12), compass bearing (period 360).
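The claim that hour 23 and hour 0 end up close can be checked directly. The sketch below wraps the encoding in a helper, encode_cyclical, which is a hypothetical name introduced here for illustration, and compares Euclidean distances in the (sin, cos) plane.

```python
import numpy as np
import pandas as pd

def encode_cyclical(values: pd.Series, period: float) -> pd.DataFrame:
    """Map a cyclical feature onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({"sin": np.sin(angle), "cos": np.cos(angle)})

hours = pd.Series([0, 12, 23])
enc = encode_cyclical(hours, period=24)

def distance(i: int, j: int) -> float:
    """Euclidean distance between two encoded rows."""
    return float(np.linalg.norm(enc.iloc[i].to_numpy() - enc.iloc[j].to_numpy()))

d_23_0 = distance(2, 0)  # neighbours on the clock
d_12_0 = distance(1, 0)  # opposite sides of the clock
print(f"23h vs 0h: {d_23_0:.3f}, 12h vs 0h: {d_12_0:.3f}")
```

Hour 12 sits at the maximum possible distance (the diameter of the unit circle) from hour 0, while hour 23 is only a small step away, which is the structure a raw 0-23 integer fails to express.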
Interaction Features: Encoding Multiplicative Relationships
Many real-world relationships are multiplicative rather than additive. Revenue = price * quantity. Click rate = clicks / impressions. Churn risk might depend on the interaction between tenure and usage level, not either alone.
Interaction features encode these relationships explicitly:
df['revenue'] = df['price'] * df['quantity']
df['ctr'] = df['clicks'] / (df['impressions'] + 1) # +1 avoids division by zero
df['tenure_x_usage'] = df['tenure_months'] * df['weekly_active_days']
Tree models can discover interactions by splitting on both features in sequence, but explicit interaction features reduce the tree depth required to capture them and help linear models that cannot discover interactions at all.
The challenge: with 50 features, there are 1,225 possible pairwise interactions. You cannot try them all blindly. Use domain knowledge to identify which interactions are likely meaningful, or use feature importance scores from a tree model to identify which features are worth interacting.
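The pair count and the "shortlist first" approach can be sketched as follows. The shortlist itself is an assumption for illustration; in practice it would come from domain knowledge or from the top of a feature importance ranking.

```python
from itertools import combinations
from math import comb

import pandas as pd

# Number of unordered pairs among 50 features: 1225, the figure quoted above.
n_pairs = comb(50, 2)

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "price": [10.0, 5.0],
    "quantity": [2, 3],
    "tenure_months": [6, 12],
})

# Build interactions only for a hand-picked shortlist, not for every pair.
shortlist = ["price", "quantity", "tenure_months"]
for a, b in combinations(shortlist, 2):
    df[f"{a}_x_{b}"] = df[a] * df[b]

print(n_pairs)
print(df.columns.tolist())
```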
Polynomial Features and Binning
For features with non-linear relationships to the target, polynomial features can help linear models: add x^2, x^3 alongside the original x.
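A small sketch of why this works, using plain least squares on synthetic data where the target is quadratic in x by construction. The data and the helper fit_mse are assumptions made for illustration.

```python
import numpy as np

# Synthetic data: y depends on x**2, plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(scale=0.1, size=200)

def fit_mse(design: np.ndarray) -> float:
    """Fit ordinary least squares and return the training mean squared error."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.mean((design @ coef - y) ** 2))

ones = np.ones_like(x)
mse_linear = fit_mse(np.column_stack([ones, x]))            # intercept + x
mse_quadratic = fit_mse(np.column_stack([ones, x, x**2]))   # intercept + x + x^2

print(f"linear: {mse_linear:.3f}, with x^2: {mse_quadratic:.3f}")
```

The model is still linear in its coefficients; only the inputs changed. Higher powers grow quickly, so scaling x first and keeping the degree low is usually the safer starting point.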
Binning (discretizing) converts a continuous feature into buckets: age 0-17, 18-34, 35-54, 55+. This mainly helps linear models, which cannot otherwise represent a step change, and it can encode domain knowledge (the relationship between age and insurance risk is not smooth; it has distinct thresholds). Tree models already choose their own split points, so they usually gain little from pre-binned features.
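The age buckets above can be produced with pd.cut. This is a sketch on made-up ages; right=False makes each bin include its lower edge and exclude its upper edge, so 18 lands in "18-34" rather than "0-17".

```python
import numpy as np
import pandas as pd

# Toy ages chosen to sit on the bucket boundaries.
df = pd.DataFrame({"age": [5, 17, 18, 34, 35, 54, 55, 80]})

bins = [0, 18, 35, 55, np.inf]              # edges: [0,18), [18,35), [35,55), [55,inf)
labels = ["0-17", "18-34", "35-54", "55+"]

df["age_bucket"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)
print(df)
```

The result is a categorical column, so it still needs one-hot encoding (or similar) before it goes into a linear model.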
Feature Importance: Finding What Actually Matters
After fitting any tree-based model, you can extract feature importance scores that indicate which features contributed most to the model's predictions.
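import lightgbm as lgb
import pandas as pd
# X_train is assumed to be a DataFrame, so its column names can label the scores
model = lgb.LGBMClassifier().fit(X_train, y_train)
importance = pd.Series(model.feature_importances_, index=X_train.columns)
# Plot the 20 highest-scoring features (plotting requires matplotlib)
importance.sort_values(ascending=False).head(20).plot(kind='barh')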
This serves two purposes: it tells you which features to invest more engineering effort on, and it tells you which features are near-useless (candidates for removal to reduce model complexity and training time).
Permutation importance (shuffle one feature at a time and measure the drop in performance, ideally on held-out data) is generally more reliable than the split-count or impurity-based importance that tree libraries report by default. Use sklearn.inspection.permutation_importance for a more honest estimate.
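A minimal sketch of permutation importance on synthetic data, where one feature fully determines the label and the other is pure noise. It uses scikit-learn's RandomForestClassifier instead of LightGBM only to keep the dependencies down; the feature names signal and noise are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on "signal" only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=1_000),
    "noise": rng.normal(size=1_000),
})
y = (X["signal"] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column of the validation set several times and record
# the average drop in the model's score (accuracy for a classifier).
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=X_val.columns)
perm = perm.sort_values(ascending=False)
print(perm)
```

Scoring on the validation split rather than the training split matters: a feature the model has merely memorized can look important on training data and contribute nothing on unseen rows.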
Feature Leakage: The Silent Killer
Feature leakage is including information in your training data that would not be available at prediction time in production. It is the most dangerous mistake in feature engineering because it produces models that appear to work perfectly during development and fail completely in production.
Common leakage examples: using the transaction timestamp to predict fraud when that timestamp is only assigned after the transaction is processed; using a customer's total lifetime value to predict whether they will churn (you would not know this value for a customer who has not churned yet); using future data to create time-series features.
Rule: for every feature, ask "would I have this information at the moment I need to make this prediction in production?" If not, drop the feature.
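For time-series features, the same rule translates into "only look backwards". The sketch below assumes a hypothetical task of predicting each day's sales before that day ends, so a rolling mean that includes the current row would leak the answer. Shifting by one row first keeps the window strictly in the past.

```python
import pandas as pd

# Toy daily sales, sorted by time (hypothetical data).
df = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=5, freq="D"),
    "sales": [10, 20, 30, 40, 50],
}).sort_values("day")

# Leaky under the assumption above: the 3-day window includes the current day's sales.
df["mean_3d_leaky"] = df["sales"].rolling(3).mean()

# Safer: shift by one row so the window covers only the three previous days.
df["mean_3d_past"] = df["sales"].shift(1).rolling(3).mean()

print(df)
```

The first three rows of mean_3d_past are NaN because there is not yet enough history, which is exactly what production would see for a brand-new series.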
The 80/20 Reality
Most practitioners find that simple, well-understood features -- properly cleaned, properly encoded, with obvious transformations applied -- account for 80% or more of achievable model performance. The last 20% comes from clever feature engineering, hyperparameter tuning, and ensemble methods.
Start with the basics: fix skewed distributions with log transforms, encode categoricals properly, handle missing values explicitly, add obvious derived features that encode domain knowledge. Only pursue exotic feature engineering after you have the basics right and established a baseline.
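One way to handle missing values explicitly, sketched on a hypothetical income column: record the fact that a value was missing in its own indicator column, then fill the gap with the training median. Median imputation is an assumption here, one option among several.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with gaps in "income".
train = pd.DataFrame({"income": [30_000.0, np.nan, 50_000.0, 70_000.0]})
test = pd.DataFrame({"income": [np.nan, 45_000.0]})

# The fill value is computed on training data only, then reused for test data.
median = train["income"].median()

for frame in (train, test):
    # Keep the "was missing" signal before overwriting the gaps.
    frame["income_missing"] = frame["income"].isna().astype(int)
    frame["income"] = frame["income"].fillna(median)

print(train)
print(test)
```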