Feature Engineering
Feature Engineering is the process of using domain knowledge and creativity to create new features (variables) from existing raw data, or to select and transform existing features, to improve the performance of machine learning models. Many practitioners consider it one of the skills that most clearly separates good data scientists from great ones.
Formal Definition
Feature Engineering is the act of extracting, transforming, and constructing new input variables (features) from raw data to better represent the underlying problem to predictive models, thereby improving model accuracy, interpretability, and generalization.
---
Why Feature Engineering Matters
- Andrew Ng (Stanford/Google): "Coming up with features is difficult, time-consuming, and requires expert knowledge. Applied machine learning is basically feature engineering."
- A simple model with great features often outperforms a complex model with poor features.
- It bridges the gap between raw data and the patterns that algorithms can learn.
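The second point can be illustrated with a small synthetic experiment. This is a sketch under assumed data, not a benchmark: the points, the circular class boundary, and the variable names are all invented for illustration. A plain logistic regression is scored once on the raw coordinates and once with a single engineered feature, the squared distance from the origin.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (assumption): points in a square, labelled 1 when they
# fall inside a circle centred on the origin.
X_raw = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = (X_raw[:, 0] ** 2 + X_raw[:, 1] ** 2 < 0.5).astype(int)

# Engineered feature: squared distance from the origin.
r2 = (X_raw ** 2).sum(axis=1, keepdims=True)
X_eng = np.hstack([X_raw, r2])

# Same simple model, two different feature sets, 5-fold cross-validation.
model = LogisticRegression(max_iter=1000)
acc_raw = cross_val_score(model, X_raw, y, cv=5).mean()
acc_eng = cross_val_score(model, X_eng, y, cv=5).mean()

print(f"raw coordinates only: {acc_raw:.3f}")
print(f"with squared radius:  {acc_eng:.3f}")
```

A linear model cannot draw a circular boundary from the raw coordinates, but the boundary becomes a simple threshold once the squared radius is available as a feature.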
---
Types of Feature Engineering
1. Feature Creation (Deriving New Features)
Creating entirely new columns from existing data using domain knowledge.
a) Date/Time Features
import pandas as pd
# df is assumed to be an existing DataFrame with a "signup_date" column
df["signup_date"] = pd.to_datetime(df["signup_date"])
# Extract useful features
df["signup_year"] = df["signup_date"].dt.year
df["signup_month"] = df["signup_date"].dt.month
df["signup_day_of_week"] = df["signup_date"].dt.dayofweek # 0=Mon, 6=Sun
df["is_weekend"] = df["signup_day_of_week"].isin([5, 6]).astype(int)
df["signup_quarter"] = df["signup_date"].dt.quarter
b) Mathematical Combinations
# BMI from height and weight
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)
# Ratio features (the +1 avoids division by zero when there are no dependents)
df["income_per_dependent"] = df["income"] / (df["dependents"] + 1)
# Interaction features
df["area"] = df["length"] * df["width"]
c) Text-Based Features
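# Treat missing reviews as empty strings so the features below do not fail on NaN
reviews = df["review"].fillna("").astype(str)
# Length of text in characters
df["review_length"] = reviews.str.len()
# Word count (whitespace-separated tokens)
df["word_count"] = reviews.str.split().str.len()
# Contains specific keyword (case-insensitive regular expression)
df["has_discount_mention"] = reviews.str.contains("discount|offer|sale", case=False).astype(int)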
d) Aggregation Features
# Customer-level aggregations
customer_agg = df.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    avg_order_value=("order_amount", "mean"),
    max_order_value=("order_amount", "max"),
    total_spent=("order_amount", "sum"),
).reset_index()
df = df.merge(customer_agg, on="customer_id", how="left")
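As a possible alternative to the aggregate-then-merge pattern above, `groupby(...).transform(...)` returns one value per original row, so no merge is needed. The sketch below uses a small hypothetical `orders` table whose column names mirror the ones above; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical order data (assumption); columns mirror the snippet above.
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "b"],
    "order_id": [1, 2, 3, 4, 5],
    "order_amount": [10.0, 30.0, 5.0, 15.0, 10.0],
})

# transform() broadcasts each group's aggregate back onto that group's rows,
# so the result lines up with the original index.
orders["total_orders"] = orders.groupby("customer_id")["order_id"].transform("count")
orders["avg_order_value"] = orders.groupby("customer_id")["order_amount"].transform("mean")
orders["total_spent"] = orders.groupby("customer_id")["order_amount"].transform("sum")

print(orders)
```

Either form keeps one row per order. The merge version is convenient when the aggregate table is also needed on its own.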
---
2. Feature Selection
Not all features are useful. A well-chosen subset can reduce overfitting, improve accuracy, and speed up training.
a) Correlation-Based Selection
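import numpy as np
# Remove features highly correlated with each other (multicollinearity)
# Correlation is only defined for numeric columns, so restrict to those first
corr_matrix = df.select_dtypes(include=[np.number]).corr().abs()
# Keep the strict upper triangle so each pair of columns is considered once
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# For each pair above the threshold, drop the later of the two columns
high_corr_cols = [col for col in upper_tri.columns if (upper_tri[col] > 0.90).any()]
df = df.drop(columns=high_corr_cols)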
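To see which column this approach removes, the same logic can be wrapped in a function and run on a toy frame. The helper name `drop_highly_correlated` and the toy columns are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(frame: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Strict upper triangle: each pair appears once, the diagonal is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)


rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
toy = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,           # same information in other units
    "age": rng.integers(18, 80, size=200),   # generated independently of height
})

reduced = drop_highly_correlated(toy)
print(list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is dropped, so column order influences the result.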
b) Variance Threshold
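import numpy as np
from sklearn.feature_selection import VarianceThreshold
numeric_df = df.select_dtypes(include=[np.number])
selector = VarianceThreshold(threshold=0.01).fit(numeric_df)  # threshold is scale-dependent
# Index by get_support() to keep column names; fit_transform returns a bare NumPy array
df_selected = numeric_df.loc[:, selector.get_support()]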
c) Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X_train, y_train)
selected_features = X_train.columns[rfe.support_]
print("Selected Features:", list(selected_features))
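The snippet above assumes `X_train` and `y_train` already exist. A self-contained variant on synthetic data might look like the following; the dataset from `make_classification`, the column names `f0` to `f7`, and the choice of three features are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Synthetic classification data (assumption): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RFE refits the model repeatedly, removing the weakest feature each round.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0), n_features_to_select=3)
rfe.fit(X_train, y_train)

selected = list(X_train.columns[rfe.support_])
# ranking_ is 1 for kept features; larger values were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=X_train.columns).sort_values()

print("Selected:", selected)
print(ranking)
```

Fitting RFE on the training split only keeps the held-out rows from influencing which features are kept.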
d) Feature Importance from Tree Models
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
importance = pd.Series(model.feature_importances_, index=X_train.columns)
importance.nlargest(15).plot(kind="barh")
plt.title("Top 15 Feature Importances")
plt.show()
---
3. Feature Transformation
Modifying existing features to improve model compatibility.
| Technique | Description | Example |
|---|---|---|
| Log Transform | Reduce skewness | np.log1p(df["income"]) |
| Polynomial Features | Create powers and interaction terms | age², age×income |
| Binning | Convert numeric to categories | Age → Child, Adult, Senior |
| Scaling | Normalize range | Min-Max or Standard scaling |
Polynomial Features Example:
from sklearn.preprocessing import PolynomialFeatures
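# degree=2 adds the squared terms (age², income²) and the interaction age×income.
# Pass interaction_only=True to keep only age×income and drop the squares.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["age", "income"]])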
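The other rows of the table can be sketched in the same way. The toy values, the bin edges, and the new column names below are assumptions chosen for illustration; real bin boundaries should come from the problem domain.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (assumption) with one heavily skewed income value.
df = pd.DataFrame({
    "age": [8, 25, 41, 67, 73],
    "income": [0, 32_000, 58_000, 1_200_000, 41_000],
})

# Log transform: log1p(x) = log(1 + x), so a zero income maps to 0 instead of -inf.
df["log_income"] = np.log1p(df["income"])

# Binning: intervals are (0, 17], (17, 64], (64, inf); edges are illustrative.
df["age_group"] = pd.cut(
    df["age"], bins=[0, 17, 64, np.inf], labels=["Child", "Adult", "Senior"]
)

# Scaling: Min-Max maps to [0, 1]; Standard gives zero mean and unit variance.
df["age_minmax"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()
df["age_standard"] = StandardScaler().fit_transform(df[["age"]]).ravel()

print(df)
```

In a real project the scalers would be fitted on the training split only and then applied to validation and test data, so that statistics from held-out rows do not leak into training.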
---
Feature Engineering Best Practices
| Practice | Description |
|---|---|
| Start with domain knowledge | Understand what features logically matter for the problem |
| Create features before selecting | Generate many candidates, then prune |
| Avoid data leakage | Never build features from the target or from information unavailable at prediction time; fit transformations on training data only |
| Test feature impact | Compare model performance with and without new features |
| Document your features | Keep a feature dictionary for reproducibility |
| Use cross-validation | Check that new features improve performance on held-out folds, not just on the training data |
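Several of these practices can be combined in one small experiment: score the model with and without a candidate feature, use cross-validation, and keep the scaler inside a pipeline so it is refitted on the training folds only. The sketch below relies on synthetic data; the `length`, `width`, and `area` columns, the noisy target, and the helper `cv_accuracy` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "length": rng.uniform(1, 10, size=n),
    "width": rng.uniform(1, 10, size=n),
})

# Synthetic target (assumption): driven by area plus noise.
y = (df["length"] * df["width"] + rng.normal(0, 3, size=n) > 25).astype(int)

# Candidate feature, computed from the inputs only (never from y).
df["area"] = df["length"] * df["width"]


def cv_accuracy(columns):
    """Mean 5-fold accuracy; the scaler is refitted inside every training fold."""
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(pipeline, df[columns], y, cv=5).mean()


baseline = cv_accuracy(["length", "width"])
with_area = cv_accuracy(["length", "width", "area"])

print(f"without area: {baseline:.3f}")
print(f"with area:    {with_area:.3f}")
```

Placing `StandardScaler` inside the pipeline matters: scaling the full dataset before splitting would let statistics from the validation folds influence training.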
---
Feature Engineering Workflow
Raw Data → Understand the Problem (Domain Knowledge)
→ Create New Features (Date, Text, Aggregation, Math)
→ Transform Features (Scaling, Encoding, Log)
→ Select Features (Correlation, RFE, Importance)
→ Validate (Cross-Validation, Model Comparison)
→ Iterate
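One way to express this workflow in code is a scikit-learn `Pipeline` whose steps mirror the stages above. This is a minimal sketch under assumptions: the synthetic `raw` table, the `add_features` helper, the step names, and the choice of `SelectKBest` with `k=2` are all illustrative, not a prescribed setup.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic raw data (assumption): a signup date and a skewed income.
raw = pd.DataFrame({
    "signup_date": pd.Timestamp("2023-01-01")
    + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
    "income": rng.lognormal(mean=10.5, sigma=0.8, size=n),
})

# Synthetic target (assumption): depends on log income and weekend signups.
weekend = raw["signup_date"].dt.dayofweek.isin([5, 6]).astype(int)
signal = (np.log1p(raw["income"]) - 10.5) / 0.8 + 1.5 * weekend + rng.normal(0, 0.5, size=n)
y = (signal > 0.4).astype(int)


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Create: derive date and log features, then drop the raw columns."""
    out = pd.DataFrame(index=frame.index)
    out["signup_month"] = frame["signup_date"].dt.month
    out["is_weekend"] = frame["signup_date"].dt.dayofweek.isin([5, 6]).astype(int)
    out["log_income"] = np.log1p(frame["income"])
    return out


pipeline = Pipeline([
    ("create", FunctionTransformer(add_features)),     # Create New Features
    ("transform", StandardScaler()),                    # Transform Features
    ("select", SelectKBest(score_func=f_classif, k=2)), # Select Features
    ("model", LogisticRegression(max_iter=1000)),
])

# Validate: every step is refitted on the training folds only.
scores = cross_val_score(pipeline, raw, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Keeping creation, transformation, and selection inside one pipeline means the "Iterate" step amounts to editing `add_features` or a step's parameters and rerunning the same validation.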
---
Summary
- Feature Engineering is the art and science of creating features that help models learn better.
- Feature creation uses domain knowledge to derive new variables (date parts, ratios, aggregations, text features).
- Feature selection removes irrelevant or redundant features (correlation, variance, RFE, tree importance).
- Feature transformation modifies features for better model compatibility (scaling, encoding, polynomial).
- Great feature engineering often matters more than algorithm selection.