Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Feature Engineering

Lesson 36 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Feature Engineering is the process of using domain knowledge and creativity to create new features (variables) from raw data, or to select and transform existing features, in order to improve the performance of machine learning models. It is often regarded as one of the skills that most clearly separates good data scientists from great ones.

Formal Definition

Feature Engineering is the act of extracting, transforming, and constructing new input variables (features) from raw data to better represent the underlying problem to predictive models, thereby improving model accuracy, interpretability, and generalization.

---

Why Feature Engineering Matters

  • Andrew Ng (Stanford/Google): "Coming up with features is difficult, time-consuming, and requires expert knowledge. Applied machine learning is basically feature engineering."
  • A simple model with great features often outperforms a complex model with poor features.
  • It bridges the gap between raw data and the patterns that algorithms can learn.

---

Types of Feature Engineering

1. Feature Creation (Deriving New Features)

Creating entirely new columns from existing data using domain knowledge.

a) Date/Time Features

import pandas as pd

# df is assumed to be an existing DataFrame with a "signup_date" column
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Extract useful features
df["signup_year"] = df["signup_date"].dt.year
df["signup_month"] = df["signup_date"].dt.month
df["signup_day_of_week"] = df["signup_date"].dt.dayofweek  # 0=Mon, 6=Sun
df["is_weekend"] = df["signup_day_of_week"].isin([5, 6]).astype(int)
df["signup_quarter"] = df["signup_date"].dt.quarter

b) Mathematical Combinations

# BMI from height and weight
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)

# Ratio features
df["income_per_dependent"] = df["income"] / (df["dependents"] + 1)

# Interaction features
df["area"] = df["length"] * df["width"]

c) Text-Based Features

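# Length of text in characters (missing reviews are treated as empty text)
df["review_length"] = df["review"].fillna("").str.len()

# Word count (split on whitespace, then count the pieces)
df["word_count"] = df["review"].fillna("").str.split().str.len()

# Contains specific keyword (na=False treats missing reviews as "no match")
df["has_discount_mention"] = df["review"].str.contains("discount|offer|sale", case=False, na=False).astype(int)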

d) Aggregation Features

# Customer-level aggregations
customer_agg = df.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    avg_order_value=("order_amount", "mean"),
    max_order_value=("order_amount", "max"),
    total_spent=("order_amount", "sum")
).reset_index()

df = df.merge(customer_agg, on="customer_id", how="left")
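
To see what the merge produces, here is a small self-contained sketch on a made-up orders table. The data values are hypothetical; the column names follow the snippet above.

```python
import pandas as pd

# Toy order data: customer A has 2 orders, customer B has 3
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": ["A", "A", "B", "B", "B"],
    "order_amount": [100.0, 300.0, 50.0, 150.0, 100.0],
})

# One row per customer with summary statistics
customer_agg = orders.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    avg_order_value=("order_amount", "mean"),
    total_spent=("order_amount", "sum"),
).reset_index()

# Attach the customer-level features back onto every order row
orders = orders.merge(customer_agg, on="customer_id", how="left")
print(orders)
```

Every order row now carries its customer's totals, so a model that sees a single order also sees how active that customer is overall.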

---

2. Feature Selection

Not all features are useful. Selecting the right features reduces overfitting, improves accuracy, and speeds up training.

a) Correlation-Based Selection

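import numpy as np

# Remove features highly correlated with each other (multicollinearity)
# Correlation is only defined for numeric columns
corr_matrix = df.select_dtypes(include="number").corr().abs()
# Keep only the upper triangle so each pair is checked once
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# From each highly correlated pair, drop the column that appears later
high_corr_cols = [col for col in upper_tri.columns if (upper_tri[col] > 0.90).any()]
df = df.drop(columns=high_corr_cols)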
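
As a quick check of the idea, the sketch below wraps the same logic in a helper and runs it on synthetic data. The helper name `drop_highly_correlated` and the toy columns are assumptions made for this example, not part of any library.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.90):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Boolean mask for the strict upper triangle (pairs above the diagonal)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

rng = np.random.default_rng(42)
height_cm = rng.normal(170, 10, 200)
toy = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,     # same information in different units
    "age": rng.integers(18, 65, 200),  # generated independently of height
})

reduced, dropped = drop_highly_correlated(toy)
print("Dropped:", dropped)
print("Remaining:", list(reduced.columns))
```

The height in inches is a rescaled copy of the height in centimetres, so it is flagged as redundant, while the independently generated age column is kept.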

b) Variance Threshold

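import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Drop near-constant numeric columns (variance below 0.01)
numeric_df = df.select_dtypes(include=[np.number])
selector = VarianceThreshold(threshold=0.01).fit(numeric_df)
# get_support() is a boolean mask, so the result keeps its column names
df_selected = numeric_df.loc[:, selector.get_support()]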

c) Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# X_train (a DataFrame) and y_train are assumed to come from an earlier train/test split
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X_train, y_train)

selected_features = X_train.columns[rfe.support_]
print("Selected Features:", list(selected_features))

d) Feature Importance from Tree Models

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

importance = pd.Series(model.feature_importances_, index=X_train.columns)
importance.nlargest(15).plot(kind="barh")
plt.title("Top 15 Feature Importances")
plt.show()

---

3. Feature Transformation

Modifying existing features to improve model compatibility.

Technique           | Description                            | Example
Log Transform       | Reduce right skew                      | np.log1p(df["income"])
Polynomial Features | Create squared and interaction terms   | age², age × income
Binning             | Convert numeric values to categories   | Age → Child, Adult, Senior
Scaling             | Bring features onto a comparable range | Min-Max or Standard scaling
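
The log transform, binning, and scaling rows can be sketched together on a tiny table. The data values and the age cut-offs (18 and 60) are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

people = pd.DataFrame({
    "age": [8, 25, 40, 70],
    "income": [0, 30_000, 60_000, 1_000_000],
})

# Log transform: log1p(x) = log(1 + x), so an income of 0 is handled safely
people["income_log"] = np.log1p(people["income"])

# Binning: right=False gives intervals [0, 18), [18, 60), [60, 120)
people["age_group"] = pd.cut(
    people["age"],
    bins=[0, 18, 60, 120],
    labels=["Child", "Adult", "Senior"],
    right=False,
)

# Min-Max scaling maps the column onto [0, 1]
people["age_minmax"] = MinMaxScaler().fit_transform(people[["age"]]).ravel()
# Standard scaling gives the column zero mean and unit variance
people["age_std"] = StandardScaler().fit_transform(people[["age"]]).ravel()

print(people)
```

In a real project the scalers would be fitted on the training split only and then applied to the test split, so that test data does not influence the transformation.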

Polynomial Features Example:

from sklearn.preprocessing import PolynomialFeatures

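# interaction_only=False (the default) keeps squared terms such as age² alongside age×income;
# set interaction_only=True to generate only the cross terms
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)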
X_poly = poly.fit_transform(df[["age", "income"]])

---

Feature Engineering Best Practices

Practice                         | Description
Start with domain knowledge      | Understand which features logically matter for the problem
Create features before selecting | Generate many candidates, then prune
Avoid data leakage               | Never build features from the target or from information that is unavailable at prediction time
Test feature impact              | Compare model performance with and without new features
Document your features           | Keep a feature dictionary for reproducibility
Use cross-validation             | Validate that features generalize rather than memorize the training data
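
One common way to combine the leakage and cross-validation practices is to place every fitted transformation inside a scikit-learn pipeline, so that it is re-fitted on the training part of each fold. The sketch below uses a synthetic dataset; the dataset parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data standing in for a real feature table
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# The scaler is part of the pipeline, so in every fold it is fitted on the
# training rows only and merely applied to the held-out rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, X, y, cv=5)
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

Scaling the full dataset before splitting would let statistics from the validation rows leak into training, which tends to make the scores look better than they really are.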

---

Feature Engineering Workflow

Raw Data → Understand the Problem (Domain Knowledge)
         → Create New Features (Date, Text, Aggregation, Math)
         → Transform Features (Scaling, Encoding, Log)
         → Select Features (Correlation, RFE, Importance)
         → Validate (Cross-Validation, Model Comparison)
         → Iterate
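
The workflow above can be sketched as a single scikit-learn pipeline. This is one possible arrangement under stated assumptions, not the only way to organize the steps: the data is synthetic, and the helper `create_features`, the column names, and the target formula are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def create_features(frame):
    """Create step: derive new columns from the raw ones."""
    out = frame.copy()
    out["area"] = out["length"] * out["width"]
    out["income_log"] = np.log1p(out["income"])
    return out

# Raw data (synthetic): two dimensions, a skewed income, and pure noise
rng = np.random.default_rng(1)
n = 300
raw = pd.DataFrame({
    "length": rng.uniform(5, 30, n),
    "width": rng.uniform(5, 30, n),
    "income": rng.lognormal(10, 1, n),
    "noise": rng.normal(size=n),
})
# Hypothetical target that depends on area
target = 100 * raw["length"] * raw["width"] + rng.normal(0, 2000, n)

workflow = Pipeline([
    ("create", FunctionTransformer(create_features)),            # Create
    ("scale", StandardScaler()),                                 # Transform
    ("select", RFE(LinearRegression(), n_features_to_select=3)), # Select
    ("model", LinearRegression()),
])

# Validate: every step is re-fitted inside each cross-validation fold
scores = cross_val_score(workflow, raw, target, cv=5, scoring="r2")
print("Mean R^2:", scores.mean().round(3))
```

Iterating then means editing `create_features` or the selection step and comparing the cross-validated score against the previous run.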

---

Summary

  • Feature Engineering is the art and science of creating features that help models learn better.
  • Feature creation uses domain knowledge to derive new variables (date parts, ratios, aggregations, text features).
  • Feature selection removes irrelevant or redundant features (correlation, variance, RFE, tree importance).
  • Feature transformation modifies features for better model compatibility (log transforms, binning, scaling, polynomial terms).
  • Great feature engineering often matters more than algorithm selection.