Standardization (Feature Scaling)
1. Why Scale Data?
Variables often have different units and magnitudes. Scaling puts all features on a level playing field.
Study Deep: When Scaling is MANDATORY
If you don't scale your data, distance-based and gradient-based models will be biased toward variables with larger numeric ranges.
- K-Means / KNN: These use "distance." If Salary is in the thousands and Age is < 100, a change of ₹1 in Salary counts exactly as much as a change of 1 year in Age, so Salary differences in the thousands swamp any difference in Age.
- Gradient Descent: Neural networks and Logistic Regression converge much faster, and in some cases only converge at all, when data is scaled to small, similar ranges.
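To make the distance point concrete, here is a minimal sketch using only the Python standard library. The three people, the feature ranges (Age 0 to 100, Salary 10,000 to 1,000,000), and the helper names `min_max` and `scale` are all assumptions made up for this illustration.

```python
import math

# Hypothetical people described as (age in years, salary in rupees).
a = (25, 50_000)
b = (60, 50_500)   # 35 years older than a, almost the same salary
c = (26, 58_000)   # almost the same age as a, salary 8,000 higher

# Raw Euclidean distance: the salary axis swamps the age axis.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

def min_max(value, lo, hi):
    """Rescale value from the interval [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

def scale(person):
    # Assumed feature ranges: age 0-100, salary 10,000-1,000,000.
    age, salary = person
    return (min_max(age, 0, 100), min_max(salary, 10_000, 1_000_000))

scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

print(raw_ab < raw_ac)        # True: unscaled, b looks closer to a than c does
print(scaled_ab > scaled_ac)  # True: after scaling, c is the closer neighbour
```

Before scaling, the 35-year age gap barely registers next to an 8,000 salary gap. After scaling, both features contribute on comparable terms and the nearest neighbour flips.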
Formal Definition: Feature scaling is the process of transforming numerical variables to a common scale without distorting the relative differences between values within each variable. It is a critical preprocessing step for distance-based and gradient-based algorithms.
Variables often have different units and magnitudes:
- Example: Age (0–100) and Salary (10,000–1,000,000).
Machine learning algorithms that rely on "distance" calculations (K-Means, KNN, SVM), on variance (PCA), or on gradient-based training (Neural Networks) will be dominated by the variable with larger magnitude. Scaling puts all features on a level playing field.
Algorithms Affected vs. Not Affected:
| Requires Scaling | Does NOT Require Scaling |
|---|---|
| K-Means, KNN, SVM, PCA | Decision Trees, Random Forest |
| Neural Networks, Logistic Regression, Linear Regression (when regularized or trained by gradient descent) | Gradient Boosting (XGBoost, LightGBM) |
| Distance-based and gradient-based algorithms | Tree-based and rule-based algorithms |
2. Min-Max Normalization
Rescales data to a fixed range, usually [0, 1].
Formula: X_new = (X - X_min) / (X_max - X_min)
Worked Example: Data: [10, 20, 30, 40, 50]
- Min = 10, Max = 50, Range = 40
- For 10: (10 - 10) / 40 = 0.00
- For 20: (20 - 10) / 40 = 0.25
- For 30: (30 - 10) / 40 = 0.50
- For 40: (40 - 10) / 40 = 0.75
- For 50: (50 - 10) / 40 = 1.00
- Result: [0.00, 0.25, 0.50, 0.75, 1.00]
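The worked example can be reproduced with a few lines of plain Python. This is a minimal sketch: the function name `min_max_scale` is hypothetical, and mapping a constant feature to 0.0 is a choice made here to avoid dividing by zero, not something the formula prescribes.

```python
def min_max_scale(values):
    """Apply X_new = (X - X_min) / (X_max - X_min) to a list of numbers."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the range is zero, so map everything to 0.0
        # (an assumption for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 40, 50])
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

In practice, X_min and X_max should come from the training data only and then be reused on new data, which is how scikit-learn's `MinMaxScaler` behaves once fitted.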
3. Z-Score Standardization
Rescales data to have a Mean (μ) of 0 and Standard Deviation (σ) of 1.
Formula: X_new = (X - μ) / σ
Worked Example: Data: [2, 4, 6, 8, 10]
- Mean (μ) = 6
- Std Dev (σ) ≈ 2.83 (population formula, dividing by N = 5; the sample formula, dividing by N - 1, would give ≈ 3.16)
- For 2: (2 - 6) / 2.83 = -1.41
- For 6: (6 - 6) / 2.83 = 0.00 (mean becomes 0)
- For 10: (10 - 6) / 2.83 = +1.41
- Result: [-1.41, -0.71, 0.00, +0.71, +1.41]
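The same calculation can be sketched with Python's `statistics` module. The function name `z_score_scale` is hypothetical. The sketch uses the population standard deviation (`pstdev`), which is what yields σ ≈ 2.83 for this data.

```python
import statistics

def z_score_scale(values):
    """Apply X_new = (X - mu) / sigma using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # divides by N, not N - 1
    if sigma == 0:
        # Constant feature: every value equals the mean (assumption for this sketch).
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

scaled = z_score_scale([2, 4, 6, 8, 10])
print([round(z, 2) for z in scaled])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```

Swapping `pstdev` for `stdev` would divide by N - 1 and give slightly smaller magnitudes, so it is worth checking which convention a given library uses before comparing results.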
4. Robust Scaling
Uses the Median and IQR instead of Mean/Std Dev. This makes it highly resistant to outliers.
Formula: X_new = (X - Median) / IQR
When to Use: When your data has significant outliers that would distort Min-Max or Z-Score.
Worked Example: Data: [2, 4, 6, 8, 100] (100 is an outlier)
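- Median = 6, Q1 = 3, Q3 = 54, IQR = 51 (quartiles taken as the medians of the lower half [2, 4] and the upper half [8, 100])
- For 2: (2 - 6) / 51 = -0.08
- For 100: (100 - 6) / 51 = 1.84 (outlier doesn't dominate)
- Note: with only five points, this quartile method lets the outlier pull Q3 up to 54. The linear-interpolation convention (NumPy's default percentile method) gives Q1 = 4, Q3 = 8, IQR = 4, so the result becomes [-1.0, -0.5, 0.0, 0.5, 23.5]: the four typical values keep a usable spread and the outlier stays clearly separated.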
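Quartiles can be computed under more than one convention, and on a five-point sample the choice changes the result noticeably. The sketch below uses `statistics.quantiles` from the standard library. Its "exclusive" method reproduces Q1 = 3 and Q3 = 54 for this data, while its "inclusive" method gives Q1 = 4 and Q3 = 8. The function name `robust_scale` is hypothetical.

```python
import statistics

def robust_scale(values, method="exclusive"):
    """Apply X_new = (X - median) / IQR with the chosen quartile convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method=method)
    med = statistics.median(values)
    iqr = q3 - q1
    return [(v - med) / iqr for v in values]

data = [2, 4, 6, 8, 100]

excl = robust_scale(data, "exclusive")  # Q1 = 3, Q3 = 54, IQR = 51
incl = robust_scale(data, "inclusive")  # Q1 = 4, Q3 = 8,  IQR = 4

print([round(x, 2) for x in excl])  # [-0.08, -0.04, 0.0, 0.04, 1.84]
print(incl)                         # [-1.0, -0.5, 0.0, 0.5, 23.5]
```

Under the inclusive convention the four typical values span -1.0 to 0.5 and the outlier sits far away at 23.5, which is the behaviour usually meant by "robust". On larger samples the two conventions give much closer answers.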
5. Comprehensive Comparison
| Feature | Min-Max Normalization | Z-Score Standardization | Robust Scaling |
|---|---|---|---|
| Output Range | Fixed [0, 1] | Unbounded (centered at 0) | Unbounded (centered at 0) |
| Center Metric | Uses Min (subtracted as the offset) | Uses Mean (μ) | Uses Median |
| Spread Metric | Uses Range (Max-Min) | Uses Std Dev (σ) | Uses IQR (Q3-Q1) |
| Outlier Sensitivity | Very High — one extreme value compresses all others | Moderate — outliers shift mean and inflate σ | Low — Median and IQR are robust |
| Best For | Image Processing, Neural Networks, algorithms needing bounded input | Clustering (K-Means), PCA, Regression, when data is roughly normal | Data with many outliers, skewed distributions |
| Preserves Shape? | Yes (linear transform) | Yes (linear transform) | Yes (linear transform) |
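The outlier-sensitivity row can be checked numerically. The sketch below applies all three methods to the data [2, 4, 6, 8, 100] and measures how much of the output scale is left for the four non-outlier values. The helper names, the `inlier_spread` measure, and the choice of the "inclusive" quartile convention are assumptions made for this illustration.

```python
import statistics

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mu, sigma = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def robust(values):
    # "inclusive" quartiles: Q1 = 4, Q3 = 8 for the data below.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    med = statistics.median(values)
    return [(v - med) / (q3 - q1) for v in values]

def inlier_spread(scaled):
    """Distance between the smallest and largest of the first four scaled values."""
    inliers = scaled[:4]
    return max(inliers) - min(inliers)

data = [2, 4, 6, 8, 100]  # the last value is the outlier

spreads = {
    "min_max": inlier_spread(min_max(data)),
    "z_score": inlier_spread(z_score(data)),
    "robust": inlier_spread(robust(data)),
}
for name, spread in spreads.items():
    print(name, round(spread, 2))
# min_max 0.06
# z_score 0.16
# robust 1.5
```

With Min-Max the four typical values are squeezed into about 6% of the [0, 1] range, Z-Score leaves them a little more room, and Robust Scaling keeps them well separated, which matches the ordering in the table.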
6. When to Use Which? (Decision Guide)
| Situation | Recommended Method |
|---|---|
| Data has no outliers and you need a fixed range (0–1) | Min-Max Normalization |
| Data is roughly normal (bell-shaped) | Z-Score Standardization |
| Data has significant outliers or heavy skew | Robust Scaling |
| Using Neural Networks or image data | Min-Max Normalization |
| Using PCA, K-Means, or Linear Regression | Z-Score Standardization |
| Not sure? | Start with Z-Score — it's the safest default |