Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

1.7 Standardization

Lesson 7 of 32 in the free Data Visualisation and Analytics notes on Siksha Sarovar, written by Rohit Jangra.

Standardization (Feature Scaling)

1. Why Scale Data?

Variables often have different units and magnitudes. Scaling puts all features on a level playing field.

Study Deep: When Scaling is MANDATORY

If you don't scale your data, scale-sensitive algorithms will be biased toward the variables with the largest numeric values.

  • K-Means / KNN: These use "distance." Raw Euclidean distance counts a difference of ₹1 in Salary the same as a difference of 1 year in Age. Because salaries differ by thousands while ages differ by less than 100, the Salary term swamps Age and the algorithm effectively ignores it.
  • Gradient Descent: Neural networks and Logistic Regression converge much faster, and sometimes only converge at all, when features are scaled to small, similar ranges.

Formal Definition: Feature scaling is the process of transforming numerical variables to a common scale without distorting differences in the ranges of values. It is a critical preprocessing step for distance-based algorithms.

Variables often have different units and magnitudes:

  • Example: Age (0–100) and Salary (10,000–1,000,000).

Machine learning algorithms that rely on distance calculations (K-Means, KNN, SVM), on variance (PCA), or on gradient-based training (Neural Networks) will be dominated by the variable with the larger magnitude. Scaling puts all features on a level playing field.
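The effect of magnitude on distance can be checked with a few lines of Python. This is a minimal sketch: the three records are hypothetical, and the feature ranges are assumed to be the Age and Salary ranges from the example above.

```python
import math

# Hypothetical records: (age in years, salary in rupees).
a = (25, 50_000)
b = (60, 51_000)  # very different age, almost the same salary
c = (26, 60_000)  # almost the same age, different salary

# Assumed feature ranges: Age 0-100, Salary 10,000-1,000,000.
MINS = (0, 10_000)
MAXS = (100, 1_000_000)

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def min_max_scale(point):
    """Rescale each feature of a record to [0, 1] using the assumed ranges."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, MINS, MAXS))

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
scaled_ab = euclidean(min_max_scale(a), min_max_scale(b))
scaled_ac = euclidean(min_max_scale(a), min_max_scale(c))

print(f"raw:    d(a, b) = {raw_ab:.1f}, d(a, c) = {raw_ac:.1f}")
print(f"scaled: d(a, b) = {scaled_ab:.3f}, d(a, c) = {scaled_ac:.3f}")
```

On the raw values, b looks like the nearest neighbour of a because only Salary matters. After scaling, the 35-year age gap counts and c becomes the nearest neighbour.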

Algorithms Affected vs. Not Affected:

| Requires Scaling | Does NOT Require Scaling |
|---|---|
| K-Means, KNN, SVM, PCA | Decision Trees, Random Forest |
| Neural Networks, Logistic Regression, Linear Regression | Gradient Boosting (XGBoost, LightGBM) |
| Distance-based algorithms | Tree-based and rule-based algorithms |

2. Min-Max Normalization

Rescales data to a fixed range, usually [0, 1].

Formula: X_new = (X - X_min) / (X_max - X_min)

Worked Example: Data: [10, 20, 30, 40, 50]

  • Min = 10, Max = 50, Range = 40
  • For 10: (10 - 10) / 40 = 0.00
  • For 20: (20 - 10) / 40 = 0.25
  • For 30: (30 - 10) / 40 = 0.50
  • For 40: (40 - 10) / 40 = 0.75
  • For 50: (50 - 10) / 40 = 1.00
  • Result: [0.00, 0.25, 0.50, 0.75, 1.00]
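The same calculation can be written as a small Python function. This is a sketch using only the standard library; the function name `min_max_normalize` is chosen here for illustration, and the guard for constant data is an added assumption, since the formula divides by zero when Max equals Min.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] with X_new = (X - X_min) / (X_max - X_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: treat constant data as an error rather than guess a value.
        raise ValueError("all values are equal, so the range is zero")
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_normalize([10, 20, 30, 40, 50])
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```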

3. Z-Score Standardization

Rescales data to have a Mean (μ) of 0 and Standard Deviation (σ) of 1.

Formula: X_new = (X - μ) / σ

Worked Example: Data: [2, 4, 6, 8, 10]

  • Mean (μ) = 6
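  • Std Dev (σ) = √8 ≈ 2.83 (population formula, dividing by n = 5)
  • For 2: (2 - 6) / 2.83 = -1.41
  • For 4: (4 - 6) / 2.83 = -0.71
  • For 6: (6 - 6) / 2.83 = 0.00 (mean becomes 0)
  • For 8: (8 - 6) / 2.83 = +0.71
  • For 10: (10 - 6) / 2.83 = +1.41
  • Result: [-1.41, -0.71, 0.00, +0.71, +1.41]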
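A short Python sketch of the same steps, using the standard `statistics` module. It assumes the population standard deviation (`pstdev`), which matches σ ≈ 2.83 in the worked example; the sample standard deviation (`stdev`) would give about 3.16 and slightly smaller scores. The function name `z_score_standardize` is illustrative.

```python
import statistics

def z_score_standardize(values):
    """Rescale values with X_new = (X - mean) / population std dev."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population formula: divides by n
    if sigma == 0:
        # Assumption: constant data cannot be standardized, so raise.
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]

z = z_score_standardize([2, 4, 6, 8, 10])
print([round(v, 2) for v in z])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```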

4. Robust Scaling

Uses the Median and IQR instead of Mean/Std Dev. This makes it highly resistant to outliers.

Formula: X_new = (X - Median) / IQR

When to Use: When your data has significant outliers that would distort Min-Max or Z-Score.

Worked Example: Data: [2, 4, 6, 8, 100] (100 is an outlier)

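  • Median = 6
  • Quartiles (median-of-halves method, leaving the median out of both halves): Q1 = (2 + 4) / 2 = 3, Q3 = (8 + 100) / 2 = 54, so IQR = 54 - 3 = 51
  • For 2: (2 - 6) / 51 = -0.08
  • For 100: (100 - 6) / 51 = 1.84 (outlier doesn't dominate)
  • Note: with only five values the quartiles depend on the convention used, so other quartile methods give a different IQR for this data.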
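The worked example can be reproduced in Python. This is a sketch under one explicit assumption: quartiles are computed as the medians of the lower and upper halves, leaving the median out when the count is odd, because that convention yields Q1 = 3 and Q3 = 54. Libraries that interpolate percentiles differently will report a different IQR for the same five values. The helper names are illustrative.

```python
import statistics

def quartiles_median_of_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    The overall median is excluded from both halves when the count is odd.
    """
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return statistics.median(lower), statistics.median(upper)

def robust_scale(values):
    """Rescale values with X_new = (X - Median) / IQR."""
    med = statistics.median(values)
    q1, q3 = quartiles_median_of_halves(values)
    iqr = q3 - q1
    if iqr == 0:
        # Assumption: a zero IQR is treated as an error.
        raise ValueError("IQR is zero")
    return [(v - med) / iqr for v in values]

robust = robust_scale([2, 4, 6, 8, 100])
print([round(v, 2) for v in robust])  # [-0.08, -0.04, 0.0, 0.04, 1.84]
```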

5. Comprehensive Comparison

| Feature | Min-Max Normalization | Z-Score Standardization | Robust Scaling |
|---|---|---|---|
| Output Range | Fixed [0, 1] | Unbounded (centered at 0) | Unbounded (centered at 0) |
| Center Metric | Uses Min/Max | Uses Mean (μ) | Uses Median |
| Spread Metric | Uses Range (Max-Min) | Uses Std Dev (σ) | Uses IQR (Q3-Q1) |
| Outlier Sensitivity | Very High: one extreme value compresses all others | Moderate: outliers shift mean and inflate σ | Low: Median and IQR are robust |
| Best For | Image Processing, Neural Networks, algorithms needing bounded input | Clustering (K-Means), PCA, Regression, when data is roughly normal | Data with many outliers, skewed distributions |
| Preserves Shape? | Yes (linear transform) | Yes (linear transform) | Yes (linear transform) |
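The Outlier Sensitivity row can be illustrated by measuring how much each spread metric grows when one extreme value is added. This is a rough sketch on hypothetical data (the integers 1 to 20, then the same values plus 1000); the IQR again uses the median-of-halves convention, which is an assumption rather than the only valid choice.

```python
import statistics

def value_range(values):
    """Spread metric used by Min-Max Normalization."""
    return max(values) - min(values)

def iqr_median_of_halves(values):
    """Spread metric used by Robust Scaling (median-of-halves quartiles)."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return statistics.median(upper) - statistics.median(lower)

base = list(range(1, 21))     # hypothetical clean data: 1, 2, ..., 20
with_outlier = base + [1000]  # the same data plus one extreme value

spread_metrics = {
    "Range (Min-Max)": value_range,
    "Std Dev (Z-Score)": statistics.pstdev,  # population standard deviation
    "IQR (Robust)": iqr_median_of_halves,
}

# Ratio of each metric with the outlier to the same metric without it.
inflation = {}
for name, metric in spread_metrics.items():
    inflation[name] = metric(with_outlier) / metric(base)
    print(f"{name}: grows {inflation[name]:.1f}x")
```

On this data the range grows roughly 50-fold and the standard deviation more than 30-fold, while the IQR grows by about 10%. A larger scale factor means the ordinary values are squeezed into a narrower band after scaling.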

6. When to Use Which? (Decision Guide)

| Situation | Recommended Method |
|---|---|
| Data has no outliers and you need a fixed range (0–1) | Min-Max Normalization |
| Data is roughly normal (bell-shaped) | Z-Score Standardization |
| Data has significant outliers or heavy skew | Robust Scaling |
| Using Neural Networks or image data | Min-Max Normalization |
| Using PCA, K-Means, or Linear Regression | Z-Score Standardization |
| Not sure? | Start with Z-Score: it is a reasonable default for most scale-sensitive algorithms |