Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Data Transformation

Lesson 35 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Data Transformation is the process of converting data from one format, structure, or value range into another. It is a critical preprocessing step that ensures data is in the right shape and scale for effective analysis and machine learning.

Formal Definition

Data Transformation refers to the application of mathematical, statistical, or structural operations to data to make it more suitable for analysis. This includes scaling numerical features, encoding categorical variables, normalizing distributions, and reshaping data structures.

---

Why Data Transformation is Necessary

  • Algorithms are sensitive to scale: Distance-based and margin-based algorithms (KNN, K-Means, SVM) are heavily affected by features on different scales. A feature measured in thousands (salary) will dominate a feature measured in tens (age).
  • Non-normal distributions: Many statistical tests and some models (for example, linear regression, which assumes normally distributed residuals) work best with approximately normal data. Transformations can bring skewed data closer to that shape.
  • Categorical data must be encoded: Machine learning algorithms work with numbers, not text labels like "Male" or "Female."
  • Reducing skewness: Highly skewed data can distort model performance.
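
To make the scale-sensitivity point concrete, here is a minimal sketch that compares Euclidean distances before and after rescaling. The four people and their values are made up for illustration, and the helper name euclidean is hypothetical.

```python
import numpy as np

# Hypothetical data: columns are [age in years, salary in rupees].
people = np.array([
    [25.0, 50_000.0],   # A
    [60.0, 50_500.0],   # B: very different age, almost the same salary as A
    [26.0, 58_000.0],   # C: almost the same age as A, different salary
    [40.0, 90_000.0],   # D: widens the salary range
])

def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# On raw values the salary column dominates the distance.
raw_ab = euclidean(people[0], people[1])
raw_ac = euclidean(people[0], people[2])

# Rescale every column to [0, 1] so both features contribute comparably.
col_min = people.min(axis=0)
col_max = people.max(axis=0)
scaled = (people - col_min) / (col_max - col_min)

scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {scaled_ab:.3f}, A-C = {scaled_ac:.3f}")
```

On the raw values B looks like A's nearest neighbour even though B is 35 years older, because the salary gap swamps the age gap. After rescaling, C becomes the nearer neighbour.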

---

Types of Data Transformation

1. Scaling (Feature Scaling)

Scaling brings numerical features to a common range so that no single feature dominates.

a) Min-Max Normalization (Rescaling to 0–1)

Formula: X_norm = (X - X_min) / (X_max - X_min)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[["age", "salary"]] = scaler.fit_transform(df[["age", "salary"]])
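
As a quick check of the formula, the sketch below applies it by hand and compares the result with MinMaxScaler. The small DataFrame is made-up sample data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up sample data for illustration.
df = pd.DataFrame({
    "age": [20, 30, 40, 60],
    "salary": [30_000, 45_000, 60_000, 90_000],
})

# Apply X_norm = (X - X_min) / (X_max - X_min) column by column.
manual = (df - df.min()) / (df.max() - df.min())

# The same transformation using scikit-learn.
sk = MinMaxScaler().fit_transform(df)

print(manual)
print(np.allclose(manual.to_numpy(), sk))
```

Each column's minimum maps to 0 and its maximum to 1, with the other values spaced proportionally in between.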

b) Standardization (Z-Score Scaling)

Formula: X_std = (X - μ) / σ

Transforms data to have mean = 0 and standard deviation = 1.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[["age", "salary"]] = scaler.fit_transform(df[["age", "salary"]])
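
The sketch below reproduces the z-score formula by hand on made-up ages. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample standard deviation (ddof=1), so the manual version passes ddof=0 explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample data for illustration.
df = pd.DataFrame({"age": [20, 30, 40, 60]})

# X_std = (X - mean) / sigma, using the population standard deviation.
manual = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)

# The same transformation using scikit-learn.
sk = StandardScaler().fit_transform(df[["age"]]).ravel()

print(manual.to_numpy())
print(sk)
print("mean:", sk.mean(), "std:", sk.std())
```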

c) Robust Scaling

Uses median and IQR instead of mean and standard deviation. Resistant to outliers.

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df[["age", "salary"]] = scaler.fit_transform(df[["age", "salary"]])
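
To illustrate the resistance to outliers, this sketch scales a made-up salary column containing one extreme value with both RobustScaler and StandardScaler. It assumes RobustScaler's default quantile range of the 25th to 75th percentile.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up salaries; the last value is an extreme outlier.
salary = np.array([[30_000.0], [32_000.0], [35_000.0], [38_000.0], [1_000_000.0]])

robust = RobustScaler().fit_transform(salary).ravel()
standard = StandardScaler().fit_transform(salary).ravel()

# Manual robust scaling: (X - median) / IQR.
median = np.median(salary)
q1, q3 = np.percentile(salary, [25, 75])
manual = ((salary - median) / (q3 - q1)).ravel()

print("robust:  ", np.round(robust, 3))
print("standard:", np.round(standard, 3))
```

With standardization the outlier inflates the mean and standard deviation, so the four typical salaries end up squeezed into a very narrow band. Robust scaling keeps them spread out around 0.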

Comparison of Scaling Methods

| Method | Formula | When to Use |
|---|---|---|
| Min-Max | (X - min) / (max - min) | When you need values in a fixed range (0–1) |
| Standard | (X - μ) / σ | When data is normally distributed |
| Robust | (X - median) / IQR | When data has significant outliers |

---

2. Encoding Categorical Variables

Machine learning models require numerical inputs. Categorical variables must be converted to numbers.

a) Label Encoding

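Assigns an integer to each category. scikit-learn's LabelEncoder numbers the categories in sorted (alphabetical) order and is intended mainly for target labels, so the codes reflect a natural order only when that order happens to be alphabetical. For ordinal features, prefer Ordinal Encoding (section c).

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["education"] = le.fit_transform(df["education"])
# Codes follow sorted order: Bachelor's → 0, High School → 1, Master's → 2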
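
The following sketch shows how LabelEncoder actually assigns its codes. It sorts the class labels, so the integers follow alphabetical order rather than any educational ranking. The sample values are made up.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values for illustration.
levels = ["High School", "Bachelor's", "Master's", "Bachelor's"]

le = LabelEncoder()
codes = le.fit_transform(levels)

# classes_ holds the sorted labels; a label's position is its integer code.
print(list(le.classes_))   # ["Bachelor's", 'High School', "Master's"]
print(codes.tolist())      # [1, 0, 2, 0]
```

Because "Bachelor's" sorts before "High School", it receives code 0, which does not match the natural order of the education levels.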

b) One-Hot Encoding

Creates a binary column for each category. Suitable for nominal data (no inherent order).

import pandas as pd

df_encoded = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=int)
# drop_first=True omits the first category (city_Delhi): creates city_Kolkata, city_Mumbai (0/1); Delhi rows are all zeros
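
To see what drop_first changes, the sketch below encodes a made-up city column with and without it. Passing dtype=int is an illustrative choice so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Kolkata", "Delhi"]})

# One column per category.
full = pd.get_dummies(df, columns=["city"], dtype=int)

# drop_first=True removes the first category in sorted order (Delhi here).
reduced = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Dropping one column loses no information: a row with zeros in every remaining column must belong to the dropped category. This avoids perfectly correlated columns in linear models.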

c) Ordinal Encoding

Explicitly maps categories to ordered integers.

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df[["priority"]] = oe.fit_transform(df[["priority"]])
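
A small self-contained run of the same idea on made-up priorities shows the resulting codes and how to map them back. Note that OrdinalEncoder returns floats by default.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample data for illustration.
df = pd.DataFrame({"priority": ["Medium", "Low", "High", "Low"]})

# The list order defines the codes: Low → 0, Medium → 1, High → 2.
oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = oe.fit_transform(df[["priority"]]).ravel()

# inverse_transform recovers the original labels from the codes.
decoded = oe.inverse_transform(encoded.reshape(-1, 1)).ravel()

print(encoded.tolist())
print(decoded.tolist())
```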

Encoding Comparison

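| Method | Type of Data | Preserves Order? | Creates Extra Columns? |
|---|---|---|---|
| Label Encoding | Target labels (ordinal only if alphabetical order matches) | Only by coincidence (sorted order) | No |
| One-Hot Encoding | Nominal | No (binary flags) | Yes |
| Ordinal Encoding | Ordinal | Yes (explicit) | No |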

---

3. Mathematical Transformations

Used to reduce skewness, stabilize variance, or normalize distributions.

a) Log Transformation

import numpy as np

# Reduces right-skewness
df["income_log"] = np.log1p(df["income"])  # log(1 + x) to handle zeros
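
The sketch below measures the effect on a synthetic right-skewed sample. The log-normal "income" data and the random seed are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed incomes (log-normal), seeded for reproducibility.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

income_log = np.log1p(income)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = income.skew()
skew_after = income_log.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# np.expm1 is the exact inverse of np.log1p, so the original values can be recovered.
restored = np.expm1(income_log)
print(np.allclose(restored, income))
```

Keeping the inverse (np.expm1) in mind matters when a model predicts the log-transformed value and the result has to be reported in the original units.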

b) Square Root Transformation

df["count_sqrt"] = np.sqrt(df["count"])

c) Box-Cox Transformation

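Estimates the power parameter λ that makes the data as close to normally distributed as possible. It requires strictly positive values, so the example shifts income by 1 to allow zeros (this assumes income is never negative).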

from scipy.stats import boxcox

df["income_bc"], lambda_val = boxcox(df["income"] + 1)
print(f"Optimal lambda: {lambda_val}")
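
For reference, the Box-Cox transform is (x^λ - 1) / λ when λ ≠ 0 and log(x) when λ = 0. The sketch below checks both cases with a fixed λ, then lets SciPy estimate λ and inverts the result with scipy.special.inv_boxcox. The sample values are made up.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Made-up strictly positive sample values.
x = np.array([1.0, 2.0, 5.0, 10.0, 50.0])

# With a fixed lambda, boxcox returns only the transformed array.
as_log = boxcox(x, lmbda=0)       # lambda = 0 is the natural log
as_sqrt = boxcox(x, lmbda=0.5)    # lambda = 0.5 is a shifted, scaled square root

# Without lmbda, SciPy estimates lambda by maximum likelihood.
transformed, lam = boxcox(x)

# inv_boxcox undoes the transformation for a given lambda.
restored = inv_boxcox(transformed, lam)

print(f"estimated lambda: {lam:.3f}")
print(np.round(restored, 6))
```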

d) Yeo-Johnson Transformation

Similar to Box-Cox but works with zero and negative values.

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method="yeo-johnson")
df[["income"]] = pt.fit_transform(df[["income"]])
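
The sketch below runs both methods on made-up data containing negative values and zero. Yeo-Johnson succeeds, while scikit-learn's Box-Cox option rejects the input. It relies on PowerTransformer's default standardize=True, which also rescales the output to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up values with mixed signs, including zero.
x = np.array([[-5.0], [-1.0], [0.0], [2.0], [10.0], [100.0]])

yj = PowerTransformer(method="yeo-johnson")
x_yj = yj.fit_transform(x)
print("fitted lambda:", yj.lambdas_[0])
print(np.round(x_yj.ravel(), 3))

# Box-Cox needs strictly positive input, so fitting it on this data fails.
try:
    PowerTransformer(method="box-cox").fit(x)
    box_cox_failed = False
except ValueError as err:
    box_cox_failed = True
    print("Box-Cox error:", err)
```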

Transformation Comparison

| Transformation | Handles Zeros? | Handles Negatives? | Best For |
|---|---|---|---|
| Log | Yes, with log1p | No | Right-skewed data |
| Square Root | Yes | No | Moderate skewness, count data |
| Box-Cox | No (shift by +1) | No | Positive data, finding optimal power |
| Yeo-Johnson | Yes | Yes | Data with mixed signs |

---

4. Binning (Discretization)

Converts continuous variables into categorical bins.

# Custom bin edges (right-inclusive intervals: (0, 18], (18, 35], ...)
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 65, 100],
                          labels=["Child", "Young Adult", "Adult", "Senior", "Elderly"])

# Quantile bins (equal frequency)
df["income_quartile"] = pd.qcut(df["income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
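
The sketch below contrasts the three binning styles on made-up data: custom edges with pd.cut, true equal-width bins with pd.cut(bins=4), and equal-frequency bins with pd.qcut.

```python
import pandas as pd

# Made-up sample data for illustration.
ages = pd.Series([5, 18, 19, 35, 36, 70])
income = pd.Series([10, 20, 30, 40, 50, 60, 70, 1000])

# Custom edges: intervals are right-inclusive, so 18 falls in "Child".
age_group = pd.cut(ages, bins=[0, 18, 35, 50, 65, 100],
                   labels=["Child", "Young Adult", "Adult", "Senior", "Elderly"])

# Equal-width: 4 bins of identical width across the income range.
equal_width = pd.cut(income, bins=4, labels=["B1", "B2", "B3", "B4"])

# Equal-frequency: 4 bins holding (roughly) the same number of rows.
quartile = pd.qcut(income, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(age_group.tolist())
print(equal_width.tolist())
print(quartile.tolist())
```

With one extreme income, equal-width binning puts seven of the eight values in the first bin, while quantile binning still places two values in each bin.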

---

5. Reshaping Data

a) Pivot Table

pivot = df.pivot_table(values="sales", index="city", columns="month", aggfunc="sum")

b) Melt (Unpivot)

df_melted = pd.melt(df, id_vars=["name"], value_vars=["math", "science"],
                     var_name="subject", value_name="score")
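
Melt and pivot are inverse operations. The sketch below takes a small wide table to long format and back. The student names and scores are made up for illustration.

```python
import pandas as pd

# Made-up wide-format data: one row per student, one column per subject.
wide = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "math": [90, 75],
    "science": [85, 80],
})

# Wide → long: one row per (student, subject) pair.
long = pd.melt(wide, id_vars=["name"], value_vars=["math", "science"],
               var_name="subject", value_name="score")

# Long → wide again: subjects become columns, scores fill the cells.
back = long.pivot_table(values="score", index="name",
                        columns="subject", aggfunc="sum").reset_index()

print(long)
print(back)
```

Long format tends to suit grouping and plotting, while wide format is usually easier to read and to feed into models that expect one column per feature.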

---

Summary

  • Data Transformation prepares raw data for analysis and modeling.
  • Feature scaling (Min-Max, Standard, Robust) ensures features are on comparable scales.
  • Categorical encoding (Label, One-Hot, Ordinal) converts text categories to numbers.
  • Mathematical transformations (Log, Square Root, Box-Cox, Yeo-Johnson) reduce skewness and bring distributions closer to normal.
  • Binning converts continuous data into meaningful categorical groups.
  • Reshaping (Pivot, Melt) restructures data for different analytical perspectives.