Data Transformation
Data Transformation is the process of converting data from one format, structure, or value range into another. It is a critical preprocessing step that ensures data is in the right shape and scale for effective analysis and machine learning.
Formal Definition
Data Transformation refers to the application of mathematical, statistical, or structural operations to data to make it more suitable for analysis. This includes scaling numerical features, encoding categorical variables, normalizing distributions, and reshaping data structures.
---
Why Data Transformation is Necessary
- Algorithms are sensitive to scale: Distance-based algorithms (KNN, K-Means, SVM) are heavily affected by features with different scales. A feature measured in thousands (salary) will dominate a feature measured in tens (age).
- Non-normal distributions: Many statistical tests and some models (for example, inference in linear regression, or Gaussian Naive Bayes) assume normally distributed data or residuals. Transformations can bring the data closer to that assumption.
- Categorical data must be encoded: Most machine learning algorithms work with numbers, not text labels like "Male" or "Female."
- Reducing skewness: In highly skewed data, a few extreme values can dominate fitted parameters and error metrics, which distorts model performance.
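The scale-sensitivity point can be made concrete with a small distance calculation. The sketch below uses three made-up people described by (age, salary); the values and the `euclidean` helper are illustrative assumptions, not part of any dataset discussed here.

```python
import numpy as np

# Hypothetical people described as (age in years, salary in currency units).
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # very different age, similar salary
c = np.array([26.0, 52_000.0])   # similar age, somewhat different salary


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


dist_ab = euclidean(a, b)
dist_ac = euclidean(a, c)

# On raw features, a looks closer to b than to c even though b is 35 years
# older, because the salary difference swamps the age difference.
print(f"a-b: {dist_ab:.1f}, a-c: {dist_ac:.1f}")
```

Without scaling, the 35-year age gap between `a` and `b` contributes almost nothing to the distance, which is the behaviour feature scaling is meant to correct.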
---
Types of Data Transformation
1. Scaling (Feature Scaling)
Scaling brings numerical features to a comparable range or spread so that no single feature dominates.
a) Min-Max Normalization (Rescaling to 0–1)
Formula: X_norm = (X - X_min) / (X_max - X_min)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[["age", "salary"]] = scaler.fit_transform(df[["age", "salary"]])
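As a quick check of the Min-Max formula, the following sketch applies it by hand to a made-up column of ages and compares the result with `MinMaxScaler`. The sample values are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up sample column (scikit-learn expects a 2-D array: rows x features).
ages = np.array([[20.0], [30.0], [40.0], [60.0]])

# Manual application of (X - X_min) / (X_max - X_min).
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())  # the minimum maps to 0 and the maximum maps to 1
```

Here the minimum (20) maps to 0, the maximum (60) maps to 1, and 30 lands at 0.25 because it sits a quarter of the way along the range.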
b) Standardization (Z-Score Scaling)
Formula: X_std = (X - μ) / σ
Transforms data to have mean = 0 and standard deviation = 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["age", "salary"]] = scaler.fit_transform(df[["age", "salary"]])
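One detail worth checking when reproducing z-scores by hand: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, on a made-up `df_demo` frame, shows the two conventions side by side.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0]})  # made-up values

z_sklearn = StandardScaler().fit_transform(df_demo[["age"]]).ravel()

# Manual z-scores with the population standard deviation (ddof=0).
centered = df_demo["age"] - df_demo["age"].mean()
z_manual = (centered / df_demo["age"].std(ddof=0)).to_numpy()

# pandas' default std() uses ddof=1, so these values are slightly smaller
# in magnitude than the scikit-learn result.
z_sample = (centered / df_demo["age"].std()).to_numpy()

print(z_sklearn)
print(z_sample)
```

The difference shrinks as the number of rows grows, but it explains small mismatches when comparing hand-computed values with scikit-learn output on small samples.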
c) Robust Scaling
Uses median and IQR instead of mean and standard deviation. Resistant to outliers.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df[["age", "salary"]] = scaler.fit_transform(df[["age", "salary"]])
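To illustrate why median and IQR are more outlier-resistant, the sketch below scales a made-up salary column containing one extreme value with both `RobustScaler` and `StandardScaler`, and verifies the robust result against the (X - median) / IQR formula. The data are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up salaries (in thousands) with one extreme outlier at the end.
salary = np.array([[30.0], [35.0], [40.0], [45.0], [50.0], [1000.0]])

robust = RobustScaler().fit_transform(salary).ravel()
standard = StandardScaler().fit_transform(salary).ravel()

# Manual check: (X - median) / (Q3 - Q1).
q1, med, q3 = np.percentile(salary, [25, 50, 75])
manual = ((salary - med) / (q3 - q1)).ravel()

# Spread of the five ordinary values after each scaling.
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()
print(robust_spread, standard_spread)
```

With standardization, the outlier inflates the standard deviation so much that the five ordinary salaries are squeezed into a narrow band; robust scaling keeps them well separated.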
Comparison of Scaling Methods
| Method | Formula | When to Use |
|---|---|---|
| Min-Max | (X - min) / (max - min) | When you need values in a fixed range (0–1) |
| Standard | (X - μ) / σ | When data is roughly normally distributed, or the model expects zero-centered features |
| Robust | (X - median) / IQR | When data has significant outliers |
---
2. Encoding Categorical Variables
Machine learning models require numerical inputs. Categorical variables must be converted to numbers.
a) Label Encoding
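Assigns an integer to each category. scikit-learn's LabelEncoder is intended for target labels and numbers the categories in sorted (alphabetical) order, so the codes reflect a natural order only when that order happens to match the alphabet. For ordinal features, prefer Ordinal Encoding (section c below) with an explicit category order.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["education"] = le.fit_transform(df["education"])
# Codes follow sorted label order: Bachelor's → 0, High School → 1, Master's → 2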
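The mapping that `LabelEncoder` actually produces can be inspected through its `classes_` attribute. The sketch below uses a made-up `education` series; note that the codes come from sorting the labels, not from their educational rank.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample column.
education = pd.Series(["High School", "Bachelor's", "Master's", "Bachelor's"])

le = LabelEncoder()
codes = le.fit_transform(education)

# classes_ holds the sorted unique labels; each label's index is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
```

Because "Bachelor's" sorts before "High School", it receives code 0, which is why an explicit ordering is needed whenever the rank of the categories matters.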
b) One-Hot Encoding
Creates a binary column for each category. Suitable for nominal data (no inherent order).
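import pandas as pd
# drop_first=True drops the first (alphabetically sorted) category to avoid a redundant column
df_encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
# Creates: city_Kolkata, city_Mumbai (indicator columns); Delhi is implied when both are 0/False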
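The effect of `drop_first` is easiest to see by comparing the generated columns with and without it. This sketch uses a small made-up `df_city` frame.

```python
import pandas as pd

# Made-up sample data.
df_city = pd.DataFrame({"city": ["Delhi", "Mumbai", "Kolkata", "Delhi"]})

# One indicator column per category.
full = pd.get_dummies(df_city, columns=["city"])

# drop_first=True removes the first category in sorted order (Delhi here).
reduced = pd.get_dummies(df_city, columns=["city"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one column loses no information, since a row with all remaining indicators off must belong to the dropped category; this is mainly useful for linear models, where the full set of indicators is perfectly collinear with the intercept.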
c) Ordinal Encoding
Explicitly maps categories to ordered integers.
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df[["priority"]] = oe.fit_transform(df[["priority"]])
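A short round trip shows how the explicit category list controls the codes. The `df_prio` frame below is a made-up example; the encoder returns floats, and `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample data.
df_prio = pd.DataFrame({"priority": ["Medium", "Low", "High", "Low"]})

# The list order defines the codes: Low -> 0, Medium -> 1, High -> 2.
oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = oe.fit_transform(df_prio[["priority"]]).ravel()

# Map the codes back to labels (inverse_transform expects a 2-D array).
decoded = oe.inverse_transform(encoded.reshape(-1, 1)).ravel()

print(encoded)
print(decoded)
```

Without the `categories` argument, the encoder would fall back to sorted order (High, Low, Medium), which would not match the intended ranking.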
Encoding Comparison
| Method | Type of Data | Preserves Order? | Creates Extra Columns? |
|---|---|---|---|
| Label Encoding | Target labels | No (codes follow sorted label order) | No |
| One-Hot Encoding | Nominal | No (binary flags) | Yes |
| Ordinal Encoding | Ordinal | Yes (explicit) | No |
---
3. Mathematical Transformations
Used to reduce skewness, stabilize variance, or normalize distributions.
a) Log Transformation
import numpy as np
# Reduces right-skewness
df["income_log"] = np.log1p(df["income"])  # log(1 + x) handles zeros; requires x > -1
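The skew reduction can be measured directly with pandas' `Series.skew()`. The income values below are invented for illustration, with one very large value creating a long right tail.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed incomes: most values are modest, one is very large.
income = pd.Series(
    [20_000, 25_000, 30_000, 40_000, 60_000, 120_000, 500_000], dtype=float
)

income_log = np.log1p(income)

skew_before = income.skew()
skew_after = income_log.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# log1p is defined at zero, unlike a plain log.
zero_result = np.log1p(0.0)
```

The log compresses large values far more than small ones, which pulls the long right tail in and lowers the skewness statistic.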
b) Square Root Transformation
df["count_sqrt"] = np.sqrt(df["count"])
c) Box-Cox Transformation
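Finds, by maximum likelihood, the power parameter λ that makes the data closest to normal. Only works on strictly positive values.
from scipy.stats import boxcox
# The +1 shift assumes income >= 0, so every value passed to boxcox is strictly positive
df["income_bc"], lambda_val = boxcox(df["income"] + 1)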
print(f"Optimal lambda: {lambda_val}")
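Two practical points about Box-Cox can be sketched on made-up data: the fitted lambda is needed to undo the transformation (here via `scipy.special.inv_boxcox`), and non-positive input is rejected with a `ValueError`. The `income_k` values (incomes in thousands) are an assumption for illustration.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Made-up, strictly positive incomes in thousands.
income_k = np.array([20.0, 25.0, 30.0, 40.0, 60.0, 120.0, 500.0])

# With no lambda supplied, boxcox returns the transformed data and the
# maximum-likelihood estimate of lambda.
transformed, lam = boxcox(income_k)

# Undo the transformation with the same lambda.
restored = inv_boxcox(transformed, lam)

# Data containing a zero (or negative value) is rejected.
try:
    boxcox(np.array([0.0, 1.0, 2.0]))
    rejected = False
except ValueError:
    rejected = True

print(f"lambda: {lam:.3f}, rejected non-positive input: {rejected}")
```

Keeping `lam` around matters whenever predictions made on the transformed scale need to be reported in the original units.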
d) Yeo-Johnson Transformation
Similar to Box-Cox but works with zero and negative values.
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default: output has zero mean and unit variance
df[["income"]] = pt.fit_transform(df[["income"]])
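The following sketch applies Yeo-Johnson to a made-up profit/loss column containing negative values and a zero, which Box-Cox could not handle directly. It also shows that the fitted transformer can invert its own output; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up profit/loss figures: mixed signs plus a zero.
profit = np.array([[-50.0], [-5.0], [0.0], [10.0], [40.0], [300.0], [2000.0]])

# standardize=True (the default) also rescales the output to zero mean
# and unit variance after the power transform.
pt = PowerTransformer(method="yeo-johnson")
transformed = pt.fit_transform(profit)

# The fitted lambda (one per column) is stored on the transformer.
print(pt.lambdas_)

# Recover the original values from the transformed ones.
restored = pt.inverse_transform(transformed)
```

Because the default also standardizes, the output is not on the original scale; pass `standardize=False` if only the power transform is wanted.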
Transformation Comparison
| Transformation | Handles Zeros? | Handles Negatives? | Best For |
|---|---|---|---|
| Log | Use log1p | No | Right-skewed data |
| Square Root | Yes | No | Moderate skewness, count data |
| Box-Cox | No (shift values to be strictly positive first) | No | Positive data, finding optimal power |
| Yeo-Johnson | Yes | Yes | Data with mixed signs |
---
4. Binning (Discretization)
Converts continuous variables into categorical bins.
# Custom bin edges; intervals are right-closed: (0, 18], (18, 35], ... so an age of exactly 0 becomes NaN
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 65, 100],
                         labels=["Child", "Young Adult", "Adult", "Senior", "Elderly"])
# Quantile bins (equal frequency)
df["income_quartile"] = pd.qcut(df["income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
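The difference between width-based and frequency-based binning shows up clearly on skewed data. In this sketch, `pd.cut` with an integer `bins` produces equal-width intervals, while `pd.qcut` produces intervals holding roughly equal counts; the `values` series is made up, with one large outlier.

```python
import pandas as pd

# Made-up data with one large value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Passing an integer to cut() splits the observed range into equal widths.
width_bins = pd.cut(values, bins=4)

# qcut() places the edges at quantiles, so each bin gets a similar count.
freq_bins = pd.qcut(values, q=4)

# Count rows per bin, in bin order (empty bins are reported as 0).
width_counts = width_bins.value_counts().sort_index().tolist()
freq_counts = freq_bins.value_counts().sort_index().tolist()

print(width_counts)
print(freq_counts)
```

With equal widths, the outlier stretches the range so that nearly all rows fall into the first bin and the middle bins stay empty; quantile bins avoid this at the cost of uneven interval widths.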
---
5. Reshaping Data
a) Pivot Table
pivot = df.pivot_table(values="sales", index="city", columns="month", aggfunc="sum")
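A small made-up `sales` frame shows what the pivot produces: one row per city, one column per month, and duplicate (city, month) pairs combined by the aggregation function.

```python
import pandas as pd

# Made-up long-format sales records; Delhi has two January rows.
sales = pd.DataFrame({
    "city":  ["Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "month": ["Jan", "Jan", "Feb", "Jan", "Feb"],
    "sales": [100, 50, 80, 70, 90],
})

# Rows become cities, columns become months, cells hold summed sales.
pivot = sales.pivot_table(values="sales", index="city", columns="month", aggfunc="sum")

print(pivot)
```

The two January rows for Delhi are summed into a single cell, and the month columns appear in sorted order (Feb before Jan) rather than calendar order.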
b) Melt (Unpivot)
df_melted = pd.melt(df, id_vars=["name"], value_vars=["math", "science"],
                    var_name="subject", value_name="score")
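Melting goes the other way, from wide to long. The sketch below uses a made-up `wide` frame with one row per student and one column per subject; the student names are placeholders.

```python
import pandas as pd

# Made-up wide-format scores: one column per subject.
wide = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "math": [90, 75],
    "science": [85, 80],
})

# Each (name, subject) pair becomes its own row.
long = pd.melt(wide, id_vars=["name"], value_vars=["math", "science"],
               var_name="subject", value_name="score")

print(long)
```

Two students with two subjects each yield four rows, with the former column names stored in `subject` and the cell values in `score`.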
---
Summary
- Data Transformation prepares raw data for analysis and modeling.
- Feature scaling (Min-Max, Standard, Robust) ensures features are on comparable scales.
- Categorical encoding (Label, One-Hot, Ordinal) converts text categories to numbers.
- Mathematical transformations (Log, Box-Cox, Yeo-Johnson) normalize skewed distributions.
- Binning converts continuous data into meaningful categorical groups.
- Reshaping (Pivot, Melt) restructures data for different analytical perspectives.