Handling Missing Values
Missing values are one of the most pervasive data quality issues in real-world datasets. How you handle them can significantly impact the accuracy and reliability of your analysis and models.
Formal Definition
A missing value (also called a null value, NaN, or NA) is a data point where no value is stored for the observation in a particular variable. In Python and pandas, missing values are typically represented as NaN (Not a Number) or None; pandas also uses NaT for missing datetimes and pd.NA for its nullable dtypes.
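As a small illustration of these representations, the sketch below builds two toy Series (the names `s_num` and `s_obj` are made up for this example) and shows how pandas flags missing entries in each.

```python
import numpy as np
import pandas as pd

# In a float column, both None and np.nan are stored as NaN.
s_num = pd.Series([1.0, None, np.nan])

# In an object column, None is kept as-is, but isna() still flags it.
s_obj = pd.Series(["a", None, "c"], dtype=object)

print(s_num.isna().tolist())  # [False, True, True]
print(s_obj.isna().tolist())  # [False, True, False]

# NaN never compares equal to itself, so test with isna() rather than ==.
print(np.nan == np.nan)  # False
```

The practical consequence is that checks such as `value == np.nan` silently miss every missing entry, which is why `isna()` (or its alias `isnull()`) is the usual test.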
---
Why Values Go Missing
| Cause | Description | Example |
|---|---|---|
| Human Error | Manual data entry mistakes | Forgetting to fill a form field |
| System Failure | Sensor malfunction or software bugs | Temperature sensor offline for 2 hours |
| Survey Design | Respondent skips optional questions | "Prefer not to answer" on income |
| Data Merging | Joining datasets with mismatched keys | Left join creates NaN for unmatched rows |
| Privacy Restrictions | Sensitive data intentionally withheld | Medical records with redacted fields |
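The data-merging cause is easy to reproduce. The following is a minimal sketch with two invented tables (`customers` and `orders` are hypothetical names and values); a left join keeps every customer and leaves NaN where no order matches.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 3], "amount": [20.0, 35.5]})

# how="left" keeps all customers; indicator=True adds a _merge column
# that records whether each row found a match in the right-hand table.
merged = customers.merge(orders, on="customer_id", how="left", indicator=True)
print(merged)

# Rows marked "left_only" had no matching order, so their amount is NaN.
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(n_unmatched)
```

Checking the `_merge` column right after a join is one way to tell merge-induced gaps apart from values that were already missing in the source tables.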
---
Types of Missing Data
Understanding why data is missing is crucial for choosing the right strategy:
| Type | Full Name | Description | Example |
|---|---|---|---|
| MCAR | Missing Completely at Random | Missingness has no relationship to any variable | Random sensor glitch |
| MAR | Missing at Random | Missingness depends on observed variables but not the missing value itself | Men less likely to report weight |
| MNAR | Missing Not at Random | Missingness depends on the missing value itself | High-income people refusing to report income |
Why This Matters:
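- MCAR: Deleting rows does not bias estimates, although it still reduces sample size and statistical power.
- MAR: Imputation methods that condition on the observed variables are appropriate.
- MNAR: The most challenging case; it requires domain knowledge or models that describe the missingness mechanism explicitly.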
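To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is synthetic and assumed for illustration: the age/income relationship, the missingness probabilities, and the variable names. It hides income values under each mechanism and compares the mean of the remaining values with the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic population: income rises with age (an assumption for this demo).
age = rng.uniform(20, 70, n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, n)
true_mean = income.mean()

masks = {
    # MCAR: every value has the same 30% chance of being missing.
    "MCAR": rng.random(n) < 0.3,
    # MAR: the chance depends only on age, which is always observed.
    "MAR": rng.random(n) < np.where(age > 45, 0.5, 0.1),
    # MNAR: the chance depends on the income value itself.
    "MNAR": rng.random(n) < np.where(income > np.median(income), 0.5, 0.1),
}

# Bias of the complete-case mean (what you get after simply dropping rows).
bias = {name: income[~mask].mean() - true_mean for name, mask in masks.items()}
for name, value in bias.items():
    print(f"{name}: bias of observed mean = {value:,.0f}")
```

In this setup the complete-case mean stays close to the truth only under MCAR. Under MAR the bias can be corrected by conditioning on age, whereas under MNAR the observed data alone do not contain enough information to correct it.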
---
Detecting Missing Values
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
# Total missing values per column
print(df.isnull().sum())
# Percentage of missing values per column
print((df.isnull().sum() / len(df)) * 100)
# Heatmap visualization of missing values
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap="viridis")
plt.title("Missing Value Heatmap")
plt.show()
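The count and percentage checks can be combined into one report. The helper below is a sketch; `missing_summary` and the `demo` frame are hypothetical names introduced for this example, not part of pandas.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    summary = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean() * 100,
    })
    return summary.sort_values("pct_missing", ascending=False)

demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25, np.nan, 40, np.nan],
    "city": ["Oslo", "Rome", None, "Lima"],
})
report = missing_summary(demo)
print(report)
```

Sorting by percentage puts the columns that need attention at the top, which is convenient on wide datasets.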
---
Strategies for Handling Missing Values
1. Deletion Methods
a) Listwise Deletion (Dropping Rows)
Remove entire rows that contain any missing value.
# Drop all rows with any NaN
df_clean = df.dropna()
# Drop rows where specific columns have NaN
df_clean = df.dropna(subset=["age", "income"])
When to use: When only a small share of rows is affected (a common rule of thumb is under 5%) and the data is plausibly MCAR.
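Before dropping rows it helps to measure how much data would be lost, because moderate per-column missingness can add up across columns. The sketch below uses an invented five-row frame (`demo`) and also shows the `thresh` argument of `dropna`, which keeps rows that have at least a given number of non-missing values.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [1.0, 2.0, np.nan, np.nan, 5.0],
    "c": [1.0, 2.0, 3.0, np.nan, 5.0],
})

# Listwise deletion: any row with at least one NaN is removed.
complete = demo.dropna()
fraction_lost = 1 - len(complete) / len(demo)
print(f"Rows lost by dropna(): {fraction_lost:.0%}")

# A gentler variant: keep rows with at least 2 non-missing values.
mostly_complete = demo.dropna(thresh=2)
print(len(mostly_complete))
```

Here no single column is more than 40% missing, yet listwise deletion removes most of the rows, which is the main practical risk of this method.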
b) Column Deletion (Dropping Columns)
Remove entire columns that have too many missing values.
# Drop columns with more than 50% missing values
# (isnull().mean() gives the fraction of missing values per column)
threshold = 0.5
df_clean = df.loc[:, df.isnull().mean() <= threshold]
When to use: When a column has >50% missing values and is not critical for analysis.
---
2. Imputation Methods (Filling Missing Values)
a) Mean/Median/Mode Imputation
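# Fill with mean (for normally distributed numerical data)
# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify df
df["age"] = df["age"].fillna(df["age"].mean())
# Fill with median (for skewed numerical data; more robust to outliers)
df["income"] = df["income"].fillna(df["income"].median())
# Fill with mode (for categorical data); mode() can return ties, so take the first
df["city"] = df["city"].fillna(df["city"].mode()[0])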
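One side effect of filling with a single statistic is worth seeing in numbers. The sketch below uses synthetic data (the distribution, the 30% missing rate, and the variable names are assumptions for this example) to show that mean imputation leaves the mean unchanged but shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
values = pd.Series(rng.normal(50, 10, 1_000))

# Hide roughly 30% of the values completely at random.
with_gaps = values.mask(rng.random(len(values)) < 0.3)

# Fill every gap with the mean of the observed values.
filled = with_gaps.fillna(with_gaps.mean())

print(f"std of observed values:    {with_gaps.std():.2f}")
print(f"std after mean imputation: {filled.std():.2f}")
```

Every imputed point sits exactly at the mean, so the filled column looks less variable than the real data. That can understate uncertainty in anything computed from it afterwards.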
b) Forward Fill and Backward Fill (for time series)
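# Forward fill: carry the last known value forward
# (fillna(method=...) is deprecated in recent pandas; use ffill() / bfill())
df["temperature"] = df["temperature"].ffill()
# Backward fill: use the next known value
# (if run after the forward fill, this only fills gaps at the start of the series)
df["temperature"] = df["temperature"].bfill()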
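For long gaps, carrying one value forward indefinitely can be misleading. The sketch below uses an invented hourly series (`temps`) to show two alternatives: capping the fill with `limit`, and time-based linear interpolation. Which one is appropriate depends on the signal, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="h")
temps = pd.Series([20.0, np.nan, np.nan, np.nan, 24.0, 25.0], index=idx)

# Forward fill, but carry a value across at most one consecutive gap.
capped = temps.ffill(limit=1)

# Linear interpolation weighted by the time between timestamps.
interpolated = temps.interpolate(method="time")

print(capped)
print(interpolated)
```

With `limit=1` the longer outage stays visibly missing instead of being filled with a stale reading, while interpolation assumes the value changed smoothly between the two known points.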
c) Constant Value Imputation
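# Fill with a constant value (assign back rather than using inplace=True on a column)
df["status"] = df["status"].fillna("Unknown")
# Note: 0 is only a sensible fill if a missing score really means zero
df["score"] = df["score"].fillna(0)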
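A constant fill erases the fact that a value was missing. One common companion technique is to record that fact in a separate flag column before filling. The sketch below does this on an invented frame; the column name `score_was_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"score": [88.0, np.nan, 72.0, np.nan]})

# Record which rows were missing BEFORE filling, otherwise the
# information is lost once the NaNs are replaced.
scores["score_was_missing"] = scores["score"].isna()
scores["score"] = scores["score"].fillna(0)

print(scores)
```

A downstream model or analysis can then tell a genuine zero from a filled-in one.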
d) K-Nearest Neighbors (KNN) Imputation
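Fills each missing value with the average of that feature across the k most similar records, where similarity is measured on the features both records have observed.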
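from sklearn.impute import KNNImputer
# KNNImputer only accepts numeric input, so select the numeric columns once
numeric = df.select_dtypes(include=[np.number])
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,  # keep the original row labels so results align with df
)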
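Because KNN imputation relies on distances, features on large scales (such as income) can dominate features on small scales (such as age). One possible approach, sketched below on an invented frame (`people`), is to standardize first, impute, and then convert back to the original units. `StandardScaler` ignores NaN when fitting and keeps it when transforming, which makes this ordering workable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

people = pd.DataFrame({
    "age": [25, 27, 52, 55, 26],
    "income": [30_000, 32_000, 90_000, 95_000, np.nan],
})

# 1) Standardize each column (NaNs are skipped in fit, preserved in transform).
scaler = StandardScaler()
scaled = scaler.fit_transform(people)

# 2) Impute in the scaled space using the 2 nearest rows.
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# 3) Map back to the original units and restore labels.
people_imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=people.columns,
    index=people.index,
)
print(people_imputed)
```

The 26-year-old's income is filled from the two rows closest in age rather than from the overall average. With several features in different units, the scaling step is what keeps any single one from deciding who counts as a neighbor.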
e) Iterative Imputation (MICE - Multiple Imputation by Chained Equations)
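# IterativeImputer is still experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
# Note: IterativeImputer is inspired by MICE, but with these settings it
# returns a single imputed dataset rather than multiple imputations
numeric = df.select_dtypes(include=[np.number])
imputer = IterativeImputer(max_iter=10, random_state=42)
df_imputed = pd.DataFrame(
    imputer.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,  # keep the original row labels so results align with df
)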
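The "multiple" in MICE refers to producing several completed datasets and pooling the results. With scikit-learn this can be approximated by running `IterativeImputer` repeatedly with `sample_posterior=True` and different seeds. The sketch below does so on synthetic data; the data-generating process, the number of imputations (5), and the simple pooling of a mean are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
data = pd.DataFrame({"x": x, "y": y})
data.loc[rng.random(200) < 0.3, "y"] = np.nan  # hide ~30% of y

estimates = []
for seed in range(5):
    # sample_posterior=True draws imputations from the predictive
    # distribution, so each seed yields a different completed dataset.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(data)
    estimates.append(completed[:, 1].mean())

pooled_mean = float(np.mean(estimates))
between_var = float(np.var(estimates, ddof=1))
print(pooled_mean, between_var)
```

The spread between the per-run estimates reflects uncertainty due to the missing values, which a single imputed dataset hides.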
---
Comparison of Imputation Strategies
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Deletion | Small % of missing, MCAR | Simple, no bias if MCAR | Loses data, biased if not MCAR |
| Mean | Normally distributed numeric data | Easy to implement | Understates variance, sensitive to outliers |
| Median | Skewed numeric data | Robust to outliers | Ignores feature relationships |
| Mode | Categorical data | Simple, preserves categories | Over-represents dominant category |
| Forward/Backward Fill | Time-series data | Preserves temporal patterns | Not suitable for non-sequential data |
| KNN | Moderately missing data | Uses inter-feature relationships | Computationally expensive for large data |
| MICE | Complex multivariate data | Most sophisticated, handles MAR | Slow, complex to tune |
---
Summary
- Missing values arise from human error, system failures, survey design, data merging, and privacy restrictions.
- Understanding the type of missingness (MCAR, MAR, MNAR) guides the correct approach.
- Deletion is simple but can lose information; imputation preserves data but introduces assumptions.
- Advanced methods like KNN and MICE use relationships between features for more accurate imputation.
- Always validate the impact of your chosen strategy on downstream analysis.
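One way to validate a strategy is to hide values whose true answer is known, impute them, and measure the error. The sketch below does this on synthetic data. The two-column dataset, the 20% masking rate, and the helper name `rmse_for` are assumptions made for this example; on real data you would mask entries from the observed portion of your own columns.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
truth = np.column_stack([x, 3 * x + rng.normal(scale=0.3, size=n)])

# Hide 20% of the second column; the true values are kept for scoring.
hidden = rng.random(n) < 0.2
masked = truth.copy()
masked[hidden, 1] = np.nan

def rmse_for(imputer) -> float:
    """Impute the masked data and score only the entries that were hidden."""
    filled = imputer.fit_transform(masked)
    errors = filled[hidden, 1] - truth[hidden, 1]
    return float(np.sqrt(np.mean(errors ** 2)))

scores = {
    "mean": rmse_for(SimpleImputer(strategy="mean")),
    "knn": rmse_for(KNNImputer(n_neighbors=5)),
}
print(scores)
```

Because the second column here depends strongly on the first, the neighbor-based method recovers the hidden values much more closely than the mean. On data with weak relationships between features the gap may be small, which is exactly what this kind of check is meant to reveal.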