Data Quality: Missing Values & Outliers
1. Missing Values
Handling missing data is critical because most statistical methods and machine learning algorithms cannot process incomplete inputs.
Study Deep: MCAR, MAR, and MNAR
Understanding why data is missing is more important than knowing how to fill it.
- MCAR (Completely at Random): No pattern. (e.g., A sensor battery died). Solution: Delete or Mean Impute.
- MAR (At Random): Pattern depends on other data. (e.g., Younger users skip the 'Annual Income' field). Solution: Impute using Age.
- MNAR (Not at Random): Pattern depends on the missing value itself. (e.g., Users with very high income skip the field to protect privacy). Solution: Requires advanced domain modeling.
1. Missing Values
| Type | Abbreviation | Definition | Example | Implication |
|---|---|---|---|---|
| Missing Completely at Random | MCAR | No pattern — missingness is unrelated to any variable | A respondent accidentally skipped a question | Safe to delete rows (no bias introduced) |
| Missing at Random | MAR | Missingness depends on observed variables, not the missing value itself | Women are less likely to report age, but this is related to gender (observed), not age itself | Use imputation based on observed data |
| Missing Not at Random | MNAR | Missingness depends on the missing value itself | High-income earners skip the "Salary" field because their salary is high | Most difficult — may need domain-specific models |
Treatment Methods — Decision Guide:
| Method | Description | When to Use | Pros | Cons |
|---|---|---|---|---|
| Listwise Deletion | Remove entire rows with any missing data | Missing data < 5%, MCAR | Simple, preserves relationships | Loses data, biased if not MCAR |
| Pairwise Deletion | Use available data for each specific analysis | Moderate missingness | Maximizes available data | Inconsistent sample sizes |
| Mean Imputation | Fill with column average | Normal numerical data, MCAR | Simple, fast | Reduces variance, distorts distribution |
| Median Imputation | Fill with column median | Skewed numerical data with outliers | Robust to outliers | Still distorts distribution |
| Mode Imputation | Fill with most frequent value | Categorical data | Works for non-numeric data | Can overrepresent one category |
| Forward/Backward Fill | Fill with next or previous value | Time-series data | Maintains temporal patterns | Can propagate errors |
| KNN Imputation | Use K-Nearest Neighbors to predict missing value | Complex patterns, sufficient data | Considers relationships between features | Computationally expensive |
| Multiple Imputation | Create multiple plausible values, average results | Research / clinical data | Most statistically rigorous | Complex to implement |
2. Outliers
Formal Definition: An outlier is a data observation that lies at an abnormal distance from other values in the sample. Statistically, it is a point that falls outside the expected range of the data distribution.
Example: Salaries: [40k, 42k, 45k, 1M, 43k] -> 1M is an outlier.
Types of Outliers:
- Point Outlier: A single data point far from the rest (e.g., Age = 200 in a human dataset).
- Contextual Outlier: Abnormal in a specific context (e.g., 30°C is normal in summer, outlier in winter).
- Collective Outlier: A subset of data points that is anomalous as a group (e.g., sudden traffic spike on a server).
Detection Methods:
| Method | Formula / Rule | Assumption | Best For | ||
|---|---|---|---|---|---|
| Z-Score | Z = (X - μ) / σ; Outlier if | Z | > 3 | Data follows Normal Distribution | Normally distributed data |
| IQR | IQR = Q3 - Q1; Outlier if X < Q1 - 1.5IQR or X > Q3 + 1.5IQR | None (non-parametric) | Skewed data, general-purpose | ||
| Modified Z-Score | Uses Median instead of Mean for robustness | None | Data with existing outliers | ||
| Isolation Forest | ML-based anomaly detection | None | High-dimensional data, complex patterns |
Worked Example (IQR Method): Data: [10, 15, 18, 20, 22, 25, 100]
- Q1 (25th percentile) = 15
- Q3 (75th percentile) = 25
- IQR = 25 - 15 = 10
- Lower Bound =
15 - 1.5(10) = 0 - Upper Bound =
25 + 1.5(10) = 40 - 100 > 40 → 100 is an outlier.
Treatment Decision Framework:
| Action | When to Apply | Example |
|---|---|---|
| Remove | Data entry error or impossible value | Age = -5, Temperature = 999°C |
| Cap/Floor (Winsorization) | Replace extreme values with a threshold (e.g., 99th percentile) | Capping salaries at 99th percentile for fair comparison |
| Transform | Apply log or square root transformation to reduce impact | Log-transforming highly skewed income data |
| Keep | The outlier represents a genuine, meaningful rare event | Fraud detection, rare disease identification |
| Separate Analysis | Analyze outliers as a distinct group | VIP customers with abnormally high spending |
3. Python Code: Handling Missing Values
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Age': [25, 30, np.nan, 22, 35, np.nan],
'Salary': [50000, 60000, 45000, np.nan, 70000, 55000],
'Gender': ['M', 'F', 'M', np.nan, 'F', 'M']
})
# 1. Detect missing values
print(df.isnull().sum()) # Count per column
print(df.isnull().sum() / len(df)) # Percentage per column
# 2. Drop rows where > 50% values are missing
df_clean = df.dropna(thresh=len(df.columns) // 2)
# 3. Mean imputation (numeric columns)
df['Age'].fillna(df['Age'].mean(), inplace=True)
# 4. Median imputation (robust to outliers)
df['Salary'].fillna(df['Salary'].median(), inplace=True)
# 5. Mode imputation (categorical columns)
df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
4. Python Code: Outlier Detection with IQR
import pandas as pd
data = pd.Series([10, 15, 18, 20, 22, 25, 100])
# IQR Method
Q1 = data.quantile(0.25) # 15
Q3 = data.quantile(0.75) # 25
IQR = Q3 - Q1 # 10
lower = Q1 - 1.5 * IQR # 0
upper = Q3 + 1.5 * IQR # 40
outliers = data[(data < lower) | (data > upper)]
print("Outliers:", outliers.values) # [100]
# Z-Score Method
from scipy import stats
z_scores = stats.zscore(data)
outliers_z = data[abs(z_scores) > 3]
print("Z-Score Outliers:", outliers_z.values)
# Capping / Winsorization (replace outliers at threshold)
data_capped = data.clip(lower=lower, upper=upper)
print("After Capping:", data_capped.values) # 100 → 40
5. Exam-Ready Summary
| Concept | Formula | Decision Rule | ||
|---|---|---|---|---|
| Z-Score Outlier | Z = (X-μ)/σ | Outlier if | Z | > 3 |
| IQR Lower Bound | Q1 − 1.5×IQR | Outlier if X < Lower Bound | ||
| IQR Upper Bound | Q3 + 1.5×IQR | Outlier if X > Upper Bound | ||
| MCAR | No pattern to missingness | Safe to delete rows | ||
| MAR | Missingness depends on other columns | Impute using those columns | ||
| MNAR | Missingness depends on hidden value | Hardest — domain knowledge needed |