Outlier Detection
Outliers are data points that deviate significantly from the majority of observations. They can arise from data entry errors, measurement faults, or genuine rare events. Proper identification and handling of outliers is essential because they can skew statistical measures and degrade model performance.
Formal Definition
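An outlier is an observation that lies an abnormal distance from other values in a dataset; that is, a data point that falls well outside the overall pattern of the distribution. What counts as "abnormal" is not fixed: it depends on the detection rule chosen, such as the 1.5 × IQR fences or a Z-score threshold.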
---
Types of Outliers
| Type | Description | Example |
|---|---|---|
| Point Outlier | A single data point far from the rest | A salary of ₹1 Crore in a dataset of ₹30K–₹80K |
| Contextual Outlier | Abnormal only in a specific context | 40°C temperature in winter (normal in summer) |
| Collective Outlier | A group of data points that are collectively anomalous | A sudden spike in website traffic for 3 days |
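The difference between a point outlier and a contextual outlier can be sketched in code. The example below is a minimal illustration on made-up temperature readings (the `temps` frame and the `iqr_flags` helper are hypothetical names, not part of any library): the same 1.5 × IQR rule is applied once to the whole column and once within each season.

```python
import pandas as pd

# Hypothetical daily temperatures (°C) labelled by season
temps = pd.DataFrame({
    "season": ["winter"] * 6 + ["summer"] * 6,
    "temp_c": [4, 6, 5, 7, 5, 40, 38, 41, 39, 40, 42, 40],
})

def iqr_flags(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value falls outside the k * IQR fences of s."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Global check: judged against the whole year, 40 °C is unremarkable
temps["global_outlier"] = iqr_flags(temps["temp_c"])

# Contextual check: the same rule applied separately within each season
temps["contextual_outlier"] = (
    temps.groupby("season")["temp_c"].transform(iqr_flags).astype(bool)
)

print(temps[temps["contextual_outlier"]])
```

On this toy data the global check flags nothing, while the per-season check flags only the 40 °C winter reading.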
---
Why Outliers Occur
- Data Entry Errors: Typos or incorrect values (e.g., age entered as 999).
- Measurement Errors: Malfunctioning sensors or instruments.
- Natural Variation: Some rare events are genuine (e.g., extremely high-income individuals).
- Data Processing Errors: Incorrect merges or transformations.
- Sampling Errors: Non-representative sample capturing extreme cases.
---
Impact of Outliers
| Statistical Measure | Effect of Outliers |
|---|---|
| Mean | Highly sensitive — gets pulled toward the outlier |
| Median | Robust — not significantly affected |
| Standard Deviation | Inflated by outliers |
| Correlation | Can be artificially increased or decreased |
| Regression Models | Slope distorted, poor predictions |
| K-Means Clustering | Centroids pulled toward outliers |
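The first three rows of the table can be checked with a few lines of NumPy. This is a small sketch on invented salary figures (the values are illustrative only), adding a single ₹1 Crore salary to five ordinary ones.

```python
import numpy as np

# Five hypothetical salaries in the ₹30K–₹80K range
salaries = np.array([30_000, 42_000, 55_000, 61_000, 80_000], dtype=float)
# The same data with one ₹1 Crore (10,000,000) salary appended
with_outlier = np.append(salaries, 10_000_000)

for label, data in [("without outlier", salaries), ("with outlier", with_outlier)]:
    print(
        f"{label}: mean={data.mean():,.0f}  "
        f"median={np.median(data):,.0f}  "
        f"std={data.std(ddof=1):,.0f}"
    )
```

The mean jumps from 53,600 to roughly 1.71 million, while the median only moves from 55,000 to 58,000.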
---
Outlier Detection Methods
1. Visual Methods
a) Box Plot (Tukey's Method)
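The box plot uses the Interquartile Range (IQR) to define outlier boundaries: with the default whisker setting, points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are drawn as individual markers beyond the whiskers.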
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df["salary"])
plt.title("Box Plot — Salary Distribution")
plt.show()
b) Scatter Plot
Useful for detecting outliers in two-variable relationships.
plt.scatter(df["age"], df["income"])
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Age vs Income — Scatter Plot")
plt.show()
c) Histogram
Shows the distribution shape and highlights extreme tails.
df["age"].hist(bins=30)
plt.title("Age Distribution")
plt.show()
---
2. Statistical Methods
a) IQR (Interquartile Range) Method
One of the most widely used statistical methods for univariate outlier detection.
Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
outliers = df[(df["salary"] < lower_bound) | (df["salary"] > upper_bound)]
print(f"Number of outliers: {len(outliers)}")
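To make the fences concrete, here is a small worked sketch on ten invented values (salaries in thousands). The helper name `iqr_bounds` and the multiplier argument `k` are assumptions for illustration; `k=1.5` reproduces the rule above, and a larger `k` such as 3.0 is sometimes used to flag only the most extreme points.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries in thousands; the last value is clearly extreme
sample = pd.Series([30, 35, 40, 45, 50, 55, 60, 65, 80, 1000], name="salary_k")

lower, upper = iqr_bounds(sample)            # standard 1.5 * IQR fences
flagged = sample[(sample < lower) | (sample > upper)]

print(f"fences: [{lower}, {upper}]")
print("flagged:", flagged.tolist())
```

With pandas' default linear interpolation, Q1 = 41.25 and Q3 = 63.75, so the fences are 7.5 and 97.5 and only the value 1000 is flagged.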
b) Z-Score Method
Measures how many standard deviations a point is from the mean.
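import numpy as np
from scipy import stats
# Note: if the column contains missing values, zscore returns NaN for every row,
# so drop or fill them first
z_scores = np.abs(stats.zscore(df["salary"]))
outliers = df[z_scores > 3]  # Points beyond 3 standard deviations
print(f"Outliers (Z > 3): {len(outliers)}")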
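One caveat worth seeing in code: because the outlier itself inflates the mean and standard deviation, a Z-score can fail to flag an obvious extreme value in a small sample. The sketch below uses six invented numbers. With the population standard deviation that `stats.zscore` uses by default (`ddof=0`), no |z| can exceed sqrt(n − 1), so a threshold of 3 can never trigger when n ≤ 10.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample with one obviously extreme value
small = np.array([10, 12, 11, 13, 12, 100], dtype=float)

z = np.abs(stats.zscore(small))  # uses the population std (ddof=0) by default

n = len(small)
max_possible_z = np.sqrt(n - 1)  # upper bound on |z| for n points when ddof=0

print("largest |z|:", z.max())
print("upper bound:", max_possible_z)
print("flagged at |z| > 3:", small[z > 3].tolist())
```

Here the value 100 gets a Z-score of about 2.2, so nothing is flagged at the usual cutoff of 3.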
c) Modified Z-Score (Robust Method)
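Uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation, so the scores are not distorted by the outliers themselves.
import numpy as np
median = df["salary"].median()
mad = np.median(np.abs(df["salary"] - median))  # Median Absolute Deviation
# 0.6745 rescales the MAD so it is comparable to a standard deviation for normal data
modified_z = 0.6745 * (df["salary"] - median) / mad  # undefined if mad == 0
outliers = df[np.abs(modified_z) > 3.5]  # 3.5 is the commonly used cutoff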
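The calculation can be wrapped in a small function that also guards against a zero MAD, which happens when more than half of the values are identical. This is a sketch on six invented numbers; the function name `modified_z_scores` is hypothetical, and raising an error on zero MAD is one possible design choice among several.

```python
import numpy as np

def modified_z_scores(values) -> np.ndarray:
    """Modified Z-scores based on the median and the MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        # More than half the values equal the median; the score is undefined
        raise ValueError("MAD is zero; modified Z-scores are undefined for this data")
    return 0.6745 * (x - median) / mad

# Hypothetical small sample with one obviously extreme value
small = np.array([10, 12, 11, 13, 12, 100], dtype=float)

scores = modified_z_scores(small)
flagged = small[np.abs(scores) > 3.5]
print("flagged:", flagged.tolist())
```

For this sample the median is 12 and the MAD is 1, so the value 100 scores about 59 and is flagged, even though the sample is small.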
---
3. Machine Learning Methods
a) Isolation Forest
An unsupervised algorithm that isolates outliers by random partitioning. Outliers are isolated in fewer steps because they are few and different.
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
df["anomaly"] = iso_forest.fit_predict(df[["salary", "age"]])
# -1 = outlier, 1 = normal
outliers = df[df["anomaly"] == -1]
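For readers without a dataset at hand, the following sketch runs the same idea end to end on synthetic data. Everything about the data is an assumption made for illustration (200 normally distributed rows plus three hand-picked extreme rows), and `contamination=0.05` is simply carried over from the snippet above. It also shows `decision_function`, which returns a continuous score where lower values are more anomalous.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic, hypothetical data: 200 typical rows plus 3 extreme rows at the end
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "salary": rng.normal(50_000, 8_000, size=200),
    "age": rng.normal(35, 6, size=200),
})
extremes = pd.DataFrame({"salary": [400_000, 5_000, 350_000], "age": [22, 80, 70]})
data = pd.concat([data, extremes], ignore_index=True)

features = data[["salary", "age"]]
model = IsolationForest(contamination=0.05, random_state=42).fit(features)

labels = model.predict(features)            # -1 = outlier, 1 = normal
scores = model.decision_function(features)  # lower = more anomalous

data["label"] = labels
data["score"] = scores
print(data.sort_values("score").head())
```

Sorting by score rather than relying only on the -1/1 label makes it easier to review borderline cases, since the label depends directly on the chosen `contamination` value.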
b) DBSCAN (Density-Based Spatial Clustering)
Points in low-density regions are flagged as outliers (label = -1).
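from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# eps is a distance, so scale the features first; otherwise salary dominates age
X_scaled = StandardScaler().fit_transform(df[["salary", "age"]])
clustering = DBSCAN(eps=0.5, min_samples=10)  # eps is in scaled units; tune for your data
df["cluster"] = clustering.fit_predict(X_scaled)
outliers = df[df["cluster"] == -1]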
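Because `eps` is a distance in the units of the features, the scale of the inputs matters a great deal. The sketch below is a rough illustration on synthetic data (the data and the scaled-space parameters `eps=0.5`, `min_samples=5` are assumptions, not recommendations): it compares DBSCAN on raw salary and age values with DBSCAN on standardised values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data: 200 typical rows plus 3 extreme rows at the end
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "salary": rng.normal(50_000, 8_000, size=200),
    "age": rng.normal(35, 6, size=200),
})
extremes = pd.DataFrame({"salary": [400_000, 5_000, 350_000], "age": [22, 80, 70]})
data = pd.concat([data, extremes], ignore_index=True)

# Raw features: eps=3 means "within 3 units", so salaries must nearly coincide
raw_labels = DBSCAN(eps=3, min_samples=10).fit_predict(data[["salary", "age"]])

# Standardised features: eps is now measured in standard deviations
X_scaled = StandardScaler().fit_transform(data[["salary", "age"]])
scaled_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

print("noise points (raw):   ", int((raw_labels == -1).sum()))
print("noise points (scaled):", int((scaled_labels == -1).sum()))
```

On the raw features every row ends up labelled as noise, which says more about the units than about the data. After scaling, the bulk of the rows form a cluster and the three extreme rows are labelled -1.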
---
Handling Outliers
| Strategy | Method | When to Use |
|---|---|---|
| Remove | Drop outlier rows | When outliers are clearly errors |
| Cap (Winsorize) | Clip values to upper/lower bounds | When you want to keep all rows |
| Transform | Apply log or square root transformation | When data is heavily skewed |
| Impute | Replace outliers with the median (or a mean computed without them) | When removal is not an option |
| Keep | Leave outliers in the data | When outliers represent genuine rare events |
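The Remove and Impute strategies from the table can be sketched as follows. The data, the `df_demo` name, and the hard-coded `upper_bound` are hypothetical; in practice the boolean mask would come from whichever detection method was used.

```python
import pandas as pd

# Hypothetical salaries in thousands; the last value is clearly extreme
df_demo = pd.DataFrame({"salary": [30.0, 35, 40, 45, 50, 55, 60, 65, 80, 1000]})

upper_bound = 97.5  # hypothetical upper fence, e.g. from the IQR method
is_outlier = df_demo["salary"] > upper_bound

# Remove: keep only the rows that were not flagged
removed = df_demo[~is_outlier].reset_index(drop=True)

# Impute: replace flagged values with the median of the non-flagged values
clean_median = df_demo.loc[~is_outlier, "salary"].median()
imputed = df_demo.copy()
imputed.loc[is_outlier, "salary"] = clean_median

print(len(removed), "rows after removal")
print("imputed value:", clean_median)
```

Computing the replacement value from the non-flagged rows keeps the outlier from influencing its own substitute.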
Capping Example (Winsorization)
# Cap values to IQR boundaries
df["salary"] = df["salary"].clip(lower=lower_bound, upper=upper_bound)
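Capping is also often done at fixed percentiles rather than at the IQR fences. The following is a minimal sketch of that variant; the helper name `cap_at_percentiles` and the 5th/95th percentile defaults are assumptions chosen for illustration.

```python
import pandas as pd

def cap_at_percentiles(s: pd.Series, lower_pct: float = 0.05,
                       upper_pct: float = 0.95) -> pd.Series:
    """Clip values of s to its own lower_pct and upper_pct quantiles."""
    lo, hi = s.quantile(lower_pct), s.quantile(upper_pct)
    return s.clip(lower=lo, upper=hi)

# Hypothetical salaries in thousands; the last value is clearly extreme
sample = pd.Series([30, 35, 40, 45, 50, 55, 60, 65, 80, 1000], dtype=float)
capped = cap_at_percentiles(sample)

print(capped.tolist())
```

Note that with very few rows the upper percentile is itself pulled up by the extreme value, so percentile capping tends to work better on larger samples.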
Log Transformation Example
# Log transform to reduce the effect of extreme values
# (log1p computes log(1 + x), so it handles zeros but requires values > -1)
df["salary_log"] = np.log1p(df["salary"])
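The effect of the transformation can be checked by comparing skewness before and after. This sketch uses synthetic, right-skewed data drawn from a log-normal distribution (the distribution and its parameters are assumptions for illustration) and shows that `np.expm1` reverses the transform.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "salaries" drawn from a log-normal distribution
rng = np.random.default_rng(0)
salary = pd.Series(rng.lognormal(mean=10.8, sigma=0.6, size=1_000))

salary_log = np.log1p(salary)      # log(1 + x)
recovered = np.expm1(salary_log)   # inverse transform: exp(x) - 1

print("skewness before:", salary.skew())
print("skewness after: ", salary_log.skew())
```

Keeping the inverse in mind matters when a model is trained on the transformed column, because its predictions need to be mapped back to the original scale.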
---
Comparison of Outlier Detection Methods
| Method | Type | Pros | Cons |
|---|---|---|---|
| Box Plot / IQR | Statistical | Simple, visual, interpretable | Over-flags points in heavily skewed distributions |
| Z-Score | Statistical | Works well for normal distributions | Mean and standard deviation are themselves distorted by outliers |
| Modified Z-Score | Statistical | Robust with skewed data | Less widely known; undefined when the MAD is 0 |
| Isolation Forest | ML-based | Handles high-dimensional data well | Requires tuning contamination parameter |
| DBSCAN | ML-based | No assumption on data distribution | Sensitive to eps and min_samples parameters |
---
Summary
- Outliers are extreme data points that can arise from errors or genuine variation.
- They significantly impact mean, standard deviation, and model performance.
- Visual methods (box plots, scatter plots) provide quick identification.
- Statistical methods (IQR, Z-Score, Modified Z-Score) are effective for univariate detection.
- ML methods (Isolation Forest, DBSCAN) handle multivariate and high-dimensional outlier detection.
- The handling strategy (remove, cap, transform, impute, or keep) depends on the context and the cause of the outlier.