Outlier Detection

Lesson 34 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Outliers are data points that deviate significantly from the majority of observations. They can arise from data entry errors, measurement faults, or genuine rare events. Proper identification and handling of outliers is essential because they can skew statistical measures and degrade model performance.

Formal Definition

An outlier is an observation that lies an abnormal distance from other values in a dataset. Formally, it is a data point that falls significantly outside the overall pattern of a distribution.

---

Types of Outliers

| Type | Description | Example |
| --- | --- | --- |
| Point Outlier | A single data point far from the rest | A salary of ₹1 Crore in a dataset of ₹30K–₹80K |
| Contextual Outlier | Abnormal only in a specific context | 40°C temperature in winter (normal in summer) |
| Collective Outlier | A group of data points that are collectively anomalous | A sudden spike in website traffic for 3 days |
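
A contextual outlier only shows up when each value is compared with others from the same context. The sketch below is one possible way to illustrate this, assuming hypothetical temperature readings grouped by season and a median-based score (the modified Z-score covered later in this lesson). The helper name `modified_z` and all data values are invented for illustration.

```python
import statistics

def modified_z(values, x):
    """Median-based score of x relative to values: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return 0.6745 * (x - med) / mad  # assumes mad is not 0

# Hypothetical temperature readings in °C, grouped by context (season)
readings = {
    "winter": [8, 10, 12, 9, 11, 40],
    "summer": [38, 40, 41, 39, 42, 40],
}

# Score each reading only against readings from the same season
flagged = [
    (season, temp)
    for season, temps in readings.items()
    for temp in temps
    if abs(modified_z(temps, temp)) > 3.5
]
print(flagged)
```

Only the 40°C winter reading is flagged; the same value in summer is unremarkable.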

---

Why Outliers Occur

  • Data Entry Errors: Typos or incorrect values (e.g., age entered as 999).
  • Measurement Errors: Malfunctioning sensors or instruments.
  • Natural Variation: Some rare events are genuine (e.g., extremely high-income individuals).
  • Data Processing Errors: Incorrect merges or transformations.
  • Sampling Errors: A non-representative sample that over-represents extreme cases.

---

Impact of Outliers

| Statistical Measure | Effect of Outliers |
| --- | --- |
| Mean | Highly sensitive; pulled toward the outlier |
| Median | Robust; not significantly affected |
| Standard Deviation | Inflated by outliers |
| Correlation | Can be artificially increased or decreased |
| Regression Models | Slope distorted, poor predictions |
| K-Means Clustering | Centroids pulled toward outliers |
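
The difference in sensitivity between the mean and the median can be checked in a few lines. This is a minimal sketch using only the Python standard library; the salary values (in ₹ thousands) are hypothetical and chosen purely for illustration.

```python
import statistics

# Hypothetical salaries in ₹ thousands (illustrative values, not real data)
salaries = [30, 40, 50, 60, 70]
with_outlier = salaries + [10000]  # add one extreme value (₹1 Crore)

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)
stdev_before = statistics.stdev(salaries)
stdev_after = statistics.stdev(with_outlier)

print(f"Mean:   {mean_before} -> {mean_after:.2f}")
print(f"Median: {median_before} -> {median_after}")
print(f"Stdev:  {stdev_before:.2f} -> {stdev_after:.2f}")
```

A single extreme value moves the mean from 50 to about 1708, while the median only shifts from 50 to 55.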

---

Outlier Detection Methods

1. Visual Methods

a) Box Plot (Tukey's Method)

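The box plot uses the Interquartile Range (IQR) to define outlier boundaries: with the default settings, points more than 1.5 × IQR below Q1 or above Q3 are drawn as individual markers beyond the whiskers.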

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df["salary"])
plt.title("Box Plot — Salary Distribution")
plt.show()

b) Scatter Plot

Useful for detecting outliers in two-variable relationships.

plt.scatter(df["age"], df["income"])
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Age vs Income — Scatter Plot")
plt.show()

c) Histogram

Shows the distribution shape and highlights extreme tails.

df["age"].hist(bins=30)
plt.title("Age Distribution")
plt.show()

---

2. Statistical Methods

a) IQR (Interquartile Range) Method

One of the most widely used statistical methods for outlier detection. A value is flagged if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df["salary"] < lower_bound) | (df["salary"] > upper_bound)]
print(f"Number of outliers: {len(outliers)}")
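
As a quick check of the arithmetic, the same rule can be worked through on a small list. The sketch below assumes hypothetical salaries (in ₹ thousands) and uses the standard library instead of pandas; `iqr_bounds` is an illustrative helper name. `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which should give the same quartiles as the default pandas `quantile` behaviour.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # n=4 yields the three quartile cut points; the middle one (median) is unused
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries in ₹ thousands (illustrative values)
salaries = [30, 35, 40, 45, 50, 55, 60, 70, 80, 10000]

lower, upper = iqr_bounds(salaries)
outliers = [s for s in salaries if s < lower or s > upper]

print(f"Bounds: {lower} to {upper}")   # Q1 = 41.25, Q3 = 67.5, IQR = 26.25
print(f"Outliers: {outliers}")
```

Here the fences are 1.875 and 106.875, so only the value 10000 is flagged.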

b) Z-Score Method

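Measures how many standard deviations a point is from the mean: z = (x - mean) / standard deviation.

import numpy as np
from scipy import stats

# zscore uses the population standard deviation (ddof=0) by default
z_scores = np.abs(stats.zscore(df["salary"]))
outliers = df[z_scores > 3]  # Points more than 3 standard deviations from the mean
print(f"Outliers (|Z| > 3): {len(outliers)}")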

c) Modified Z-Score (Robust Method)

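Uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation, so it is more robust for skewed data.

import numpy as np

median = df["salary"].median()
mad = np.median(np.abs(df["salary"] - median))  # Median Absolute Deviation
# 0.6745 rescales the MAD so the score is comparable to a Z-score for normal data.
# Note: if more than half the values are identical, mad is 0 and the scores become inf or NaN.
modified_z = 0.6745 * (df["salary"] - median) / mad
outliers = df[np.abs(modified_z) > 3.5]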
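
To see why the robust variant matters, the sketch below compares both scores on the same small sample. It assumes a hypothetical list of nine salaries (in ₹ thousands) and uses only the standard library. With the population standard deviation, no value in a sample of n points can have |z| above sqrt(n - 1), so in a sample of nine the 3-sigma rule can never fire, however extreme the value.

```python
import statistics

# Hypothetical salaries in ₹ thousands; the last value is an obvious outlier
salaries = [30, 35, 40, 45, 50, 55, 60, 70, 10000]

# Classic Z-score: mean and (population) standard deviation
mean = statistics.mean(salaries)
std = statistics.pstdev(salaries)
z_flagged = [s for s in salaries if abs((s - mean) / std) > 3]

# Modified Z-score: median and Median Absolute Deviation (MAD)
median = statistics.median(salaries)
mad = statistics.median(abs(s - median) for s in salaries)
mod_flagged = [s for s in salaries if abs(0.6745 * (s - median) / mad) > 3.5]

print(f"Z-score flags:          {z_flagged}")
print(f"Modified Z-score flags: {mod_flagged}")
```

The outlier inflates the mean and standard deviation enough to hide itself from the Z-score, while the median-based score still flags it.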

---

3. Machine Learning Methods

a) Isolation Forest

An unsupervised algorithm that isolates outliers by random partitioning. Outliers are isolated in fewer steps because they are few and different.

from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.05, random_state=42)
df["anomaly"] = iso_forest.fit_predict(df[["salary", "age"]])
# -1 = outlier, 1 = normal
outliers = df[df["anomaly"] == -1]
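
The idea that outliers are isolated in fewer steps can be illustrated with a toy one-dimensional version of random partitioning. This is only a sketch of the principle, not scikit-learn's implementation: the function names, the salary values (in ₹ thousands), and the number of trials are all assumptions made for illustration.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count random splits until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # remaining values are identical; cannot split further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

rng = random.Random(42)
# Hypothetical salaries in ₹ thousands; 10000 is the planted outlier
salaries = [30, 35, 38, 40, 42, 45, 47, 50, 55, 60, 65, 70, 80, 10000]

def average_depth(target, trials=500):
    """Average isolation depth of `target` over many random partitionings."""
    return sum(isolation_depth(salaries, target, rng) for _ in range(trials)) / trials

outlier_depth = average_depth(10000)
inlier_depth = average_depth(47)
print(f"Average splits to isolate the outlier: {outlier_depth:.2f}")
print(f"Average splits to isolate an inlier:   {inlier_depth:.2f}")
```

The extreme value is usually separated by the very first split, whereas a typical value needs several more. Isolation Forest turns this difference in path length into an anomaly score.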

b) DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Points in low-density regions are flagged as outliers (label = -1).

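from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# DBSCAN is distance-based, so scale the features first; otherwise salary dominates age
X_scaled = StandardScaler().fit_transform(df[["salary", "age"]])
# eps=0.5 is scikit-learn's default and only a starting point; tune it for your data
clustering = DBSCAN(eps=0.5, min_samples=10)
df["cluster"] = clustering.fit_predict(X_scaled)
outliers = df[df["cluster"] == -1]  # label -1 = noise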
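
What "low-density region" means can be made concrete with a small brute-force sketch of DBSCAN's noise rule. A point is a core point if at least `min_samples` points (counting itself, as scikit-learn does) lie within `eps` of it, and a point is noise if it is neither a core point nor within `eps` of one. The function name `dbscan_noise` and the sample points are hypothetical, and the points are assumed to be already scaled.

```python
import math

def dbscan_noise(points, eps, min_samples):
    """Return indices of points that DBSCAN would label as noise (-1)."""
    n = len(points)
    # Neighbourhood of each point, including the point itself
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbours[i]) >= min_samples}
    # Noise: not a core point and not within eps of any core point
    return [
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbours[i])
    ]

# Hypothetical, already-scaled 2D points: one dense group and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]

noise = dbscan_noise(points, eps=1.5, min_samples=3)
print(f"Noise indices: {noise}")
```

Only the isolated point at index 5 has no dense neighbourhood, so it is the one reported as noise. The pairwise distance check is quadratic in the number of points, so this is for understanding the rule, not for real datasets.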

---

Handling Outliers

| Strategy | Method | When to Use |
| --- | --- | --- |
| Remove | Drop outlier rows | When outliers are clearly errors |
| Cap (Winsorize) | Clip values to upper/lower bounds | When you want to keep all rows |
| Transform | Apply log or square root transformation | When data is heavily skewed |
| Impute | Replace outliers with mean/median | When removal is not an option |
| Keep | Leave outliers in the data | When outliers represent genuine rare events |

Capping Example (Winsorization)

# Cap values to the IQR boundaries (lower_bound and upper_bound come from the IQR method above)
df["salary"] = df["salary"].clip(lower=lower_bound, upper=upper_bound)

Log Transformation Example

# Log transform to reduce the effect of extreme values
# log1p computes log(1 + x), so zeros are handled; values must be non-negative
df["salary_log"] = np.log1p(df["salary"])
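
Imputation Example

Imputation from the strategy table can be sketched in the same spirit. The example below is a minimal illustration, assuming hypothetical salaries (in ₹ thousands) held in a plain Python list and the IQR fences as the outlier rule. It replaces flagged values with the median, which is less affected by the outlier than the mean would be.

```python
import statistics

# Hypothetical salaries in ₹ thousands (illustrative values)
salaries = [30, 35, 40, 45, 50, 55, 60, 70, 80, 10000]

# IQR fences, computed with linear interpolation between data points
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Replace anything outside the fences with the median of the column
median = statistics.median(salaries)
imputed = [s if lower <= s <= upper else median for s in salaries]

print(imputed)
```

The extreme value 10000 is replaced by the median, 52.5, and every other value is left unchanged.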

---

Comparison of Outlier Detection Methods

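| Method | Type | Pros | Cons |
| --- | --- | --- | --- |
| Box Plot / IQR | Statistical | Simple, visual, interpretable | Symmetric fences can over-flag the long tail of skewed data |
| Z-Score | Statistical | Works well for normal distributions | Mean and standard deviation are themselves distorted by outliers |
| Modified Z-Score | Statistical | Robust with skewed data | Less well-known; undefined when the MAD is 0 |
| Isolation Forest | ML-based | Handles high-dimensional data well | Requires tuning the contamination parameter |
| DBSCAN | ML-based | No assumption on data distribution | Sensitive to the eps and min_samples parameters and to feature scaling |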

---

Summary

  • Outliers are extreme data points that can arise from errors or genuine variation.
  • They significantly impact mean, standard deviation, and model performance.
  • Visual methods (box plots, scatter plots) provide quick identification.
  • Statistical methods (IQR, Z-Score) are effective for univariate detection.
  • ML methods (Isolation Forest, DBSCAN) handle multivariate and high-dimensional outlier detection.
  • The handling strategy (remove, cap, transform, impute, or keep) depends on the context and the cause of the outlier.