Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

1.6 Data Quality & Outliers

Lesson 6 of 32 in the free Data Visualisation and Analytics notes on Siksha Sarovar, written by Rohit Jangra.

Data Quality: Missing Values & Outliers

1. Missing Values

Handling missing data is critical because many statistical methods and machine learning algorithms either reject incomplete inputs outright or silently drop the affected rows, which can bias the results.

Study Deep: MCAR, MAR, and MNAR

Understanding why data is missing is more important than knowing how to fill it.

  1. MCAR (Missing Completely at Random): No pattern. (e.g., A sensor battery died). Solution: Delete or Mean Impute.
  2. MAR (Missing at Random): Pattern depends on other observed data. (e.g., Younger users skip the 'Annual Income' field). Solution: Impute using Age.
  3. MNAR (Missing Not at Random): Pattern depends on the missing value itself. (e.g., Users with very high income skip the field to protect privacy). Solution: Requires advanced domain modeling.

| Type | Abbreviation | Definition | Example | Implication |
| --- | --- | --- | --- | --- |
| Missing Completely at Random | MCAR | No pattern: missingness is unrelated to any variable | A respondent accidentally skipped a question | Safe to delete rows (no bias introduced) |
| Missing at Random | MAR | Missingness depends on observed variables, not the missing value itself | Women are less likely to report age, but this is related to gender (observed), not age itself | Use imputation based on observed data |
| Missing Not at Random | MNAR | Missingness depends on the missing value itself | High-income earners skip the "Salary" field because their salary is high | Most difficult; may need domain-specific models |
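
One practical way to look for a MAR-style pattern is to compare the missing rate of a column across groups of another, fully observed column. The sketch below uses a small hypothetical survey (the column names, values and the age cut-off of 30 are all made up for illustration). A large gap between groups suggests the data is not MCAR; it cannot, on its own, rule out MNAR, because that depends on values we never see.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing mostly for younger respondents
survey = pd.DataFrame({
    "age":    [19, 21, 22, 24, 35, 41, 48, 52],
    "income": [np.nan, np.nan, 18000, np.nan, 52000, 61000, np.nan, 75000],
})

# Flag missing income, then compare the missing rate across age groups
survey["income_missing"] = survey["income"].isna()
survey["age_group"] = np.where(survey["age"] < 30, "under_30", "30_plus")

missing_rate = survey.groupby("age_group")["income_missing"].mean()
print(missing_rate)  # under_30: 0.75, 30_plus: 0.25
```

Here three of the four respondents under 30 skipped the income field, against one of four in the older group, which is the kind of dependence on an observed variable that the MAR row describes.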

Treatment Methods — Decision Guide:

| Method | Description | When to Use | Pros | Cons |
| --- | --- | --- | --- | --- |
| Listwise Deletion | Remove entire rows with any missing data | Missing data < 5%, MCAR | Simple, preserves relationships | Loses data, biased if not MCAR |
| Pairwise Deletion | Use available data for each specific analysis | Moderate missingness | Maximizes available data | Inconsistent sample sizes |
| Mean Imputation | Fill with column average | Normal numerical data, MCAR | Simple, fast | Reduces variance, distorts distribution |
| Median Imputation | Fill with column median | Skewed numerical data with outliers | Robust to outliers | Still distorts distribution |
| Mode Imputation | Fill with most frequent value | Categorical data | Works for non-numeric data | Can overrepresent one category |
| Forward/Backward Fill | Fill with the previous value (forward fill) or the next value (backward fill) | Time-series data | Maintains temporal patterns | Can propagate errors |
| KNN Imputation | Use K-Nearest Neighbors to predict missing value | Complex patterns, sufficient data | Considers relationships between features | Computationally expensive |
| Multiple Imputation | Create several plausible imputed datasets, analyze each, and pool the results | Research / clinical data | Most statistically rigorous | Complex to implement |
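
Two rows of the guide can be sketched quickly in pandas: forward/backward fill for time-series data, and a group-wise median fill as one simple way of imputing from an observed column (the MAR case). All data and column names below are hypothetical, and the group-wise median is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with gaps
temps = pd.Series(
    [21.0, np.nan, np.nan, 24.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = temps.ffill()   # carry the previous observation forward
backward = temps.bfill()  # pull the next observation backward

print(forward.tolist())   # [21.0, 21.0, 21.0, 24.0, 24.0]
print(backward.tolist())  # [21.0, 24.0, 24.0, 24.0, nan]

# Hypothetical incomes: fill each gap with the median of its own age group
people = pd.DataFrame({
    "age_group": ["young", "young", "young", "senior", "senior", "senior"],
    "income":    [20000, np.nan, 24000, 60000, 70000, np.nan],
})
group_median = people.groupby("age_group")["income"].transform("median")
people["income_filled"] = people["income"].fillna(group_median)

print(people["income_filled"].tolist())
# [20000.0, 22000.0, 24000.0, 60000.0, 70000.0, 65000.0]
```

Note that backward fill leaves the last value missing because there is no later observation to copy from.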

2. Outliers

Formal Definition: An outlier is a data observation that lies at an abnormal distance from other values in the sample. Statistically, it is a point that falls outside the expected range of the data distribution.

Example: Salaries: [40k, 42k, 45k, 1M, 43k] -> 1M is an outlier.

Types of Outliers:

  • Point Outlier: A single data point far from the rest (e.g., Age = 200 in a human dataset).
  • Contextual Outlier: Abnormal in a specific context (e.g., 30°C is normal in summer, outlier in winter).
  • Collective Outlier: A subset of data points that is anomalous as a group (e.g., sudden traffic spike on a server).
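
The difference between a point outlier and a contextual outlier can be illustrated by applying the same 1.5 × IQR rule twice: once on the whole column and once within each context. This is a minimal sketch on hypothetical temperature readings, with the season column standing in for the context.

```python
import pandas as pd

# Hypothetical readings: 30 °C appears once in winter
readings = pd.DataFrame({
    "season": ["summer"] * 5 + ["winter"] * 5,
    "temp_c": [29.0, 30.0, 31.0, 32.0, 30.0, 4.0, 5.0, 6.0, 30.0, 5.0],
})

def outside_fences(values, q1, q3):
    """Return True where a value lies outside the 1.5 * IQR fences."""
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Global check: one set of quartiles for the whole column
temp = readings["temp_c"]
readings["global_outlier"] = outside_fences(
    temp, temp.quantile(0.25), temp.quantile(0.75)
)

# Contextual check: separate quartiles for each season
by_season = readings.groupby("season")["temp_c"]
q1 = by_season.transform(lambda s: s.quantile(0.25))
q3 = by_season.transform(lambda s: s.quantile(0.75))
readings["contextual_outlier"] = outside_fences(temp, q1, q3)

print(readings[readings["contextual_outlier"]])
```

The global check flags nothing, because 30 °C is well inside the overall range. The per-season check flags only the 30 °C winter reading.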

Detection Methods:

| Method | Formula / Rule | Assumption | Best For |
| --- | --- | --- | --- |
| Z-Score | Z = (X - μ) / σ; outlier if Z < -3 or Z > 3 | Data follows Normal Distribution | Normally distributed data |
| IQR | IQR = Q3 - Q1; outlier if X < Q1 - 1.5 × IQR or X > Q3 + 1.5 × IQR | None (non-parametric) | Skewed data, general-purpose |
| Modified Z-Score | Uses Median instead of Mean for robustness | None | Data with existing outliers |
| Isolation Forest | ML-based anomaly detection | None | High-dimensional data, complex patterns |
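
The Modified Z-Score row can be made concrete with a short NumPy sketch. It uses one widely used formulation, which is an assumption here since the notes do not give a formula: the mean is replaced by the median, the standard deviation by the median absolute deviation (MAD), the score is scaled by 0.6745, and values with an absolute score above 3.5 are flagged. The sketch also assumes the MAD is not zero.

```python
import numpy as np

data = np.array([10, 15, 18, 20, 22, 25, 100], dtype=float)

median = np.median(data)
mad = np.median(np.abs(data - median))  # median absolute deviation

# 0.6745 rescales the MAD so the score is comparable to a standard
# Z-score for normally distributed data; 3.5 is a conventional cutoff
modified_z = 0.6745 * (data - median) / mad
outliers = data[np.abs(modified_z) > 3.5]
print("Modified Z-score outliers:", outliers)

# Standard Z-score on the same data, for comparison
z = (data - data.mean()) / data.std()
print("Largest absolute Z-score:", np.abs(z).max())
```

On this dataset the modified score flags 100, while the standard Z-score of 100 stays below 3 because the extreme value itself inflates the mean and standard deviation.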

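Worked Example (IQR Method): Data: [10, 15, 18, 20, 22, 25, 100]

  • Median = 20, which splits the data into a lower half [10, 15, 18] and an upper half [22, 25, 100]
  • Q1 (25th percentile) = median of the lower half = 15
  • Q3 (75th percentile) = median of the upper half = 25
  • IQR = 25 - 15 = 10
  • Lower Bound = 15 - 1.5(10) = 0
  • Upper Bound = 25 + 1.5(10) = 40
  • 100 > 40 → 100 is an outlier.

Note: software that interpolates between data points (the default in pandas and NumPy) gives Q1 = 16.5 and Q3 = 23.5 for this data, so the bounds become 6 and 34. The value 100 is an outlier under either convention.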
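
The worked example can be checked in Python. The sketch below computes the quartiles under two conventions: the standard library's `statistics.quantiles` with its default "exclusive" method, which gives Q1 = 15 and Q3 = 25 for this dataset, and NumPy's default linear interpolation. The helper name `fences` is made up for this example.

```python
import statistics
import numpy as np

data = [10, 15, 18, 20, 22, 25, 100]

def fences(q1, q3):
    """Return the (lower, upper) 1.5 * IQR outlier fences."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Convention A: "exclusive" method (default of statistics.quantiles)
q1_a, _, q3_a = statistics.quantiles(data, n=4)

# Convention B: linear interpolation (default in NumPy and pandas)
q1_b, q3_b = np.percentile(data, [25, 75])

for label, q1, q3 in [("exclusive", q1_a, q3_a), ("linear", q1_b, q3_b)]:
    lower, upper = fences(q1, q3)
    flagged = [x for x in data if x < lower or x > upper]
    print(label, "Q1:", q1, "Q3:", q3, "bounds:", lower, upper, "outliers:", flagged)
```

Both conventions flag only 100, but the bounds differ (0 and 40 versus 6 and 34), so it is worth stating which quartile method was used when reporting results.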

Treatment Decision Framework:

| Action | When to Apply | Example |
| --- | --- | --- |
| Remove | Data entry error or impossible value | Age = -5, Temperature = 999°C |
| Cap/Floor (Winsorization) | Replace extreme values with a threshold (e.g., 99th percentile) | Capping salaries at 99th percentile for fair comparison |
| Transform | Apply log or square root transformation to reduce impact | Log-transforming highly skewed income data |
| Keep | The outlier represents a genuine, meaningful rare event | Fraud detection, rare disease identification |
| Separate Analysis | Analyze outliers as a distinct group | VIP customers with abnormally high spending |
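
The Cap/Floor and Transform rows can be sketched as follows. The salary figures are hypothetical (99 ordinary values plus one extreme value), and the 99th percentile is used as the cap only because the table mentions it as an example threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: 99 ordinary values plus one extreme value
salaries = pd.Series(
    [30_000 + 500 * i for i in range(99)] + [1_000_000], dtype=float
)

# Cap the upper tail at the 99th percentile (one-sided winsorization)
cap = salaries.quantile(0.99)
capped = salaries.clip(upper=cap)

# Log transform: compresses large values while keeping their order
logged = np.log(salaries)

print("Cap value:", cap)
print("Max before / after capping:", salaries.max(), capped.max())
print("Max-to-median ratio, raw:", salaries.max() / salaries.median())
print("Max-to-median ratio, log:", logged.max() / logged.median())
```

Capping changes only the single extreme value, while the log transform rescales every value. Which one is appropriate depends on whether the downstream analysis needs the original units.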

3. Python Code: Handling Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age':    [25, 30, np.nan, 22, 35, np.nan],
    'Salary': [50000, 60000, 45000, np.nan, 70000, 55000],
    'Gender': ['M', 'F', 'M', np.nan, 'F', 'M']
})

# 1. Detect missing values
print(df.isnull().sum())          # Count per column
print(df.isnull().mean() * 100)   # Percentage per column

# 2. Drop rows where more than 50% of the values are missing
#    (thresh = minimum number of non-missing values a row needs to be kept)
min_non_missing = (len(df.columns) + 1) // 2
df_clean = df.dropna(thresh=min_non_missing)

# 3. Mean imputation (numeric columns)
#    Assign the result back: fillna(inplace=True) on a selected column
#    may not update df in recent pandas versions
df['Age'] = df['Age'].fillna(df['Age'].mean())

# 4. Median imputation (robust to outliers)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# 5. Mode imputation (categorical columns)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

4. Python Code: Outlier Detection with IQR

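import numpy as np
import pandas as pd
from scipy import stats

data = pd.Series([10, 15, 18, 20, 22, 25, 100])

# IQR Method
# pandas interpolates linearly between data points, so these quartiles differ
# slightly from the median-of-halves values (15 and 25) found by hand
Q1 = data.quantile(0.25)  # 16.5
Q3 = data.quantile(0.75)  # 23.5
IQR = Q3 - Q1             # 7.0

lower = Q1 - 1.5 * IQR   # 6.0
upper = Q3 + 1.5 * IQR   # 34.0

outliers = data[(data < lower) | (data > upper)]
print("Outliers:", outliers.values)   # [100]

# Z-Score Method
# The extreme value inflates the mean and standard deviation, so the Z-score
# of 100 is only about 2.4 and nothing is flagged at the |Z| > 3 threshold
z_scores = stats.zscore(data)
outliers_z = data[np.abs(z_scores) > 3]
print("Z-Score Outliers:", outliers_z.values)  # empty

# Capping / Winsorization (replace outliers at threshold)
data_capped = data.clip(lower=lower, upper=upper)
print("After Capping:", data_capped.values)  # 100 → 34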

5. Exam-Ready Summary

| Concept | Formula | Decision Rule |
| --- | --- | --- |
| Z-Score Outlier | Z = (X - μ) / σ | Outlier if Z < -3 or Z > 3 |
| IQR Lower Bound | Q1 − 1.5 × IQR | Outlier if X < Lower Bound |
| IQR Upper Bound | Q3 + 1.5 × IQR | Outlier if X > Upper Bound |
| MCAR | No pattern to missingness | Safe to delete rows |
| MAR | Missingness depends on other columns | Impute using those columns |
| MNAR | Missingness depends on hidden value | Hardest; domain knowledge needed |