Handling Missing Values

Lesson 33 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Missing values are one of the most pervasive data quality issues in real-world datasets. How you handle them can significantly impact the accuracy and reliability of your analysis and models.

Formal Definition

A missing value (also called a null value, NaN, or NA) is a data point for which no value is stored for a given observation and variable. In Python and pandas, missing values are typically represented as NaN (Not a Number) or None; pandas also uses NaT for missing datetimes and pd.NA for its nullable data types.

---

Why Values Go Missing

| Cause | Description | Example |
| --- | --- | --- |
| Human Error | Manual data entry mistakes | Forgetting to fill a form field |
| System Failure | Sensor malfunction or software bugs | Temperature sensor offline for 2 hours |
| Survey Design | Respondent skips optional questions | "Prefer not to answer" on income |
| Data Merging | Joining datasets with mismatched keys | Left join creates NaN for unmatched rows |
| Privacy Restrictions | Sensitive data intentionally withheld | Medical records with redacted fields |
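
The "Data Merging" cause is easy to reproduce. The sketch below uses two small hypothetical tables (the column names and values are made up for illustration) and shows a left join producing NaN for the customer without a matching order.

```python
import pandas as pd

# Hypothetical tables: customer 3 has no matching order.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
})
orders = pd.DataFrame({
    "customer_id": [1, 2],
    "amount": [250.0, 400.0],
})

# A left join keeps every customer; unmatched rows get NaN in the order columns.
# indicator=True adds a "_merge" column that records where each row came from.
merged = customers.merge(orders, on="customer_id", how="left", indicator=True)

print(merged)
print("customers without an order:", merged["amount"].isnull().sum())
```

Checking the `_merge` column right after a join is one way to tell merge-created gaps apart from values that were already missing in the source tables.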

---

Types of Missing Data

Understanding why data is missing is crucial for choosing the right strategy:

| Type | Full Name | Description | Example |
| --- | --- | --- | --- |
| MCAR | Missing Completely at Random | Missingness has no relationship to any variable | Random sensor glitch |
| MAR | Missing at Random | Missingness depends on observed variables but not the missing value itself | Men less likely to report weight |
| MNAR | Missing Not at Random | Missingness depends on the missing value itself | High-income people refusing to report income |

Why This Matters:

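  • MCAR: Deleting rows introduces no bias, although it still reduces the sample size.
  • MAR: Imputation that uses the related observed variables is appropriate.
  • MNAR: The most challenging case; it requires domain knowledge or models that describe the missingness mechanism itself.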
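
To see why the mechanism matters, the following sketch simulates a synthetic income column and deletes values under two mechanisms. Everything here is an illustrative assumption: the lognormal incomes, the 30% missing rate, and the deliberately extreme MNAR rule in which the top earners never report.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=0.5, size=100_000)  # synthetic incomes

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(income.size) < 0.30

# MNAR: missingness depends on the value itself.
# In this extreme sketch, the top 30% of earners never report.
mnar_mask = income > np.quantile(income, 0.70)

true_mean = income.mean()
mcar_mean = income[~mcar_mask].mean()  # mean after deleting the MCAR rows
mnar_mean = income[~mnar_mask].mean()  # mean after deleting the MNAR rows

print(f"true mean          : {true_mean:,.0f}")
print(f"after MCAR deletion: {mcar_mean:,.0f}")
print(f"after MNAR deletion: {mnar_mean:,.0f}")
```

Under MCAR the estimate stays close to the true mean, while under MNAR the remaining rows systematically understate it, and no amount of extra data collected the same way would fix that.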

---

Detecting Missing Values

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")

# Total missing values per column
print(df.isnull().sum())

# Percentage of missing values per column
print((df.isnull().sum() / len(df)) * 100)

# Heatmap visualization of missing values
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap="viridis")
plt.title("Missing Value Heatmap")
plt.show()
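
The same checks can be tried without a data.csv file. The sketch below builds a small hypothetical frame (column names and values are invented for illustration) and combines the count and percentage into one report sorted by the worst column first.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame so the checks can be run without data.csv.
df_demo = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, 62000, np.nan, np.nan],
    "city": ["Delhi", None, "Rohtak", "Hisar"],
})

# isnull().mean() is the fraction of missing entries per column.
report = pd.DataFrame({
    "missing": df_demo.isnull().sum(),
    "percent": df_demo.isnull().mean() * 100,
}).sort_values("percent", ascending=False)

print(report)
```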

---

Strategies for Handling Missing Values

1. Deletion Methods

a) Listwise Deletion (Dropping Rows)

Remove entire rows that contain any missing value.

# Drop all rows with any NaN
df_clean = df.dropna()

# Drop rows where specific columns have NaN
df_clean = df.dropna(subset=["age", "income"])

When to use: When the percentage of missing data is very small (< 5%) and data is MCAR.
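
Listwise deletion can remove far more rows than the per-column missing rates suggest, because one gap anywhere drops the whole row. A minimal sketch on a hypothetical five-row frame, where each column is only 20% missing:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, 52],
    "income": [50000, 62000, np.nan, 48000, 75000],
    "city":   ["Delhi", "Hisar", "Rohtak", None, "Karnal"],
})

all_complete = df_demo.dropna()                          # rows with no NaN at all
key_complete = df_demo.dropna(subset=["age", "income"])  # only these columns must be present
mostly_complete = df_demo.dropna(thresh=2)               # rows with at least 2 non-null values

for name, frame in [
    ("dropna()", all_complete),
    ("dropna(subset=[...])", key_complete),
    ("dropna(thresh=2)", mostly_complete),
]:
    lost = 100 * (1 - len(frame) / len(df_demo))
    print(f"{name:<22} keeps {len(frame)} rows, loses {lost:.0f}%")
```

Comparing row counts before and after is a cheap check that the "< 5%" guideline still holds at the row level, not just per column.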

b) Column Deletion (Dropping Columns)

Remove entire columns that have too many missing values.

# Drop columns with more than 50% missing values
threshold = 0.5
df_clean = df.loc[:, df.isnull().mean() < threshold]

When to use: When a column has >50% missing values and is not critical for analysis.

---

2. Imputation Methods (Filling Missing Values)

a) Mean/Median/Mode Imputation

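# Assign the result back instead of using inplace=True on a column selection;
# recent pandas versions warn about that pattern and it may not update df.

# Fill with mean (for normally distributed numerical data)
df["age"] = df["age"].fillna(df["age"].mean())

# Fill with median (for skewed numerical data; more robust to outliers)
df["income"] = df["income"].fillna(df["income"].median())

# Fill with mode (for categorical data); mode() returns a Series, so take its first value
df["city"] = df["city"].fillna(df["city"].mode()[0])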
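
One side effect of mean imputation is worth seeing in numbers: the mean is preserved, but the spread shrinks because every filled value sits exactly at the centre. The sketch below uses a synthetic, normally distributed age column with 30% of values hidden at random; both choices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
age = pd.Series(rng.normal(loc=35, scale=10, size=1000))  # synthetic ages

# Hide 30% of the values completely at random.
observed = age.mask(rng.random(age.size) < 0.30)

# Mean imputation: every gap receives the same value.
filled = observed.fillna(observed.mean())

print(f"std of observed values : {observed.std():.2f}")
print(f"std after mean filling : {filled.std():.2f}")
print(f"mean before / after    : {observed.mean():.2f} / {filled.mean():.2f}")
```

The reduced standard deviation also weakens correlations with other columns, which is one reason to prefer methods that use other features when the missing share is large.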

b) Forward Fill and Backward Fill (for time series)

# fillna(method=...) is deprecated in recent pandas; call ffill() / bfill() directly.

# Forward fill: carry the last known value forward
df["temperature"] = df["temperature"].ffill()

# Backward fill: use the next known value instead
df["temperature"] = df["temperature"].bfill()
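
A small worked example makes the difference between the two fills concrete. The readings below are hypothetical; the `limit` argument shown at the end caps how many consecutive gaps are filled, which helps avoid stretching one stale reading across a long outage.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with a three-hour sensor outage.
idx = pd.date_range("2024-01-01 00:00", periods=6, freq="60min")
temperature = pd.Series([20.0, np.nan, np.nan, np.nan, 24.0, 25.0], index=idx)

forward = temperature.ffill()        # 20, 20, 20, 20, 24, 25
backward = temperature.bfill()       # 20, 24, 24, 24, 24, 25
capped = temperature.ffill(limit=1)  # 20, 20, NaN, NaN, 24, 25

print(pd.DataFrame({
    "raw": temperature,
    "ffill": forward,
    "bfill": backward,
    "ffill(limit=1)": capped,
}))
```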

c) Constant Value Imputation

# Fill with a constant value (assign back rather than using inplace=True)
df["status"] = df["status"].fillna("Unknown")
df["score"] = df["score"].fillna(0)
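
Filling a numeric column with 0 makes a missing score indistinguishable from a genuine score of 0. One common precaution, sketched here on a hypothetical frame with an invented `score_missing` flag column, is to record the gaps before filling them.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"score": [80.0, np.nan, 0.0, 65.0]})

# Record which rows were missing BEFORE filling,
# so a real 0 and a filled-in 0 remain distinguishable.
df_demo["score_missing"] = df_demo["score"].isnull()
df_demo["score"] = df_demo["score"].fillna(0)

print(df_demo)
```

A model can then use the flag as a feature of its own, which is useful when the fact that a value is missing carries information.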

d) K-Nearest Neighbors (KNN) Imputation

Uses the values of similar records to predict missing values.

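from sklearn.impute import KNNImputer

# KNNImputer works on numeric data only, so select the numeric columns once
numeric = df.select_dtypes(include=[np.number])

imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,  # keep the original row labels
)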
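
How the neighbours produce a value is easier to follow on a tiny array. In this sketch (the numbers are made up), the second row is missing its second feature; with n_neighbors=2 the two closest rows on the first feature are [1, 2] and [3, 6], so the gap is filled with the average of 2 and 6.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],   # value to impute
    [3.0, 6.0],
    [8.0, 16.0],     # far away on the first feature, so not a neighbour
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[1])  # [2. 4.]
```

Distances are computed on the raw feature values, so when columns have very different scales it is usually worth scaling them first; otherwise the largest-scale column decides who counts as a neighbour.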

e) Iterative Imputation (MICE - Multiple Imputation by Chained Equations)

# This import must come first: it enables the IterativeImputer import below
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

numeric = df.select_dtypes(include=[np.number])

imputer = IterativeImputer(max_iter=10, random_state=42)
df_imputed = pd.DataFrame(
    imputer.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,  # keep the original row labels
)
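
A single call to IterativeImputer returns one completed dataset, whereas the "multiple" in MICE refers to producing several. A rough sketch of that idea, assuming synthetic data that follows y ≈ 2x, is to set sample_posterior=True and repeat the imputation with different seeds. Averaging the draws, as done here, is a simplification for illustration; a full multiple-imputation analysis would fit the model on each completed dataset and pool those results.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x = np.arange(1.0, 21.0)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)  # synthetic, roughly y = 2x
X = np.column_stack([x, y])
X[9, 1] = np.nan                                   # hide y where x = 10

# Each seed gives one plausible completed dataset.
draws = []
for seed in range(5):
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    draws.append(imputer.fit_transform(X)[9, 1])

pooled = float(np.mean(draws))
spread = float(np.std(draws))
print(f"imputed draws: {np.round(draws, 2)}")
print(f"pooled value : {pooled:.2f} (spread {spread:.2f})")
```

The spread across draws gives a sense of how uncertain the imputed value is, which a single deterministic fill hides.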

---

Comparison of Imputation Strategies

| Method | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Deletion | Small % of missing, MCAR | Simple, no bias if MCAR | Loses data, biased if not MCAR |
| Mean | Normally distributed numeric data | Easy to implement | Distorts variance, sensitive to outliers |
| Median | Skewed numeric data | Robust to outliers | Ignores feature relationships |
| Mode | Categorical data | Simple, preserves categories | Over-represents dominant category |
| Forward/Backward Fill | Time-series data | Preserves temporal patterns | Not suitable for non-sequential data |
| KNN | Moderately missing data | Uses inter-feature relationships | Computationally expensive for large data |
| MICE | Complex multivariate data | Most sophisticated, handles MAR | Slow, complex to tune |

---

Summary

  • Missing values arise from human error, system failures, survey design, data merging, and privacy restrictions.
  • Understanding the type of missingness (MCAR, MAR, MNAR) guides the correct approach.
  • Deletion is simple but can lose information; imputation preserves data but introduces assumptions.
  • Advanced methods like KNN and MICE use relationships between features for more accurate imputation.
  • Always validate the impact of your chosen strategy on downstream analysis.
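
One way to act on the last point is to hide some values you actually know, impute them, and measure the error. The sketch below does this for two constant fills on a synthetic right-skewed column; the lognormal data, the 20% masking rate, and the choice of mean absolute error as the metric are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "income" column whose true values are known.
truth = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=5000))

# Hide 20% of the known values so the imputation error can be measured.
hidden = rng.random(truth.size) < 0.20
observed = truth.mask(hidden)

def mae(fill_value):
    """Mean absolute error on the hidden entries for a constant fill."""
    return float((truth[hidden] - fill_value).abs().mean())

results = {
    "mean": mae(observed.mean()),
    "median": mae(observed.median()),
}

for method, error in results.items():
    print(f"{method:<6} fill -> MAE {error:,.0f}")
```

On skewed data like this, the median fill gives the lower absolute error, in line with the comparison table; the same masking approach can be used to compare any of the strategies on your own columns.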