Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Privacy Policy | Terms of Service | Contact Siksha Sarovar | About Siksha Sarovar

v4.0.9 · PWA
Siksha Sarovar logo
Siksha Sarovar
Your Learning Universe

Siksha Sarovar is a free e-learning platform for coding courses, BCA university notes and competitive exam preparation. Optional Google sign-in saves your learning progress across devices.

Initializing knowledge base…
Compiling modules 0%

Data Cleaning

Lesson 32 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Data Cleaning

Data Cleaning (also called Data Cleansing or Data Scrubbing) is the process of identifying and correcting (or removing) errors, inconsistencies, and inaccuracies in a dataset to improve its quality. It is the first and most critical step in data preparation.

Formal Definition

Data Cleaning is the systematic process of detecting and resolving corrupt, inaccurate, irrelevant, or incomplete records within a dataset. The goal is to produce a dataset that is consistent, accurate, and suitable for downstream analysis and modeling.

---

Why Data Cleaning Matters

  • Garbage In, Garbage Out (GIGO): If you build a model on dirty data, the predictions will be unreliable — regardless of how sophisticated the algorithm is.
  • Real-world data is messy: Surveys have typos, sensors malfunction, databases merge incorrectly, and users enter data inconsistently.
  • Business impact: A study by IBM estimated that poor data quality costs the US economy approximately $3.1 trillion annually.

---

Common Data Quality Issues

IssueDescriptionExample
Missing ValuesEmpty cells or NaN entriesCustomer age field left blank
Duplicate RecordsSame record appearing multiple timesSame order logged twice
Inconsistent FormattingSame data in different formats"Male", "M", "male", "MALE"
Incorrect Data TypesWrong type assigned to a columnAge stored as string "25" instead of integer 25
Typographical ErrorsMisspellings or data entry mistakes"Bangalroe" instead of "Bangalore"
Invalid ValuesValues outside logical rangeAge = -5, Temperature = 999°C
Structural ErrorsImproper column naming, mixed categoriesColumn named "col1" with no description
Irrelevant DataColumns not useful for analysisInternal system IDs in a customer analysis

---

Data Cleaning Workflow

Step 1: Load Data → df = pd.read_csv("data.csv")
Step 2: Inspect → df.info(), df.describe(), df.head()
Step 3: Handle Missing Values → fillna(), dropna()
Step 4: Remove Duplicates → df.drop_duplicates()
Step 5: Fix Data Types → df["col"] = df["col"].astype(int)
Step 6: Standardize Text → df["city"] = df["city"].str.lower().str.strip()
Step 7: Handle Outliers → IQR method, Z-score
Step 8: Validate → Final checks & assertions

---

Data Cleaning with Pandas — Practical Examples

1. Loading and Inspecting Data

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("students.csv")

# Quick inspection
print(df.shape)          # (rows, columns)
print(df.info())         # Data types, non-null counts
print(df.describe())     # Statistical summary
print(df.head(10))       # First 10 rows
print(df.isnull().sum()) # Missing values per column

2. Removing Duplicate Records

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Remove duplicates
df = df.drop_duplicates()

# Remove duplicates based on specific columns
df = df.drop_duplicates(subset=["name", "email"], keep="first")

3. Fixing Inconsistent Text Data

# Standardize text columns
df["city"] = df["city"].str.lower().str.strip()
df["gender"] = df["gender"].replace({
    "M": "Male", "m": "Male", "male": "Male",
    "F": "Female", "f": "Female", "female": "Female"
})

4. Correcting Data Types

# Convert string to datetime
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")

# Convert string to numeric
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Convert to category type for memory efficiency
df["department"] = df["department"].astype("category")

5. Removing Invalid Values

# Remove rows where age is negative or unrealistically high
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

# Remove rows where salary is negative
df = df[df["salary"] > 0]

6. Dropping Irrelevant Columns

# Drop columns not needed for analysis
df = df.drop(columns=["internal_id", "row_number", "unnamed_0"])

---

Data Quality Checklist

CheckMethodPandas Code
ShapeRow/column countdf.shape
Data TypesVerify correct typesdf.dtypes
Missing ValuesCount nullsdf.isnull().sum()
DuplicatesCount duplicatesdf.duplicated().sum()
Unique ValuesInspect categoriesdf["col"].value_counts()
Statistical RangeCheck min/maxdf.describe()
Sample RowsVisual inspectiondf.sample(10)

---

Summary

  • Data Cleaning is the most time-consuming but most critical step in data science.
  • Common issues include missing values, duplicates, inconsistent formatting, wrong types, and outliers.
  • Pandas provides a comprehensive toolkit for cleaning tabular data.
  • Always validate cleaned data before proceeding to analysis or modeling.