Data Cleaning
Data Cleaning (also called Data Cleansing or Data Scrubbing) is the process of identifying and correcting (or removing) errors, inconsistencies, and inaccuracies in a dataset to improve its quality. It is the first and most critical step in data preparation.
Formal Definition
Data Cleaning is the systematic process of detecting and resolving corrupt, inaccurate, irrelevant, or incomplete records within a dataset. The goal is to produce a dataset that is consistent, accurate, and suitable for downstream analysis and modeling.
---
Why Data Cleaning Matters
- Garbage In, Garbage Out (GIGO): If you build a model on dirty data, the predictions will be unreliable — regardless of how sophisticated the algorithm is.
- Real-world data is messy: Surveys have typos, sensors malfunction, databases merge incorrectly, and users enter data inconsistently.
- Business impact: A study by IBM estimated that poor data quality costs the US economy approximately $3.1 trillion annually.
---
Common Data Quality Issues
| Issue | Description | Example |
|---|---|---|
| Missing Values | Empty cells or NaN entries | Customer age field left blank |
| Duplicate Records | Same record appearing multiple times | Same order logged twice |
| Inconsistent Formatting | Same data in different formats | "Male", "M", "male", "MALE" |
| Incorrect Data Types | Wrong type assigned to a column | Age stored as string "25" instead of integer 25 |
| Typographical Errors | Misspellings or data entry mistakes | "Bangalroe" instead of "Bangalore" |
| Invalid Values | Values outside logical range | Age = -5, Temperature = 999°C |
| Structural Errors | Improper column naming, mixed categories | Column named "col1" with no description |
| Irrelevant Data | Columns not useful for analysis | Internal system IDs in a customer analysis |
---
Data Cleaning Workflow
Step 1: Load Data → df = pd.read_csv("data.csv")
Step 2: Inspect → df.info(), df.describe(), df.head()
Step 3: Handle Missing Values → fillna(), dropna()
Step 4: Remove Duplicates → df.drop_duplicates()
Step 5: Fix Data Types → df["col"] = df["col"].astype(int)
Step 6: Standardize Text → df["city"] = df["city"].str.lower().str.strip()
Step 7: Handle Outliers → IQR method, Z-score
Step 8: Validate → Final checks & assertions
---
Data Cleaning with Pandas — Practical Examples
1. Loading and Inspecting Data
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv("students.csv")
# Quick inspection
print(df.shape) # (rows, columns)
print(df.info()) # Data types, non-null counts
print(df.describe()) # Statistical summary
print(df.head(10)) # First 10 rows
print(df.isnull().sum()) # Missing values per column
2. Removing Duplicate Records
# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")
# Remove duplicates
df = df.drop_duplicates()
# Remove duplicates based on specific columns
df = df.drop_duplicates(subset=["name", "email"], keep="first")
3. Fixing Inconsistent Text Data
# Standardize text columns
df["city"] = df["city"].str.lower().str.strip()
df["gender"] = df["gender"].replace({
"M": "Male", "m": "Male", "male": "Male",
"F": "Female", "f": "Female", "female": "Female"
})
4. Correcting Data Types
# Convert string to datetime
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")
# Convert string to numeric
df["income"] = pd.to_numeric(df["income"], errors="coerce")
# Convert to category type for memory efficiency
df["department"] = df["department"].astype("category")
5. Removing Invalid Values
# Remove rows where age is negative or unrealistically high
df = df[(df["age"] >= 0) & (df["age"] <= 120)]
# Remove rows where salary is negative
df = df[df["salary"] > 0]
6. Dropping Irrelevant Columns
# Drop columns not needed for analysis
df = df.drop(columns=["internal_id", "row_number", "unnamed_0"])
---
Data Quality Checklist
| Check | Method | Pandas Code |
|---|---|---|
| Shape | Row/column count | df.shape |
| Data Types | Verify correct types | df.dtypes |
| Missing Values | Count nulls | df.isnull().sum() |
| Duplicates | Count duplicates | df.duplicated().sum() |
| Unique Values | Inspect categories | df["col"].value_counts() |
| Statistical Range | Check min/max | df.describe() |
| Sample Rows | Visual inspection | df.sample(10) |
---
Summary
- Data Cleaning is the most time-consuming but most critical step in data science.
- Common issues include missing values, duplicates, inconsistent formatting, wrong types, and outliers.
- Pandas provides a comprehensive toolkit for cleaning tabular data.
- Always validate cleaned data before proceeding to analysis or modeling.