Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Exploratory Data Analysis (EDA)

Lesson 37 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods. It is the detective work of data science — uncovering hidden patterns, spotting anomalies, testing assumptions, and generating hypotheses before any formal modeling begins.

Formal Definition

EDA is an approach to analyzing datasets that uses summary statistics and graphical representations to discover patterns, spot anomalies, formulate hypotheses, and check assumptions. John Tukey popularized the concept in his 1977 book "Exploratory Data Analysis."

---

Why EDA is Essential

  • Provides a deep understanding of the data before modeling.
  • Reveals data quality issues (missing values, outliers, inconsistencies).
  • Identifies relationships between variables.
  • Helps in feature selection — which variables matter most.
  • Prevents modeling mistakes — you cannot build a good model on data you do not understand.
  • Generates hypotheses that can be tested statistically.
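
As a minimal sketch of the data quality point above, the following counts missing values and duplicate rows with pandas. The table is a hypothetical toy example; its column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "department": ["HR", "IT", "IT", None, "IT"],
    "salary": [30000, 52000, 48000, 61000, 52000],
})

missing_per_column = df.isna().sum()         # missing values in each column
duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

Checks like these are cheap to run right after loading a file, before any plots are drawn.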

---

Types of EDA

| Type | Description | Tools |
|---|---|---|
| Univariate | Analyze one variable at a time | Histograms, Box Plots, Value Counts |
| Bivariate | Analyze the relationship between two variables | Scatter Plots, Correlation, Bar Charts |
| Multivariate | Analyze interactions among three or more variables | Pair Plots, Heatmaps, 3D Plots |

---

Step-by-Step EDA Workflow

Step 1: Data Overview

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data.csv")

# Basic info
print(df.shape)            # (rows, columns)
df.info()                  # Prints data types and non-null counts itself (returns None)
print(df.describe())       # Descriptive statistics
print(df.describe(include="object"))  # For categorical columns
print(df.head())
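
The snippets in this lesson read a file named data.csv and use columns such as age, salary, experience, department, and gender. If no such file is at hand, a synthetic stand-in can be generated. The sketch below is an assumption made for practice purposes, not the dataset the lesson was written against; every value is random and the salary formula is invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the data is repeatable
n = 200

experience = rng.integers(0, 21, size=n)             # years of experience, 0 to 20
age = 22 + experience + rng.integers(0, 6, size=n)   # age loosely tied to experience
salary = 30000 + 2500 * experience + rng.normal(0, 5000, size=n)  # invented formula

df = pd.DataFrame({
    "age": age,
    "experience": experience,
    "salary": salary.round(2),
    "department": rng.choice(["HR", "IT", "Sales", "Finance"], size=n),
    "gender": rng.choice(["Female", "Male"], size=n),
})

print(df.shape)
print(df.dtypes)
```

Calling df.to_csv("data.csv", index=False) afterwards would write the table to disk so that the read_csv line above works unchanged.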

---

Step 2: Univariate Analysis — Understanding Individual Variables

a) Numerical Variables

# Histogram — shows distribution shape
df["age"].hist(bins=30, edgecolor="black")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# Box Plot — shows spread, median, and outliers
sns.boxplot(x=df["salary"])
plt.title("Salary Box Plot")
plt.show()

# Summary statistics
print(df["age"].describe())
print(f"Skewness: {df['age'].skew():.2f}")
print(f"Kurtosis: {df['age'].kurtosis():.2f}")  # pandas reports excess kurtosis (normal distribution = 0)
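
By common convention, a box plot draws its whiskers out to 1.5 times the interquartile range (IQR) beyond the quartiles and marks anything further out as an outlier. The same rule can be applied numerically to list the flagged values. The sketch below uses invented salary figures (in thousands); the 1.5 multiplier is the usual default, not a requirement.

```python
import pandas as pd

# Invented salary values (in thousands), with one obvious extreme
salary = pd.Series([30, 32, 35, 36, 38, 40, 41, 43, 45, 120], name="salary")

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # values below this are flagged
upper = q3 + 1.5 * iqr   # values above this are flagged

outliers = salary[(salary < lower) | (salary > upper)]
print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
print(outliers)
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme still has to be judged from context.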

b) Categorical Variables

# Value counts
print(df["department"].value_counts())

# Bar chart
df["department"].value_counts().plot(kind="bar", color="teal", edgecolor="black")
plt.title("Department Distribution")
plt.ylabel("Count")
plt.show()

# Pie chart
df["gender"].value_counts().plot(kind="pie", autopct="%1.1f%%", startangle=90)
plt.title("Gender Distribution")
plt.ylabel("")
plt.show()
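
Raw counts are often easier to compare as proportions. As a small sketch on invented data, value_counts(normalize=True) returns each category's share, which also makes rare categories easy to spot. The 20% cutoff used here is an arbitrary illustration.

```python
import pandas as pd

# Invented department labels
department = pd.Series(["IT", "IT", "HR", "Sales", "IT", "HR"], name="department")

shares = department.value_counts(normalize=True)  # fractions that sum to 1
rare = shares[shares < 0.2].index.tolist()        # categories under an arbitrary 20% cutoff

print(shares.round(3))
print(f"Rare categories: {rare}")
```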

---

Step 3: Bivariate Analysis — Relationships Between Two Variables

a) Numerical vs Numerical

# Scatter plot
plt.scatter(df["experience"], df["salary"], alpha=0.5)
plt.xlabel("Experience (years)")
plt.ylabel("Salary")
plt.title("Experience vs Salary")
plt.show()

# Correlation coefficient
corr = df["experience"].corr(df["salary"])
print(f"Pearson Correlation: {corr:.3f}")
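
Pearson's coefficient measures linear association only, so a strong but curved relationship can score lower than expected. The sketch below, on invented data, computes r from its definition (covariance divided by the product of the standard deviations), checks it against Series.corr, and compares it with a rank-based (Spearman) coefficient obtained by applying Pearson to the ranks.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = x ** 3  # perfectly monotonic, but not linear

# Pearson r from its definition; pandas std() uses the sample formula (ddof=1)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
r_manual = cov_xy / (x.std() * y.std())

r_pandas = x.corr(y)            # Pearson is the default method
rho = x.rank().corr(y.rank())   # Spearman computed as Pearson on ranks

print(f"Pearson (manual): {r_manual:.4f}")
print(f"Pearson (pandas): {r_pandas:.4f}")
print(f"Spearman:         {rho:.4f}")
```

When the two coefficients differ noticeably, the scatter plot is worth a second look for curvature or outliers.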

b) Numerical vs Categorical

# Box plot by category
sns.boxplot(x="department", y="salary", data=df)
plt.title("Salary by Department")
plt.xticks(rotation=45)
plt.show()

# Violin plot — richer view of distribution
sns.violinplot(x="department", y="salary", data=df)
plt.title("Salary Distribution by Department")
plt.xticks(rotation=45)
plt.show()
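
The box and violin plots above have a numeric counterpart: a per-group summary table. As a sketch on invented data, groupby with agg reports the size, center, and spread of salary within each department.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({
    "department": ["HR", "HR", "IT", "IT", "IT", "Sales"],
    "salary": [30000, 34000, 50000, 58000, 54000, 40000],
})

summary = df.groupby("department")["salary"].agg(["count", "mean", "median", "std"])
print(summary)
```

A group with a single row, such as Sales here, has an undefined standard deviation (NaN), which is a reminder to read the count column before trusting the other statistics.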

c) Categorical vs Categorical

# Cross-tabulation
ct = pd.crosstab(df["department"], df["gender"])
print(ct)

# Stacked bar chart
ct.plot(kind="bar", stacked=True)
plt.title("Department by Gender")
plt.ylabel("Count")
plt.show()
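
Raw cross-tabulated counts are hard to compare when the groups differ in size. As a sketch on invented data, passing normalize="index" to pd.crosstab converts each row to proportions, so departments can be compared directly.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({
    "department": ["HR", "HR", "HR", "IT", "IT", "IT", "IT", "Sales"],
    "gender": ["Female", "Female", "Male", "Male", "Male", "Male", "Female", "Female"],
})

# Each row sums to 1: the gender split within a department
row_share = pd.crosstab(df["department"], df["gender"], normalize="index")
print(row_share.round(2))
```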

---

Step 4: Multivariate Analysis — Discovering Complex Patterns

a) Correlation Heatmap

# Correlation matrix for all numerical features
corr_matrix = df.select_dtypes(include=[np.number]).corr()

plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", center=0, fmt=".2f",
            linewidths=0.5, square=True)
plt.title("Feature Correlation Heatmap")
plt.tight_layout()
plt.show()
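
With many features a heatmap becomes hard to read, and it can help to list the strongly correlated pairs directly. The sketch below is one way to do that on synthetic data; the column names, the data-generating formula, and the 0.8 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
experience = rng.uniform(0, 20, n)

# Synthetic data: salary depends on experience, commute_km is unrelated by construction
df = pd.DataFrame({
    "experience": experience,
    "salary": 30000 + 2500 * experience + rng.normal(0, 2000, n),
    "commute_km": rng.uniform(1, 40, n),
})

corr_matrix = df.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(mask).stack()

# Filter on absolute value so strong negative correlations are caught too
strong = pairs[pairs.abs() > 0.8].sort_values(key=np.abs, ascending=False)
print(strong)
```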

b) Pair Plot

# Pair plot — scatter + distribution matrix for selected features
sns.pairplot(df[["age", "salary", "experience", "department"]], hue="department",
             diag_kind="kde")
plt.suptitle("Pair Plot", y=1.02)
plt.show()

c) Grouped Aggregation

# Average salary by department and gender
grouped = df.groupby(["department", "gender"])["salary"].mean().unstack()
grouped.plot(kind="bar", figsize=(10, 6))
plt.title("Average Salary by Department and Gender")
plt.ylabel("Average Salary")
plt.show()
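
The same two-way summary can also be built with pd.pivot_table, which can append overall averages when margins=True is passed. The following is a sketch on invented data, offered as an alternative to the groupby and unstack approach rather than a replacement for it.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({
    "department": ["HR", "HR", "IT", "IT", "IT", "IT"],
    "gender": ["Female", "Male", "Female", "Male", "Male", "Male"],
    "salary": [31000, 33000, 56000, 52000, 54000, 50000],
})

# margins=True adds an "All" row and column holding the overall means
table = pd.pivot_table(df, values="salary", index="department",
                       columns="gender", aggfunc="mean", margins=True)
print(table)
```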

---

Key EDA Visualizations Summary

| Plot Type | Best For | Library |
|---|---|---|
| Histogram | Distribution of a single numeric variable | Matplotlib / Seaborn |
| Box Plot | Spread, median, and outliers | Seaborn |
| Scatter Plot | Relationship between two numeric variables | Matplotlib |
| Bar Chart | Counts/frequencies of categorical variables | Matplotlib / Seaborn |
| Heatmap | Correlation between all numeric features | Seaborn |
| Pair Plot | Pairwise relationships in a dataset | Seaborn |
| Violin Plot | Distribution shape + box plot combined | Seaborn |
| Pie Chart | Proportions of categorical data | Matplotlib |
| Count Plot | Frequency of categories | Seaborn |
| KDE Plot | Smooth density estimate of distribution | Seaborn |

---

EDA Interpretation Guidelines

| Observation | Implication | Action |
|---|---|---|
| High correlation (\|r\| > 0.8) between features | Multicollinearity risk | Consider dropping or combining one of the correlated features |
| Highly skewed distribution | May violate model assumptions | Apply a log or Box-Cox transformation (positive values only) |
| Many outliers in box plot | Potential errors or genuine extremes | Investigate the cause, then correct, cap, or keep the values |
| Class imbalance in target | Model may be biased toward majority class | Apply SMOTE, undersampling, or class weights |
| Missing values pattern | May indicate systematic issues | Choose appropriate imputation strategy |
| Clear clusters in scatter plot | Natural groupings in data | Consider clustering algorithms |
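
To illustrate the skewness guideline from the table, the sketch below draws a right-skewed synthetic sample (a log-normal "income" column, invented for this example) and compares its skewness before and after a log transformation. np.log1p computes log(1 + x), which also tolerates zeros.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic, strongly right-skewed values (log-normal by construction)
income = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=2000), name="income")

skew_before = income.skew()
skew_after = np.log1p(income).skew()  # log(1 + x) transformation

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

A skewness near zero after the transformation suggests a roughly symmetric distribution; real data will rarely behave as cleanly as this constructed example.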

---

Summary

  • EDA is the first analytical step — understand your data before modeling.
  • Univariate analysis explores individual variables; bivariate explores relationships; multivariate reveals complex interactions.
  • Visualization is the primary tool of EDA — histograms, box plots, scatter plots, heatmaps, and pair plots.
  • EDA reveals data quality issues, suggests features, and informs model selection.
  • Every data science project should begin with thorough EDA — it saves time and prevents costly modeling errors downstream.