Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods. It is the detective work of data science — uncovering hidden patterns, spotting anomalies, testing assumptions, and generating hypotheses before any formal modeling begins.
Formal Definition
EDA is an approach to analyzing datasets that uses summary statistics and graphical representations to discover patterns, spot anomalies, formulate hypotheses, and check assumptions. The approach was popularized by John Tukey in his 1977 book "Exploratory Data Analysis."
---
Why EDA is Essential
- Provides a deep understanding of the data before modeling.
- Reveals data quality issues (missing values, outliers, inconsistencies).
- Identifies relationships between variables.
- Helps in feature selection — which variables matter most.
- Prevents modeling mistakes — you cannot build a good model on data you do not understand.
- Generates hypotheses that can be tested statistically.
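As a rough illustration of the data quality point above, the following sketch counts missing values and duplicated rows. The toy DataFrame and its column names are assumptions made up for this example, not data from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "salary": [50000, 64000, 58000, np.nan, 64000],
    "department": ["HR", "IT", "IT", None, "IT"],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
})
print(missing)

# Rows that are identical to an earlier row in every column
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")
```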
---
Types of EDA
| Type | Description | Tools |
|---|---|---|
| Univariate | Analyze one variable at a time | Histograms, Box Plots, Value Counts |
| Bivariate | Analyze the relationship between two variables | Scatter Plots, Correlation, Bar Charts |
| Multivariate | Analyze interactions among three or more variables | Pair Plots, Heatmaps, 3D Plots |
---
Step-by-Step EDA Workflow
Step 1: Data Overview
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("data.csv")
# Basic info
print(df.shape) # (rows, columns)
df.info() # Data types, non-null counts (info() prints directly and returns None)
print(df.describe()) # Descriptive statistics
print(df.describe(include="object")) # For categorical columns
print(df.head())
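The snippets in this workflow read a `data.csv` with columns such as `age`, `salary`, `experience`, `department`, and `gender`, but the file itself is not provided. One way to try the code anyway is to build a synthetic stand-in. The sketch below is only an assumption about what such a dataset might look like; the value ranges, category labels, and the linear salary rule are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the data is reproducible
n = 200

# Hypothetical generating rules, chosen only to give plausible-looking data
experience = rng.integers(0, 30, size=n)               # years of experience
age = 22 + experience + rng.integers(0, 10, size=n)    # loosely tied to experience
salary = 30000 + 2000 * experience + rng.normal(0, 5000, size=n)

df = pd.DataFrame({
    "age": age,
    "experience": experience,
    "salary": salary.round(2),
    "department": rng.choice(["HR", "IT", "Sales"], size=n),
    "gender": rng.choice(["F", "M"], size=n),
})

print(df.shape)   # (200, 5)
print(df.dtypes)
```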
---
Step 2: Univariate Analysis — Understanding Individual Variables
a) Numerical Variables
# Histogram — shows distribution shape
df["age"].hist(bins=30, edgecolor="black")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
# Box Plot — shows spread, median, and outliers
sns.boxplot(x=df["salary"])
plt.title("Salary Box Plot")
plt.show()
# Summary statistics
print(df["age"].describe())
print(f"Skewness: {df['age'].skew():.2f}") # 0 for a symmetric distribution
print(f"Kurtosis: {df['age'].kurtosis():.2f}") # Excess kurtosis: about 0 for a normal distribution
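The box plot above marks points as outliers using the common 1.5 × IQR convention. A minimal sketch of the same rule in pandas follows, assuming a small made-up `salary` series (in thousands) rather than the article's dataset.

```python
import pandas as pd

# Hypothetical salaries in thousands; 150 is the intended extreme value
salary = pd.Series([42, 45, 47, 50, 52, 55, 58, 60, 150], name="salary")

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = salary[(salary < lower) | (salary > upper)]
print(f"Fences: [{lower}, {upper}]")
print(outliers.tolist())  # [150]
```

Flagged points are candidates for investigation, not automatic deletions; some may be genuine extremes.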
b) Categorical Variables
# Value counts
print(df["department"].value_counts())
# Bar chart
df["department"].value_counts().plot(kind="bar", color="teal", edgecolor="black")
plt.title("Department Distribution")
plt.ylabel("Count")
plt.show()
# Pie chart
df["gender"].value_counts().plot(kind="pie", autopct="%1.1f%%", startangle=90)
plt.title("Gender Distribution")
plt.ylabel("")
plt.show()
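Raw counts are often easier to compare as proportions. The sketch below puts counts and shares side by side using `value_counts(normalize=True)`; the `department` values are invented for the example.

```python
import pandas as pd

# Hypothetical category values
department = pd.Series(
    ["IT", "IT", "HR", "Sales", "IT", "HR", "IT", "Sales"], name="department"
)

counts = department.value_counts()
shares = department.value_counts(normalize=True)  # fractions that sum to 1

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
print(f"Distinct categories: {department.nunique()}")
```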
---
Step 3: Bivariate Analysis — Relationships Between Two Variables
a) Numerical vs Numerical
# Scatter plot
plt.scatter(df["experience"], df["salary"], alpha=0.5)
plt.xlabel("Experience (years)")
plt.ylabel("Salary")
plt.title("Experience vs Salary")
plt.show()
# Correlation coefficient
corr = df["experience"].corr(df["salary"])
print(f"Pearson Correlation: {corr:.3f}")
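Pearson correlation only measures linear association, so a strong but curved relationship can be understated. A possible complement is a rank-based (Spearman) coefficient, computed here as Pearson correlation on the ranks. The numbers are made up so that salary rises monotonically but not linearly with experience.

```python
import pandas as pd

# Hypothetical data: salary always increases with experience, but not linearly
df = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5, 6],
    "salary": [30, 32, 35, 40, 60, 120],
})

# Pearson: strength of the linear relationship
pearson = df["experience"].corr(df["salary"])

# Spearman: Pearson correlation applied to the ranks of each variable
spearman = df["experience"].rank().corr(df["salary"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A gap between the two coefficients is a hint to look at the scatter plot for curvature or outliers.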
b) Numerical vs Categorical
# Box plot by category
sns.boxplot(x="department", y="salary", data=df)
plt.title("Salary by Department")
plt.xticks(rotation=45)
plt.show()
# Violin plot — richer view of distribution
sns.violinplot(x="department", y="salary", data=df)
plt.title("Salary Distribution by Department")
plt.xticks(rotation=45)
plt.show()
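The box and violin plots can be backed up with a numeric summary per category. This sketch uses `groupby` with several aggregations on a small invented table.

```python
import pandas as pd

# Hypothetical salaries per department
df = pd.DataFrame({
    "department": ["HR", "HR", "IT", "IT", "IT", "Sales"],
    "salary": [40000, 44000, 60000, 70000, 65000, 50000],
})

# One row per department: group size, centre, and spread
summary = df.groupby("department")["salary"].agg(["count", "mean", "median", "std"])
print(summary)
```

The `count` column matters here: a group with a single row (Sales in this toy data) has no defined standard deviation, and its mean says little.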
c) Categorical vs Categorical
# Cross-tabulation
ct = pd.crosstab(df["department"], df["gender"])
print(ct)
# Stacked bar chart
ct.plot(kind="bar", stacked=True)
plt.title("Department by Gender")
plt.ylabel("Count")
plt.show()
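When groups differ in size, raw cross-tab counts are hard to compare. A row-normalized cross-tabulation shows the share of each gender within each department instead. The data below is invented for illustration.

```python
import pandas as pd

# Hypothetical categorical data
df = pd.DataFrame({
    "department": ["HR", "HR", "HR", "HR", "IT", "IT", "IT", "IT"],
    "gender": ["F", "F", "F", "M", "F", "M", "M", "M"],
})

# normalize="index" makes each row sum to 1
row_share = pd.crosstab(df["department"], df["gender"], normalize="index")
print(row_share)
```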
---
Step 4: Multivariate Analysis — Discovering Complex Patterns
a) Correlation Heatmap
# Correlation matrix for all numerical features
corr_matrix = df.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", center=0, fmt=".2f",
            linewidths=0.5, square=True)
plt.title("Feature Correlation Heatmap")
plt.tight_layout()
plt.show()
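With many features a heatmap becomes hard to read, so it can help to list the strongly correlated pairs directly. The sketch below keeps the upper triangle of the absolute correlation matrix and filters on a threshold. The columns `a`, `b`, `c` and the 0.8 cutoff are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Hypothetical features: b is almost a linear copy of a, c is unrelated noise
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr_abs = df.corr().abs()

# Upper triangle only (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
pairs = corr_abs.where(mask).stack()

# Pairs whose absolute correlation exceeds the chosen threshold
high = pairs[pairs > 0.8]
print(high)
```

Taking the absolute value means strong negative correlations are flagged as well.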
b) Pair Plot
# Pair plot — scatter + distribution matrix for selected features
sns.pairplot(df[["age", "salary", "experience", "department"]], hue="department",
             diag_kind="kde")
plt.suptitle("Pair Plot", y=1.02)
plt.show()
c) Grouped Aggregation
# Average salary by department and gender
grouped = df.groupby(["department", "gender"])["salary"].mean().unstack()
grouped.plot(kind="bar", figsize=(10, 6))
plt.title("Average Salary by Department and Gender")
plt.ylabel("Average Salary")
plt.show()
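Group averages can mislead when some groups contain very few rows. One option is a pivot table that reports the mean alongside the group size. The table below is a sketch on invented data, not the article's dataset.

```python
import pandas as pd

# Hypothetical salaries by department and gender
df = pd.DataFrame({
    "department": ["HR", "HR", "IT", "IT", "IT", "IT"],
    "gender": ["F", "M", "F", "F", "M", "M"],
    "salary": [40000, 42000, 60000, 64000, 70000, 66000],
})

# Mean salary and number of rows for every department/gender combination
table = df.pivot_table(
    index="department",
    columns="gender",
    values="salary",
    aggfunc=["mean", "count"],
)
print(table)
```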
---
Key EDA Visualizations Summary
| Plot Type | Best For | Library |
|---|---|---|
| Histogram | Distribution of a single numeric variable | Matplotlib / Seaborn |
| Box Plot | Spread, median, and outliers | Seaborn |
| Scatter Plot | Relationship between two numeric variables | Matplotlib |
| Bar Chart | Counts/frequencies of categorical variables | Matplotlib / Seaborn |
| Heatmap | Correlation between all numeric features | Seaborn |
| Pair Plot | Pairwise relationships in a dataset | Seaborn |
| Violin Plot | Distribution shape + box plot combined | Seaborn |
| Pie Chart | Proportions of categorical data | Matplotlib |
| Count Plot | Frequency of categories | Seaborn |
| KDE Plot | Smooth density estimate of distribution | Seaborn |
---
EDA Interpretation Guidelines
| Observation | Implication | Action |
|---|---|---|
| High correlation (\|r\| > 0.8) between features | Multicollinearity risk | Consider dropping or combining one of the correlated features |
| Highly skewed distribution | May violate model assumptions | Apply a log or Box-Cox transformation (both require strictly positive values) |
| Many outliers in box plot | Potential errors or genuine extremes | Investigate and handle appropriately |
| Class imbalance in target | Model may be biased toward majority class | Apply SMOTE, undersampling, or class weights |
| Pattern in missing values | May indicate systematic issues | Choose appropriate imputation strategy |
| Clear clusters in scatter plot | Natural groupings in data | Consider clustering algorithms |
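To illustrate the skewness row of the table, the sketch below compares skewness before and after a log transform. It assumes a synthetic, strictly positive `income` variable drawn from a log-normal distribution, and uses `np.log1p`, which computes log(1 + x).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed variable (log-normal, strictly positive)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000), name="income")

skew_before = income.skew()
skew_after = np.log1p(income).skew()  # log(1 + x)

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```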
---
Summary
- EDA is the first analytical step — understand your data before modeling.
- Univariate analysis explores individual variables; bivariate explores relationships; multivariate reveals complex interactions.
- Visualization is the primary tool of EDA — histograms, box plots, scatter plots, heatmaps, and pair plots.
- EDA reveals data quality issues, suggests features, and informs model selection.
- Every data science project should begin with thorough EDA — it saves time and prevents costly modeling errors downstream.