Seaborn: Categorical Data & Aesthetics
1. Visualizing Categorical Data (catplot)
Use these plots when one variable is categorical (e.g., "Day of Week") and the other is numerical (e.g., "Total Bill"). In sns.catplot(), the kind argument selects the plot type.
| Plot Type | Description | When to Use | Strength |
|---|---|---|---|
| Bar Plot | Shows the Mean with confidence interval | Comparing averages between groups | Easy to interpret |
| Count Plot | Shows the Count of observations | Checking sample sizes | Simple frequency check |
| Box Plot | Shows Median, Quartiles, and Outliers | Robust comparison of distributions | Shows spread + outliers |
| Violin Plot | Combines Box Plot + KDE | Seeing distribution shape inside categories | Shows density + quartiles |
| Swarm Plot | Shows every single data point, no overlap | Small datasets, individual items | No information hidden |
| Strip Plot | Like Swarm but allows overlap (jittered) | Quick overview of individual values | Fast to render |
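As a rough sketch of how the table maps onto code, each plot type corresponds to a value of the kind argument of sns.catplot(). The DataFrame below is synthetic stand-in data (the name demo and its values are assumptions made for illustration), and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a tips-like table (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=80),
    "total_bill": rng.gamma(shape=4.0, scale=5.0, size=80),
})

# These kinds need both a categorical and a numerical variable
grids = {}
for kind in ["bar", "box", "violin", "swarm", "strip"]:
    grids[kind] = sns.catplot(data=demo, x="day", y="total_bill", kind=kind)

# A count plot only needs the categorical variable
grids["count"] = sns.catplot(data=demo, x="day", kind="count")
```

Each call returns a FacetGrid, so the same code can later be split into panels with col= or row= if needed.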
Study Deep: Box Plot vs. Violin Plot
While both show the distribution of data across categories:
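- Box Plot: Focuses on the median and quartiles (Q1, Median, Q3). By default the whiskers extend to the most extreme points within 1.5 × IQR of the box, and anything beyond them is drawn as an outlier. It's clean and efficient for comparing many categories at once.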
- Violin Plot: Adds a Kernel Density Estimation (KDE) to the box plot. This allows you to see the "shape" or "density" of the data. If your data is bimodal (has two peaks), a box plot will hide this, but a violin plot will reveal it clearly.
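The bimodal case can be illustrated with a small sketch. The data here is synthetic (two groups built to have similar medians, one with two peaks); the group names and parameters are assumptions chosen for the demonstration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)

# "bimodal": two well-separated peaks; "unimodal": one broad peak around the same center
bimodal = np.concatenate([rng.normal(10, 1, 200), rng.normal(20, 1, 200)])
unimodal = rng.normal(15, 5, 400)
df = pd.DataFrame({
    "group": ["bimodal"] * 400 + ["unimodal"] * 400,
    "value": np.concatenate([bimodal, unimodal]),
})

# Same data, two views side by side
fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
sns.boxplot(data=df, x="group", y="value", ax=ax_box)
sns.violinplot(data=df, x="group", y="value", ax=ax_violin)
ax_box.set_title("Box plot")
ax_violin.set_title("Violin plot")
fig.tight_layout()
```

In the box plot the two groups show similar medians and no sign of two peaks, while the violin for the bimodal group should show two distinct bulges.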
Code Example:
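import seaborn as sns
tips = sns.load_dataset("tips")  # seaborn's example dataset; downloaded on first use
# Box plot of Bill Amount by Day, split by Smoker status
sns.catplot(data=tips, x="day", y="total_bill", hue="smoker", kind="box")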
2. Categorical Plot Selection Guide
| Your Goal | Best Plot | Why |
|---|---|---|
| Compare means across groups | Bar Plot | Clear height comparison |
| See full distribution per group | Box Plot / Violin | Shows quartiles, outliers, shape |
| See every individual data point | Swarm / Strip | Nothing hidden |
| Count occurrences per category | Count Plot | Simple frequency |
| Compare means + see distribution | Violin + overlay Strip | Combines both views |
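The "Violin + overlay Strip" row can be sketched by drawing both plots on the same Axes. This uses the axes-level functions sns.violinplot() and sns.stripplot() rather than catplot, which is one possible approach; the data and the names demo and order are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (hypothetical values)
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=120),
    "total_bill": rng.gamma(shape=4.0, scale=5.0, size=120),
})
order = ["Thur", "Fri", "Sat", "Sun"]  # fix the category order for both layers

fig, ax = plt.subplots()
# Layer 1: the density shape, with the inner box removed so points stay visible
sns.violinplot(data=demo, x="day", y="total_bill", order=order,
               inner=None, color="lightgray", ax=ax)
# Layer 2: the individual observations, jittered on top
sns.stripplot(data=demo, x="day", y="total_bill", order=order,
              size=3, jitter=True, color="black", ax=ax)
```

Passing the same order to both calls keeps the points aligned with their violins.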
3. Visualizing Relationships: Lmplot
sns.lmplot() is a powerhouse. It draws a scatter plot AND fits a linear regression line with a 95% confidence interval shaded.
sns.lmplot(x="age", y="wage", data=df)
- Interpretation: If the shaded area is narrow, the regression line is estimated precisely; a wide band means the slope is uncertain.
- Use hue="category" to split by groups and compare regression lines.
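A minimal sketch of the hue split, assuming a hypothetical dataset in which wage grows with age at a different rate in two groups. The column sector and all numbers are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
n = 100

# Synthetic data: the "private" group has a steeper age-wage slope (assumed values)
age = rng.uniform(20, 60, size=2 * n)
sector = np.repeat(["public", "private"], n)
slope = np.where(sector == "private", 1.2, 0.6)
wage = 10 + slope * age + rng.normal(0, 5, size=2 * n)
df = pd.DataFrame({"age": age, "wage": wage, "sector": sector})

# One scatter + regression line + 95% confidence band per sector
g = sns.lmplot(data=df, x="age", y="wage", hue="sector", ci=95)
```

Two lines with clearly different slopes and non-overlapping bands suggest the relationship differs between groups.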
4. Heatmaps
Ideal for correlation matrices, confusion matrices, and any 2D matrix data:
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
# annot=True: Show numbers in cells
# cmap: Color palette
# center=0: Center colorbar at 0 (for diverging data)
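A fuller sketch of the correlation heatmap, using synthetic columns. It assumes pandas 1.5 or newer, where DataFrame.corr() accepts numeric_only=True; that flag keeps a text column from breaking the correlation. The column names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
base = rng.normal(size=200)
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.5, size=200),   # positively correlated with a
    "c": -base + rng.normal(scale=0.5, size=200),  # negatively correlated with a
    "label": rng.choice(["x", "y"], size=200),     # non-numeric column, excluded below
})

# Assumes pandas >= 1.5 for the numeric_only argument
corr = df.corr(numeric_only=True)

# vmin/vmax pin the color scale to the full correlation range [-1, 1]
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                 center=0, vmin=-1, vmax=1)
```

Fixing vmin and vmax makes colors comparable across different heatmaps, since the scale no longer depends on the values in one matrix.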
5. Aesthetics: Color Palettes
Seaborn ships with palettes designed for each kind of data, so the color scheme can reinforce the pattern you want to communicate.
| Palette Type | Example Argument | Use Case | Examples |
|---|---|---|---|
| Qualitative | palette="deep" | Distinct categories (Apple, Banana, Orange) | deep, pastel, bright, Set2 |
| Sequential | palette="viridis" | Low to High values (Income, Temperature) | viridis, rocket, Blues, YlOrRd |
| Diverging | palette="vlag" | Centered on zero (Profit/Loss, Vote shift) | vlag, coolwarm, icefire, RdBu |
Choosing a Palette:
- Categorical data → Qualitative (distinct, unrelated colors)
- Ordered data → Sequential (light-to-dark gradient)
- Data centered on a midpoint → Diverging (two contrasting colors meeting in the middle)
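To make the three palette types concrete, here is a short sketch that builds one of each with sns.color_palette(). The variable names are arbitrary, and as_cmap=True assumes seaborn 0.11 or newer.

```python
import seaborn as sns

# Qualitative: a list of distinct colors for unordered categories
qualitative = sns.color_palette("deep")

# Sequential: sample 5 ordered colors from a gradient
sequential = sns.color_palette("viridis", 5)

# Diverging: request a continuous colormap (useful for heatmaps centered on 0)
diverging = sns.color_palette("vlag", as_cmap=True)

print(len(qualitative), len(sequential))
```

The list forms can be passed as palette= to categorical plots, while the colormap form fits cmap= arguments such as the one in sns.heatmap().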
6. Styles and Contexts
- Styles: Control background and grid aesthetics.
  sns.set_style("whitegrid") (also darkgrid, white, dark, ticks).
- Context: Scale elements for different output media.
  sns.set_context("talk"): Large fonts/lines for presentations.
  sns.set_context("paper"): Smaller fonts for printed reports.
  sns.set_context("poster"): Very large for poster presentations.