Categorization vs. Segmentation
1. Categorization
Categorization is the process of assigning data points into predefined, manually specified groups based on explicit rules.
Study Deep: The K-Means Convergence
The most popular segmentation algorithm, K-Means, is an iterative process:
- Initialize: Pick $K$ random starting points (centroids).
- Assign: Every data point "joins" its nearest centroid.
- Update: Each centroid moves to the center of its new members.
- Repeat: This continues until the centroids stop moving.
- BCA Exam Tip: Always mention that K-Means is sensitive to the initial starting points and outliers!
1. Categorization
Formal Definition: Categorization is the process of assigning data points into predefined, manually specified groups based on explicit rules or criteria. The groups are known before looking at the data.
- Type: Deductive (Rule-based, Top-down).
- Learning Type: Supervised (labels are provided by humans).
- Example: Grading System.
- Rule: Marks > 90 = 'A', 80–90 = 'B', 70–80 = 'C'.
- We define the bins first, then assign students to them.
More Examples of Categorization:
| Domain | Categorization Rule | Categories |
|---|---|---|
| Healthcare | BMI ranges | Underweight, Normal, Overweight, Obese |
| E-Commerce | Purchase amount thresholds | Bronze (<₹1K), Silver (₹1K–₹5K), Gold (>₹5K) |
| Education | Marks ranges | Pass/Fail, Grade A/B/C/D |
| Banking | Credit Score ranges | Poor, Fair, Good, Excellent |
2. Segmentation
Formal Definition: Segmentation is the process of discovering unknown, naturally occurring groups in data based on mathematical similarity or patterns. The groups are discovered from the data itself — not predefined by humans.
- Type: Inductive (Pattern-based, Bottom-up).
- Learning Type: Unsupervised (no labels provided).
- Example: Customer Segmentation.
- Algorithm analyzes purchase history, browsing behavior, and demographics.
- Discovers three groups: "Budget Shoppers", "Tech Enthusiasts", and "Gift Buyers".
- We didn't define these rules; the data revealed them.
3. Key Differences (Comprehensive)
| Feature | Categorization | Segmentation |
|---|---|---|
| Basis | Predefined Rules / Thresholds | Mathematical Similarity (Distance/Density) |
| Role | Classification / Sorting | Discovery / Clustering |
| Logic | "I decide the groups" (Human-driven) | "Data decides the groups" (Algorithm-driven) |
| Input Required | Rules + Data | Only Data |
| Number of Groups | Known in advance | Often unknown (algorithm determines or user sets K) |
| Adaptability | Static — rules don't change with new data | Dynamic — groups may shift as new data arrives |
| Examples | Age Groups, File Types, Grading | Market Segments, Image Regions, Anomaly Groups |
| Algorithms | If-Else rules, Binning, Lookup tables | K-Means, DBSCAN, Hierarchical Clustering |
4. Segmentation Algorithms (In Detail)
Since segmentation is about discovery, we use Unsupervised Machine Learning algorithms.
A. K-Means Clustering (Step-by-Step):
- Choose K: Decide how many clusters you want (e.g., K=3).
- Initialize Centroids: Randomly place K points in the data space.
- Assign Points: Each data point is assigned to the nearest centroid (using Euclidean distance).
- Update Centroids: Move each centroid to the mean of all points assigned to it.
- Repeat: Steps 3–4 until centroids stop moving (convergence).
| Property | K-Means |
|---|---|
| Type | Partition-based |
| Shape of Clusters | Spherical / Convex |
| Requires K? | Yes (must specify number of clusters) |
| Sensitive to Outliers? | Yes (mean is affected by outliers) |
| Choosing K | Use the Elbow Method (plot inertia vs. K; elbow point = best K) |
B. Hierarchical Clustering:
- Builds a tree of clusters called a Dendrogram.
- Two approaches: Agglomerative (bottom-up: each point starts as its own cluster, merge closest pairs) and Divisive (top-down: start with one cluster, split).
- Advantage: No need to pre-specify K. You cut the dendrogram at the desired height.
C. DBSCAN (Density-Based):
- Groups points that are closely packed together (high-density regions).
- Points in low-density regions are labeled as noise/outliers.
- Advantage: Can find arbitrarily shaped clusters; does NOT require K.
- Parameters:
eps(neighborhood radius) andmin_samples(minimum points to form a cluster).
5. Distance Metrics
Clustering algorithms rely on measuring "distance" between data points:
| Metric | Formula | Best For |
|---|---|---|
| Euclidean | Straight-line distance (Pythagoras) | Continuous numerical data |
| Manhattan | Sum of absolute differences ("city block" distance) | Grid-like or high-dimensional data |
| Cosine Similarity | Angle between two vectors (ignores magnitude) | Text data, recommendation systems |
6. Evaluating Cluster Quality
- Silhouette Score: Measures how similar a point is to its own cluster vs. neighboring clusters. Ranges from -1 to +1.
- +1: Perfectly clustered.
- 0: On the boundary between two clusters.
- -1: Assigned to the wrong cluster.
- Inertia (Within-Cluster Sum of Squares): Lower is better. Used in the Elbow Method.