Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

1.8 Categorization & Segmentation

Lesson 8 of 32 in the free Data Visualisation and Analytics notes on Siksha Sarovar, written by Rohit Jangra.

Categorization vs. Segmentation

Study Deep: The K-Means Convergence

K-Means, one of the most widely used segmentation algorithms, is an iterative process:

  1. Initialize: Pick $K$ random starting points (centroids).
  2. Assign: Every data point "joins" its nearest centroid.
  3. Update: Each centroid moves to the center of its new members.
  4. Repeat: Steps 2 and 3 are repeated until the centroids stop moving (convergence).
  • BCA Exam Tip: Always mention that K-Means is sensitive to the initial starting points and outliers!

1. Categorization

Formal Definition: Categorization is the process of assigning data points into predefined, manually specified groups based on explicit rules or criteria. The groups are known before looking at the data.

  • Type: Deductive (Rule-based, Top-down).
  • Learning Type: Supervised (labels are provided by humans).
  • Example: Grading System.
      • Rule: Marks > 90 = 'A', 80–90 = 'B', 70–79 = 'C'.
      • We define the bins first, then assign students to them.
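The grading rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `grade`, the sample students, the handling of boundary values (80 and 90 both count as 'B') and the placeholder label for marks below 70 are all assumptions.

```python
def grade(marks: float) -> str:
    """Assign a grade from fixed, human-defined thresholds."""
    if marks > 90:
        return "A"
    if marks >= 80:          # assumption: 80 and 90 both fall in 'B'
        return "B"
    if marks >= 70:
        return "C"
    return "Below C"         # placeholder: the notes define no lower grades


# Hypothetical students; the bins exist before any data is seen.
students = {"Asha": 95, "Ravi": 84, "Meena": 72, "Karan": 55}
grades = {name: grade(marks) for name, marks in students.items()}
print(grades)
```

Note that nothing is learned from the data here: changing the students never changes the thresholds, which is what makes this categorization rather than segmentation.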

More Examples of Categorization:

| Domain | Categorization Rule | Categories |
|---|---|---|
| Healthcare | BMI ranges | Underweight, Normal, Overweight, Obese |
| E-Commerce | Purchase amount thresholds | Bronze (<₹1K), Silver (₹1K–₹5K), Gold (>₹5K) |
| Education | Marks ranges | Pass/Fail, Grade A/B/C/D |
| Banking | Credit Score ranges | Poor, Fair, Good, Excellent |

2. Segmentation

Formal Definition: Segmentation is the process of discovering unknown, naturally occurring groups in data based on mathematical similarity or patterns. The groups are discovered from the data itself — not predefined by humans.

  • Type: Inductive (Pattern-based, Bottom-up).
  • Learning Type: Unsupervised (no labels provided).
  • Example: Customer Segmentation.
      • An algorithm analyzes purchase history, browsing behavior, and demographics.
      • It discovers three groups: "Budget Shoppers", "Tech Enthusiasts", and "Gift Buyers".
      • We didn't define these rules; the data revealed them.

3. Key Differences (Comprehensive)

| Feature | Categorization | Segmentation |
|---|---|---|
| Basis | Predefined Rules / Thresholds | Mathematical Similarity (Distance/Density) |
| Role | Classification / Sorting | Discovery / Clustering |
| Logic | "I decide the groups" (Human-driven) | "Data decides the groups" (Algorithm-driven) |
| Input Required | Rules + Data | Only Data |
| Number of Groups | Known in advance | Often unknown (algorithm determines or user sets K) |
| Adaptability | Static: rules don't change with new data | Dynamic: groups may shift as new data arrives |
| Examples | Age Groups, File Types, Grading | Market Segments, Image Regions, Anomaly Groups |
| Algorithms | If-Else rules, Binning, Lookup tables | K-Means, DBSCAN, Hierarchical Clustering |

4. Segmentation Algorithms (In Detail)

Since segmentation is about discovery, we use Unsupervised Machine Learning algorithms.

A. K-Means Clustering (Step-by-Step):

  1. Choose K: Decide how many clusters you want (e.g., K=3).
  2. Initialize Centroids: Randomly place K points in the data space.
  3. Assign Points: Each data point is assigned to the nearest centroid (using Euclidean distance).
  4. Update Centroids: Move each centroid to the mean of all points assigned to it.
  5. Repeat: Steps 3–4 until centroids stop moving (convergence).

| Property | K-Means |
|---|---|
| Type | Partition-based |
| Shape of Clusters | Spherical / Convex |
| Requires K? | Yes (must specify number of clusters) |
| Sensitive to Outliers? | Yes (mean is affected by outliers) |
| Choosing K | Use the Elbow Method (plot inertia vs. K; elbow point = best K) |
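The K-Means steps and the inertia used by the Elbow Method can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not a production implementation: the sample points are made up, centroids are initialised from randomly chosen data points (a common variant of "random placement"), and the `init` parameter is a hypothetical convenience for supplying starting centroids by hand.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain K-Means on a list of coordinate tuples.

    Returns (centroids, clusters, inertia).
    """
    rng = random.Random(seed)
    # Initialise: use the given centroids, or pick k data points at random.
    centroids = list(init) if init is not None else rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assign: every point joins its nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:   # convergence: nothing moved
            break
        centroids = new_centroids
    # Inertia: sum of squared distances from each point to its nearest centroid.
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, clusters, inertia


# Made-up data: three loose groups of three points each.
points = [(1, 1), (1.5, 2), (1, 0.5),
          (8, 8), (9, 9), (8, 9.5),
          (1, 9), (0.5, 8), (1.5, 9.5)]

# Elbow Method: print inertia for several values of K and look for the bend.
for k in range(1, 6):
    _, _, inertia = kmeans(points, k)
    print(k, round(inertia, 2))
```

Changing `seed` or `init` can change the final clusters, which is the sensitivity to starting points that the exam tip refers to.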

B. Hierarchical Clustering:

  • Builds a tree of clusters called a Dendrogram.
  • Two approaches: Agglomerative (bottom-up: each point starts as its own cluster, merge closest pairs) and Divisive (top-down: start with one cluster, split).
  • Advantage: No need to pre-specify K. You cut the dendrogram at the desired height.
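The agglomerative approach can be sketched as follows. This is a minimal sketch that assumes single linkage (the distance between two clusters is the distance between their closest members); the notes do not name a linkage rule, and the function name, the `cut_height` parameter and the sample points are illustrative assumptions.

```python
import math


def agglomerative(points, cut_height):
    """Single-linkage agglomerative clustering.

    Merges the two closest clusters repeatedly and stops once the closest
    pair is farther apart than cut_height (cutting the dendrogram there).
    Returns (clusters, merges), where merges lists (height, merged_size).
    """
    clusters = [[p] for p in points]      # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cut_height:                # the cut: no merges above this height
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        merges.append((round(d, 3), len(merged)))
    return clusters, merges


# Made-up data: three loose groups of three points each.
points = [(1, 1), (1.5, 2), (1, 0.5),
          (8, 8), (9, 9), (8, 9.5),
          (1, 9), (0.5, 8), (1.5, 9.5)]

clusters, merges = agglomerative(points, cut_height=3.0)
print(len(clusters), "clusters after cutting at height 3.0")
for height, size in merges:
    print("merged at height", height, "-> cluster of size", size)
```

The number of clusters is never passed in; it falls out of where the cut is made. A higher `cut_height` yields fewer, larger clusters.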

C. DBSCAN (Density-Based):

  • Groups points that are closely packed together (high-density regions).
  • Points in low-density regions are labeled as noise/outliers.
  • Advantage: Can find arbitrarily shaped clusters; does NOT require K.
  • Parameters: eps (neighborhood radius) and min_samples (minimum number of points within eps of a point for that neighborhood to count as a dense region).
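A compact sketch of the DBSCAN idea is shown here. It assumes Euclidean distance and counts a point as part of its own neighborhood; the function name, the sample points and the chosen `eps` and `min_samples` values are illustrative assumptions rather than recommended settings.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)         # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:      # low-density region
            labels[i] = NOISE
            continue
        cluster_id += 1                   # i is a core point: start a cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:        # noise reached from a core point
                labels[j] = cluster_id    # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)   # j is also core: keep expanding
    return labels


# Made-up data: three dense groups plus one isolated point at (5, 5).
points = [(1, 1), (1.5, 2), (1, 0.5),
          (8, 8), (9, 9), (8, 9.5),
          (1, 9), (0.5, 8), (1.5, 9.5),
          (5, 5)]

print(dbscan(points, eps=2.0, min_samples=3))
# -> [0, 0, 0, 1, 1, 1, 2, 2, 2, -1]
```

No value of K appears anywhere: three clusters emerge from the density of the data, and the isolated point is labeled as noise.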

5. Distance Metrics

Clustering algorithms rely on measuring "distance" between data points:

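| Metric | Formula | Best For |
|---|---|---|
| Euclidean | $\sqrt{\sum_i (x_i - y_i)^2}$: straight-line distance (Pythagoras) | Continuous numerical data |
| Manhattan | $\sum_i \lvert x_i - y_i \rvert$: sum of absolute differences ("city block" distance) | Grid-like or high-dimensional data |
| Cosine Similarity | $\frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$: cosine of the angle between two vectors (ignores magnitude) | Text data, recommendation systems |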
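The three measures can be written out directly. This is a small sketch using only the standard library; the function names and the sample vectors are illustrative choices.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences ("city block" distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; magnitude is ignored."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


print(euclidean((0, 0), (3, 4)))                     # 5.0
print(manhattan((0, 0), (3, 4)))                     # 7
print(round(cosine_similarity((1, 2), (2, 4)), 6))   # 1.0 (same direction)
print(cosine_similarity((1, 0), (0, 1)))             # 0.0 (perpendicular)
```

The vectors (1, 2) and (2, 4) are far apart by Euclidean distance yet have cosine similarity 1, which is why cosine suits text data where document length should not matter.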

6. Evaluating Cluster Quality

  • Silhouette Score: Measures how similar a point is to its own cluster vs. neighboring clusters. Ranges from -1 to +1.
      • +1: Perfectly clustered.
      • 0: On the boundary between two clusters.
      • -1: Assigned to the wrong cluster.
  • Inertia (Within-Cluster Sum of Squares): Lower is better. Used in the Elbow Method.
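The silhouette score can be computed by hand on a tiny example. The sketch below assumes the standard definition: for each point, a is the mean distance to the other points in its own cluster, b is the lowest mean distance to the points of any other cluster, and the point's score is (b − a) / max(a, b). Points in single-member clusters are given a score of 0, a common convention. The function name and the sample data are illustrative.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue                      # single-member cluster scores 0
        # a: mean distance to the other members (distance to itself is 0).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Made-up data: two tight pairs, far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = silhouette_score(points, [0, 0, 1, 1])   # pairs grouped together
bad = silhouette_score(points, [0, 1, 0, 1])    # pairs split across clusters
print(round(good, 3), round(bad, 3))
```

The sensible grouping scores close to +1, while the deliberately wrong labeling scores below 0, matching the interpretation of the range given above.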