Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Statistics: Mean, Median, Mode, Variance & SD

Lesson 13 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Descriptive Statistics

Statistics is the science of collecting, analyzing, interpreting, and presenting data. Descriptive Statistics summarizes and describes the main features of a dataset. It is the first step in any data analysis — before building models, you must understand your data.

---

Measures of Central Tendency

These measures tell you where the center of the data is — a single value that represents the "typical" data point.

1. Mean (Average)

Definition: The sum of all values divided by the number of values.

Formula: μ = (1/n) × Σᵢ₌₁ⁿ xᵢ

Example: Data: [10, 20, 30, 40, 50]. Mean = (10 + 20 + 30 + 40 + 50) / 5 = 150 / 5 = 30

Properties:

  • Uses every data point in the calculation.
  • Sensitive to outliers. One extreme value can shift the mean dramatically.

Example of Outlier Effect: Data: [10, 20, 30, 40, 500]. Mean = (10 + 20 + 30 + 40 + 500) / 5 = 600 / 5 = 120. A single outlier (500) pulls the mean above four of the five values.
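
The two examples above can be checked in a few lines. This is a minimal sketch that assumes the data sits in a plain Python list; the helper name `arithmetic_mean` is illustrative and not part of the lesson.

```python
from statistics import mean


def arithmetic_mean(values):
    """Sum of all values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]

print(arithmetic_mean(clean))         # 30.0
print(arithmetic_mean(with_outlier))  # 120.0

# The standard library's statistics.mean gives the same results.
print(arithmetic_mean(clean) == mean(clean))                # True
print(arithmetic_mean(with_outlier) == mean(with_outlier))  # True
```

Replacing one value (50 with 500) moves the mean from 30 to 120, which is the sensitivity to outliers described in the properties list.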

---

2. Median

Definition: The middle value when data is arranged in ascending order. If there is an even number of values, the median is the average of the two middle values.

Steps:

  1. Sort the data.
  2. If n is odd: Median = Middle value.
  3. If n is even: Median = Average of the two middle values.

Example (Odd): Data: [10, 20, 30, 40, 50] → Median = 30

Example (Even): Data: [10, 20, 30, 40] → Median = (20 + 30) / 2 = 25
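
The three steps above can be sketched as follows. The function name `median_of` is an illustrative choice; the standard library's `statistics.median` follows the same rule.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)          # Step 1: sort the data
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:                    # Step 2: odd n, take the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # Step 3: even n


print(median_of([10, 20, 30, 40, 50]))   # 30
print(median_of([10, 20, 30, 40]))       # 25.0
print(median_of([10, 20, 30, 40, 500]))  # 30
```

The last call uses the outlier dataset from the mean section: the mean jumped to 120, while the median stays at 30.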

Properties:

  • Robust to outliers: an extreme value shifts the median little or not at all. This is why median income is often preferred over mean income.
  • Recommended for skewed distributions.

---

3. Mode

Definition: The value that appears most frequently in a dataset.

Example: Data: [10, 20, 20, 30, 30, 30, 40] → Mode = 30 (appears 3 times)

Types:

  • Unimodal: One mode.
  • Bimodal: Two modes.
  • Multimodal: More than two modes.
  • No mode: If all values appear with equal frequency.

Use in Data Science:

  • Most useful for categorical data (e.g., most popular product color).
  • Can be used for imputing missing categorical values.
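
A minimal sketch of finding the mode and using it to impute missing categorical values. The function name `modes`, the colour data, and the use of `None` to mark a missing value are all assumptions made for illustration. The function returns a list so that bimodal and multimodal data are handled, and it returns an empty list when several distinct values are tied, following the "No mode" case above.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    winners = [value for value, count in counts.items() if count == top]
    # Lesson convention: if all distinct values are equally frequent, there is no mode.
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners


print(modes([10, 20, 20, 30, 30, 30, 40]))  # [30]        (unimodal)
print(modes([1, 1, 2, 2, 3]))               # [1, 2]      (bimodal)
print(modes([1, 2, 3]))                     # []          (no mode)

# Imputing missing categorical values with the mode (hypothetical data).
colors = ["red", "blue", None, "red", "green", None]
observed = [c for c in colors if c is not None]
fill_value = modes(observed)[0]
imputed = [fill_value if c is None else c for c in colors]
print(imputed)  # ['red', 'blue', 'red', 'red', 'green', 'red']
```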

---

Comparison of Central Tendency Measures

| Measure | Best For | Outlier Sensitive? | Data Type |
|---|---|---|---|
| Mean | Symmetric distributions | ✅ Yes (highly) | Numerical |
| Median | Skewed distributions | ❌ No (robust) | Numerical |
| Mode | Categorical data | ❌ No | Numerical & Categorical |

---

Measures of Dispersion (Spread)

Central tendency tells you the center, but two datasets can have the same mean and look completely different. Dispersion measures tell you how spread out the data is.

4. Range

Definition: The difference between the maximum and minimum values. Range = Max - Min

Example: Data: [10, 20, 30, 40, 50] → Range = 50 - 10 = 40

Limitation: Extremely sensitive to outliers; only uses two values.
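
A short sketch of the range and its limitation. The helper name `value_range` and the second and third datasets are illustrative assumptions.

```python
def value_range(values):
    """Difference between the maximum and minimum values."""
    return max(values) - min(values)


even_spread = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]
clustered = [10, 30, 30, 30, 50]

print(value_range(even_spread))   # 40
print(value_range(with_outlier))  # 490
print(value_range(clustered))     # 40
```

One outlier pushes the range from 40 to 490. The first and third datasets have the same range even though the third is clustered around 30, because only the two extreme values enter the calculation.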

---

5. Variance (σ²)

Definition: Variance is the average squared deviation from the mean. The larger the variance, the farther the data points lie from the mean overall.

Population Variance Formula: σ² = (1/N) × Σᵢ₌₁ᴺ (xᵢ - μ)²

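Sample Variance Formula (Bessel's Correction): s² = (1/(n-1)) × Σᵢ₌₁ⁿ (xᵢ - x̄)²

Dividing by n - 1 instead of n corrects the underestimate that results from measuring deviations around the sample mean x̄ rather than the true population mean μ.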

Why squared?

  • If we just summed the deviations (xᵢ - μ), positives and negatives would cancel out and give zero.
  • Squaring ensures all deviations are positive.

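Example (population variance): Data: [2, 4, 6, 8, 10], Mean = 6. Deviations: [-4, -2, 0, 2, 4]. Squared Deviations: [16, 4, 0, 4, 16]. Variance = (16 + 4 + 0 + 4 + 16) / 5 = 40 / 5 = 8. If the same five values are treated as a sample, the sample variance is 40 / 4 = 10.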
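
The two variance formulas can be sketched side by side on the example data. The function names `population_variance` and `sample_variance` are illustrative; the standard library offers the same calculations as `statistics.pvariance` and `statistics.variance`.

```python
from statistics import pvariance, variance


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1 (Bessel's correction)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 6, 8, 10]

# Raw deviations cancel out, which is why they are squared first.
print(sum(x - 6 for x in data))   # 0

print(population_variance(data))  # 8.0
print(sample_variance(data))      # 10.0

# Cross-check against the standard library.
print(population_variance(data) == pvariance(data))  # True
print(sample_variance(data) == variance(data))       # True
```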

---

6. Standard Deviation (σ)

Definition: The square root of the variance. It has the same unit as the original data, making it more interpretable than variance.

σ = √(σ²)

From the example above: σ = √8 ≈ 2.83

Interpretation:

  • A low SD means data points are close to the mean (consistent data).
  • A high SD means data points are spread out (variable data).
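
A brief sketch of the interpretation above, using `statistics.pstdev` (population standard deviation) from the standard library. The datasets `consistent` and `variable` are made up for illustration; both have a mean of 30.

```python
import math
from statistics import mean, pstdev

# Standard deviation of the variance example: square root of 8.
data = [2, 4, 6, 8, 10]
print(round(math.sqrt(8), 2))  # 2.83
print(round(pstdev(data), 2))  # 2.83

# Two datasets with the same mean but very different spread.
consistent = [28, 29, 30, 31, 32]
variable = [10, 20, 30, 40, 50]

print(mean(consistent) == mean(variable))  # True
print(round(pstdev(consistent), 2))        # 1.41
print(round(pstdev(variable), 2))          # 14.14
```

The means are identical, but the standard deviation shows that the second dataset is about ten times as spread out as the first.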

---

Variance vs Standard Deviation

| Feature | Variance (σ²) | Standard Deviation (σ) |
|---|---|---|
| Unit | Squared units (e.g., kg²) | Same as data (e.g., kg) |
| Interpretability | Less intuitive | More intuitive |
| Use in ML | Used in formulas (e.g., Normal Distribution) | Used for data description |
| Sensitivity | Sensitive to outliers | Sensitive to outliers |

---

The Normal Distribution & Standard Deviation

The Normal (Gaussian) Distribution is fully defined by two parameters: its mean (μ), which sets the center, and its standard deviation (σ), which sets the spread.

The 68-95-99.7 Rule (Empirical Rule):

| Range | Percentage of Data |
|---|---|
| μ ± 1σ | ~68% |
| μ ± 2σ | ~95% |
| μ ± 3σ | ~99.7% |

This means in a normally distributed dataset, almost all data falls within 3 standard deviations of the mean.
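
The percentages in the rule can be reproduced from the normal distribution's cumulative distribution function. This is a minimal sketch that assumes Python 3.8 or later, where `statistics.NormalDist` is available; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k) * 100:.2f}%")
# within 1 SD: 68.27%
# within 2 SD: 95.45%
# within 3 SD: 99.73%
```

The same fractions hold for any μ and σ, because the rule is stated in units of standard deviations from the mean.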

Summary

  • Mean, Median, and Mode describe the center of data; each has different strengths.
  • Median is preferred for skewed data; Mean for symmetric data.
  • Variance and Standard Deviation measure data spread.
  • Standard Deviation is more interpretable as it uses the same units as the data.
  • The 68-95-99.7 rule links Standard Deviation to the Normal Distribution.