Descriptive Statistics
Statistics is the science of collecting, analyzing, interpreting, and presenting data. Descriptive Statistics summarizes and describes the main features of a dataset. It is the first step in any data analysis — before building models, you must understand your data.
---
Measures of Central Tendency
These measures tell you where the center of the data is — a single value that represents the "typical" data point.
1. Mean (Average)
Definition: The sum of all values divided by the number of values.
Formula: μ = (1/n) × Σᵢ₌₁ⁿ xᵢ
Example: Data: [10, 20, 30, 40, 50] Mean = (10 + 20 + 30 + 40 + 50) / 5 = 150 / 5 = 30
Properties:
- Uses every data point in the calculation.
- Sensitive to outliers. One extreme value can shift the mean dramatically.
Example of Outlier Effect: Data: [10, 20, 30, 40, 500] Mean = 600 / 5 = 120 (Drastically shifted by the outlier 500!)
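The two examples above can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `arithmetic_mean` is made up for illustration, and `statistics.mean` from the standard library is used only as a cross-check.

```python
from statistics import mean

def arithmetic_mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)

clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]

mean_clean = arithmetic_mean(clean)           # 150 / 5 = 30.0
mean_outlier = arithmetic_mean(with_outlier)  # 600 / 5 = 120.0

# The standard-library helper agrees with the hand-written version.
assert mean(clean) == mean_clean
assert mean(with_outlier) == mean_outlier

print(mean_clean, mean_outlier)
```

Replacing a single value (50 with 500) moves the mean from 30 to 120, even though four of the five points are unchanged.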
---
2. Median
Definition: The middle value when data is arranged in ascending order. If there is an even number of values, the median is the average of the two middle values.
Steps:
- Sort the data.
- If n is odd: Median = Middle value.
- If n is even: Median = Average of the two middle values.
Example (Odd): Data: [10, 20, 30, 40, 50] → Median = 30
Example (Even): Data: [10, 20, 30, 40] → Median = (20 + 30) / 2 = 25
Properties:
- Robust to outliers: an extreme value barely moves it. This is why median income is often preferred over mean income.
- Recommended for skewed distributions.
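The steps above translate directly into code. The sketch below uses a hypothetical helper, `median_by_steps`, and compares it with `statistics.median` from the standard library. The last call reuses the outlier dataset from the Mean section to show the robustness property.

```python
from statistics import median

def median_by_steps(values):
    """Sort, then take the middle value (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median_by_steps([10, 20, 30, 40, 50])       # 30
even_result = median_by_steps([10, 20, 30, 40])          # (20 + 30) / 2 = 25.0
outlier_result = median_by_steps([10, 20, 30, 40, 500])  # still 30

# Cross-check against the standard library.
assert median([10, 20, 30, 40]) == even_result
assert median([10, 20, 30, 40, 500]) == outlier_result
```

The outlier 500 pushed the mean of this data to 120, while the median stays at 30.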
---
3. Mode
Definition: The value that appears most frequently in a dataset.
Example: Data: [10, 20, 20, 30, 30, 30, 40] → Mode = 30 (appears 3 times)
Types:
- Unimodal: One mode.
- Bimodal: Two modes.
- Multimodal: More than two modes.
- No mode: If all values appear with equal frequency.
Use in Data Science:
- Most useful for categorical data (e.g., most popular product color).
- Can be used for imputing missing categorical values.
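As a rough illustration, the standard library's `statistics.multimode` (Python 3.8+) returns every most-frequent value, which covers the unimodal and bimodal cases. The color list and the use of `None` to mark missing entries are assumptions made for this example, not data from the text.

```python
from statistics import multimode

values = [10, 20, 20, 30, 30, 30, 40]
modes = multimode(values)             # [30], since 30 appears 3 times

bimodal = multimode([1, 1, 2, 2, 3])  # [1, 2], two values tie for most frequent

# Mode imputation for a categorical column (hypothetical data, None = missing).
colors = ["red", "blue", None, "blue", "green", None]
observed = [c for c in colors if c is not None]
most_common = multimode(observed)[0]  # "blue"
imputed = [c if c is not None else most_common for c in colors]

print(modes, bimodal, imputed)
```

When every value is equally frequent, `multimode` returns all of them, which matches the "no mode" case listed above. With ties, taking `[0]` picks the first mode encountered, so a real imputation step may need an explicit tie-breaking rule.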
---
Comparison of Central Tendency Measures
| Measure | Best For | Outlier Sensitive? | Data Type |
|---|---|---|---|
| Mean | Symmetric distributions | ✅ Yes (highly) | Numerical |
| Median | Skewed distributions | ❌ No (robust) | Numerical |
| Mode | Categorical data | ❌ No | Numerical & Categorical |
---
Measures of Dispersion (Spread)
Central tendency tells you the center, but two datasets can have the same mean and look completely different. Dispersion measures tell you how spread out the data is.
4. Range
Definition: The difference between the maximum and minimum values. Range = Max - Min
Example: Data: [10, 20, 30, 40, 50] → Range = 50 - 10 = 40
Limitation: Extremely sensitive to outliers; only uses two values.
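A small sketch of both points: two datasets with the same mean but different spread, and the effect of one outlier on the range. The `tight` dataset and the helper name `value_range` are illustrative assumptions.

```python
def value_range(values):
    """Difference between the maximum and minimum values."""
    return max(values) - min(values)

tight = [28, 29, 30, 31, 32]          # mean 30, values close together
wide = [10, 20, 30, 40, 50]           # mean 30, values far apart
with_outlier = [10, 20, 30, 40, 500]  # one extreme value

assert sum(tight) / len(tight) == sum(wide) / len(wide) == 30

range_tight = value_range(tight)           # 32 - 28 = 4
range_wide = value_range(wide)             # 50 - 10 = 40
range_outlier = value_range(with_outlier)  # 500 - 10 = 490

print(range_tight, range_wide, range_outlier)
```

Because only the maximum and minimum enter the calculation, one extreme value changes the range from 40 to 490.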
---
5. Variance (σ²)
Definition: Variance is the average squared deviation from the mean. It quantifies how far the data points lie from the mean, expressed in squared units.
Population Variance Formula: σ² = (1/N) × Σᵢ₌₁ᴺ (xᵢ - μ)²
Sample Variance Formula (Bessel's Correction): s² = (1/(n-1)) × Σᵢ₌₁ⁿ (xᵢ - x̄)²
Why squared?
- If we just summed the deviations (xᵢ - μ), positives and negatives would cancel out and give zero.
- Squaring ensures every deviation contributes a non-negative amount.
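Example (population formula): Data: [2, 4, 6, 8, 10], Mean = 6. Deviations: [-4, -2, 0, 2, 4]. Squared Deviations: [16, 4, 0, 4, 16]. Variance = (16 + 4 + 0 + 4 + 16) / 5 = 40 / 5 = 8. Treating the same values as a sample instead gives s² = 40 / 4 = 10.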
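The worked example divides by the number of values, so it follows the population formula. The sketch below repeats that calculation step by step, then applies Bessel's correction to the same data for comparison. It checks both results against `statistics.pvariance` and `statistics.variance`. The variable names are illustrative.

```python
from statistics import pvariance, variance

data = [2, 4, 6, 8, 10]
n = len(data)
mu = sum(data) / n                            # 6.0

deviations = [x - mu for x in data]           # [-4.0, -2.0, 0.0, 2.0, 4.0]
squared = [d ** 2 for d in deviations]        # [16.0, 4.0, 0.0, 4.0, 16.0]

# Raw deviations cancel out, which is why they are squared first.
assert sum(deviations) == 0

population_variance = sum(squared) / n        # 40 / 5 = 8.0
sample_variance = sum(squared) / (n - 1)      # 40 / 4 = 10.0

# Cross-check against the standard library.
assert pvariance(data) == population_variance
assert variance(data) == sample_variance

print(population_variance, sample_variance)
```

Dividing by n - 1 gives a slightly larger value, and the gap shrinks as n grows.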
---
6. Standard Deviation (σ)
Definition: The square root of the variance. It has the same unit as the original data, making it more interpretable than variance.
σ = √(σ²)
From the example above: σ = √8 ≈ 2.83
Interpretation:
- A low SD means data points are close to the mean (consistent data).
- A high SD means data points are spread out (variable data).
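To make the interpretation concrete, the sketch below computes the population standard deviation for the variance example and for two made-up datasets that share a mean of 30. It uses `statistics.pstdev` from the standard library. The dataset names `consistent` and `variable` are assumptions for illustration.

```python
import math
from statistics import pstdev

data = [2, 4, 6, 8, 10]
sd = math.sqrt(8)                    # population variance from the example is 8
assert math.isclose(pstdev(data), sd)

consistent = [28, 29, 30, 31, 32]    # points close to the mean
variable = [10, 20, 30, 40, 50]      # points spread out

low_sd = pstdev(consistent)          # sqrt(2), about 1.41
high_sd = pstdev(variable)           # sqrt(200), about 14.14

print(round(sd, 2), round(low_sd, 2), round(high_sd, 2))
```

For a sample rather than a full population, `statistics.stdev` applies the n - 1 divisor instead.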
---
Variance vs Standard Deviation
| Feature | Variance (σ²) | Standard Deviation (σ) |
|---|---|---|
| Unit | Squared units (e.g., kg²) | Same as data (e.g., kg) |
| Interpretability | Less intuitive | More intuitive |
| Use in ML | Used in formulas (e.g., Normal Distribution) | Used for data description |
| Sensitivity | Sensitive to outliers | Sensitive to outliers |
---
The Normal Distribution & Standard Deviation
The Normal (Gaussian) Distribution is fully defined by its mean (μ) and standard deviation (σ).
The 68-95-99.7 Rule (Empirical Rule):
| Range | Percentage of Data |
|---|---|
| μ ± 1σ | ~68% |
| μ ± 2σ | ~95% |
| μ ± 3σ | ~99.7% |
This means in a normally distributed dataset, almost all data falls within 3 standard deviations of the mean.
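The percentages in the table can be checked numerically with `statistics.NormalDist` (Python 3.8+). This is a sketch under stated assumptions: the values μ = 50 and σ = 10 are arbitrary choices, and any other mean or positive standard deviation gives the same shares.

```python
from statistics import NormalDist

mu, sigma = 50.0, 10.0               # arbitrary illustrative parameters
dist = NormalDist(mu=mu, sigma=sigma)

def share_within(k):
    """Probability mass between mu - k*sigma and mu + k*sigma."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

coverage = {k: share_within(k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} sigma: {share:.2%}")  # roughly 68.27%, 95.45%, 99.73%
```

The rule describes an exact normal distribution. Real datasets that are only approximately normal will match these shares only approximately.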
---
Summary
- Mean, Median, and Mode describe the center of data; each has different strengths.
- Median is preferred for skewed data; Mean for symmetric data.
- Variance and Standard Deviation measure data spread.
- Standard Deviation is more interpretable as it uses the same units as the data.
- The 68-95-99.7 rule links Standard Deviation to the Normal Distribution.