Aim
To implement one-way ANOVA (Analysis of Variance) manually with NumPy — computing the sums of squares (SS), degrees of freedom, mean squares (MS) and the F-statistic for three groups of marks — and to present the result as a standard ANOVA table.
CO Mapping: CO1, CO2, CO3, CO5
Theory
One-way ANOVA tests whether three or more group means are equal. The hypotheses are:
- H₀: μ_A = μ_B = μ_C (all group means equal);
- H₁: at least one group mean differs.
The core idea is to split the total variability of all observations around the grand mean into two independent sources:
- Between-group variability (SS_between): how far each group's mean sits from the grand mean, weighted by group size — the part of the spread explained by group membership.
- Within-group variability (SS_within): how much observations scatter around their own group mean — pure noise that group membership cannot explain.
The identity SS_total = SS_between + SS_within always holds. Each SS is divided by its degrees of freedom (df_between = k − 1, df_within = n − k) to give mean squares, and the test statistic is their ratio:
F = MS_between / MS_within
Under H₀ both mean squares estimate the same error variance, so F ≈ 1. If the groups genuinely differ, MS_between inflates and F grows well beyond 1. The computed F is compared against the critical value of the F-distribution with (k − 1, n − k) degrees of freedom; here that is (2, 12), whose 5% critical value is about 3.89, and H₀ is rejected when the computed F exceeds it.
Why not just run three t-tests? Each pairwise t-test carries its own 5% false-alarm risk, and the risks compound across the three comparisons; ANOVA asks the question once, with one controlled error rate.
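The compounding of false-alarm risk across pairwise t-tests can be made concrete with a short calculation. This is only a sketch: it assumes the three tests are independent, which is not exactly true in practice because the tests share data, so the result is a rough guide rather than an exact error rate.

```python
from math import comb

alpha = 0.05            # significance level of each individual t-test
k = 3                   # number of groups
m = comb(k, 2)          # number of pairwise comparisons among k groups

# Assumption (illustrative only): the m tests are independent.
# P(at least one false alarm) = 1 - P(no false alarm in any test)
family_wise_error = 1 - (1 - alpha) ** m

print(f"pairwise tests needed: {m}")                              # 3
print(f"chance of at least one false alarm: {family_wise_error:.4f}")  # 0.1426
```

Under that assumption the overall false-alarm risk is roughly 14%, almost three times the nominal 5%.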
Dataset
Marks of three groups of 5 students each (k = 3, n = 15):
| Group | Values | Mean |
|---|---|---|
| A | 72, 75, 78, 71, 74 | 74.0 |
| B | 81, 85, 79, 84, 83 | 82.4 |
| C | 66, 69, 70, 68, 67 | 68.0 |
Grand mean = 1122 / 15 = 74.8.
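The means in the table can be checked with a few lines of NumPy. This is a minimal sketch; the `labels` list is an illustrative helper, not something the exercise prescribes.

```python
import numpy as np

group_a = np.array([72, 75, 78, 71, 74])
group_b = np.array([81, 85, 79, 84, 83])
group_c = np.array([66, 69, 70, 68, 67])

groups = [group_a, group_b, group_c]
labels = ["A", "B", "C"]  # illustrative labels for printing only

# Per-group sum and mean
for label, g in zip(labels, groups):
    print(f"Group {label}: sum = {g.sum()}, mean = {g.mean():.1f}")

# Pool all 15 marks to get the grand mean
all_values = np.concatenate(groups)
print(f"Total = {all_values.sum()}, n = {all_values.size}, "
      f"grand mean = {all_values.mean():.1f}")
```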
Procedure
- Define `group_a`, `group_b`, `group_c` as NumPy arrays and collect them in `groups`; build `all_values` with `np.concatenate`.
- Compute `grand_mean = all_values.mean()` (74.8), plus `k = 3` and `n = 15`.
- Compute `ss_between` as Σ nᵢ (x̄ᵢ − grand mean)² over the groups, and `ss_within` as the sum of each group's squared deviations from its own mean.
- Compute `ss_total` directly from all 15 values and verify it equals `ss_between + ss_within`.
- Divide by the degrees of freedom (`df_between = 2`, `df_within = 12`) to get `ms_between` and `ms_within`, then `f_value = ms_between / ms_within`.
- Assemble `anova_table` as a DataFrame with Source, SS, df, MS and F columns and print it rounded to 4 decimals.
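The steps above can be sketched as follows. The variable names follow the procedure; the row labels ("Between", "Within", "Total"), the inclusion of a Total row, and the use of `NaN` for cells that have no value are assumptions made for this illustration, not a prescribed layout.

```python
import numpy as np
import pandas as pd

# Step 1: the three groups and the pooled observations
group_a = np.array([72, 75, 78, 71, 74])
group_b = np.array([81, 85, 79, 84, 83])
group_c = np.array([66, 69, 70, 68, 67])
groups = [group_a, group_b, group_c]
all_values = np.concatenate(groups)

# Step 2: grand mean, number of groups, total sample size
grand_mean = all_values.mean()
k = len(groups)
n = all_values.size

# Step 3: between-group SS (weighted by group size) and within-group SS
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Step 4: total SS computed directly, then the partition identity is checked
ss_total = ((all_values - grand_mean) ** 2).sum()
assert np.isclose(ss_total, ss_between + ss_within)

# Step 5: degrees of freedom, mean squares and the F-statistic
df_between = k - 1
df_within = n - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_value = ms_between / ms_within

# Step 6: ANOVA table (row labels and the Total row are assumptions)
anova_table = pd.DataFrame({
    "Source": ["Between", "Within", "Total"],
    "SS": [ss_between, ss_within, ss_total],
    "df": [df_between, df_within, n - 1],
    "MS": [ms_between, ms_within, np.nan],
    "F": [f_value, np.nan, np.nan],
})
print(anova_table.round(4).to_string(index=False))
```

Using `np.isclose` rather than `==` for the identity check avoids a spurious failure from floating-point rounding.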
Interpretation of Results
Tracing the arithmetic: SS_between = 5(74.0 − 74.8)² + 5(82.4 − 74.8)² + 5(68.0 − 74.8)² = 3.2 + 288.8 + 231.2 = 523.2; SS_within = 30 + 23.2 + 10 = 63.2; SS_total = 586.4 when computed directly from all 15 marks, which equals 523.2 + 63.2, so the identity checks out. Then MS_between = 523.2 / 2 = 261.6, MS_within = 63.2 / 12 ≈ 5.2667, and F = 261.6 / 5.2667 ≈ 49.6709.
That is more than twelve times the 5% critical value F(2,12) ≈ 3.89, so H₀ is emphatically rejected: the three group means (74.0, 82.4, 68.0) are not chance fluctuations around a common mean. The table itself tells the story: group membership explains 523.2 of the 586.4 total sum of squares (about 89%), while within-group noise is small in every group (no mark lies more than 4 points from its own group mean). Note that ANOVA only says some difference exists; identifying which pairs differ needs a post-hoc test such as Tukey's HSD.
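The figures quoted in this interpretation can be reproduced with a short check. In this sketch the critical value 3.89 is hard-coded from the text rather than computed, and `explained_share` is an illustrative name for SS_between / SS_total (a ratio often called eta squared).

```python
import numpy as np

groups = [
    np.array([72, 75, 78, 71, 74]),   # Group A
    np.array([81, 85, 79, 84, 83]),   # Group B
    np.array([66, 69, 70, 68, 67]),   # Group C
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Each group's contribution to SS_within (30, 23.2 and 10 in the text)
ss_within_parts = [((g - g.mean()) ** 2).sum() for g in groups]
ss_within = sum(ss_within_parts)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ss_between + ss_within

f_value = (ss_between / 2) / (ss_within / 12)
f_critical = 3.89  # 5% critical value of F(2, 12), taken from the text

# Share of total variability attributable to group membership
explained_share = ss_between / ss_total
reject_h0 = f_value > f_critical

print("SS_within per group:", np.round(ss_within_parts, 1))
print(f"F = {f_value:.4f}, F / critical = {f_value / f_critical:.1f}")
print(f"explained share = {explained_share:.1%}, reject H0: {reject_h0}")
```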
Common Mistakes
- Forgetting to weight SS_between by group size `len(g)`: with unequal groups the unweighted version is simply wrong.
- Mixing up the degrees of freedom (using n − 1 for within, or k for between): the F ratio then references the wrong distribution.
- Concluding from a large F that every group differs from every other: ANOVA is an omnibus test; pairwise conclusions need post-hoc analysis.
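The first mistake is easy to demonstrate. The sketch below uses made-up groups of unequal size (hypothetical values, not the lab data) and shows that the partition SS_total = SS_between + SS_within survives only when each squared deviation is weighted by `len(g)`.

```python
import numpy as np

# Hypothetical groups of unequal size (3, 2 and 5 observations)
groups = [
    np.array([10.0, 12.0, 14.0]),
    np.array([20.0, 22.0]),
    np.array([30.0, 31.0, 32.0, 33.0, 34.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

ss_total = ((all_values - grand_mean) ** 2).sum()
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Correct: each group's squared deviation is weighted by its size
ss_between_weighted = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Mistake: the len(g) weight is dropped
ss_between_unweighted = sum((g.mean() - grand_mean) ** 2 for g in groups)

print(np.isclose(ss_total, ss_between_weighted + ss_within))    # True
print(np.isclose(ss_total, ss_between_unweighted + ss_within))  # False
```

Checking the identity after every computation is a cheap way to catch this error.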
🎯 Viva Questions
- What are H₀ and H₁ in one-way ANOVA? H₀: all group means are equal; H₁: at least one differs.
- What does the F-statistic measure? The ratio of between-group variance to within-group variance — how much more the groups differ from each other than their members do internally.
- Why is F ≈ 1 expected under H₀? Both MS_between and MS_within then estimate the same underlying error variance.
- What are the degrees of freedom here? Between: k − 1 = 2; Within: n − k = 12; Total: n − 1 = 14.
- What identity links the sums of squares? SS_total = SS_between + SS_within (586.4 = 523.2 + 63.2 in this data).
- Why use ANOVA instead of multiple t-tests? Repeated t-tests inflate the overall Type-I error rate; ANOVA tests all means at once with a single controlled α.