Hypothesis Testing
Hypothesis Testing is a structured, statistical method for making decisions about a population based on sample data. It helps answer questions like: "Is the difference in performance between Group A and Group B real, or just due to random chance?"
It is the backbone of A/B testing, clinical trials, and scientific research.
---
The Hypothesis Testing Framework
Step 1: State the Hypotheses
- Null Hypothesis (H₀): The "default" or "status quo" statement. It assumes there is no effect or no difference.
- Example: "The new drug has no effect on blood pressure."
- Alternative Hypothesis (H₁ or Hₐ): The statement we are trying to find evidence for. It assumes there is an effect or difference.
- Example: "The new drug reduces blood pressure."
Step 2: Choose Significance Level (α)
- The significance level is the threshold for rejection. Common values:
- α = 0.05 (5%): most common
- α = 0.01 (1%): stricter
- It represents the probability of rejecting H₀ when it is actually true (Type I Error).
Step 3: Collect Data & Calculate Test Statistic
- Perform the experiment or survey.
- Calculate the appropriate test statistic (Z-score, t-score, chi-squared, etc.) based on the data.
Step 4: Calculate the p-value
- The p-value is the probability of observing a test statistic at least as extreme as the one actually observed, assuming H₀ is true.
- Small p-value → Strong evidence against H₀.
Step 5: Make a Decision
| Condition | Decision |
|---|---|
| p-value ≤ α | Reject H₀: the result is "statistically significant" |
| p-value > α | Fail to reject H₀: not enough evidence to support H₁ |
Important: "Fail to reject H₀" does NOT mean H₀ is true. It means the data do not provide enough evidence to reject it.
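The five steps above can be sketched in Python for a one-sample, left-tailed z-test. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `one_sample_z_test` and all numbers (a null mean of 120 mmHg, a known σ of 15, and 64 patients with a sample mean of 116) are hypothetical, chosen only to mirror the blood-pressure example.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Left-tailed z-test of H0: mu = mu0 against H1: mu < mu0.

    Assumes the population standard deviation `sigma` is known.
    """
    # Step 3: test statistic (how many standard errors the sample mean is from mu0)
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Step 4: p-value = P(Z <= z) when H0 is true
    p_value = NormalDist().cdf(z)
    # Step 5: compare the p-value with the significance level chosen in Step 2
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, p_value, decision


# Hypothetical data: H0 says mean blood pressure is 120; the sample mean is 116.
z, p_value, decision = one_sample_z_test(sample_mean=116, mu0=120, sigma=15, n=64)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

With these made-up numbers the result is significant at α = 0.05 but not at α = 0.01, which shows why α has to be fixed before looking at the data.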
---
Types of Errors
| Error Type | Name | What Happened | Consequence |
|---|---|---|---|
| Type I | False Positive | Rejected H₀ when it was true | Concluded there's an effect when there isn't one |
| Type II | False Negative | Failed to reject H₀ when it was false | Missed a real effect |
Analogy:
- Type I Error: Fire alarm goes off, but there is no fire (false alarm).
- Type II Error: Fire is burning, but the alarm doesn't go off (missed fire).
Relationship:
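- For a fixed sample size, decreasing the Type I Error rate (making α smaller) → increases the Type II Error risk.
- At a fixed sample size there is always a trade-off between the two; collecting more data is the way to reduce both.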
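The trade-off can be made concrete by computing the Type II Error rate (β) directly. The sketch below assumes a right-tailed z-test with known σ; the helper name `type_ii_error` and the inputs (a true effect of 2 units, σ = 10, n = 100) are hypothetical values picked for illustration.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, effect, sigma, n):
    """Beta for a right-tailed z-test when the true mean exceeds mu0 by `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # reject H0 when z > z_crit
    shift = effect / (sigma / sqrt(n))       # mean of the z statistic when H1 is true
    return std_normal.cdf(z_crit - shift)    # P(fail to reject H0 | H1 is true)


# Shrinking alpha pushes the rejection threshold outward, so beta grows.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, effect=2.0, sigma=10.0, n=100)
    print(f"alpha = {alpha:.2f} -> beta = {beta:.3f}")
```

Calling the same function with a larger `n` gives a smaller β at every α, which is why sample size is the usual lever for lowering both error rates.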
---
Common Statistical Tests
| Test | When to Use | Example |
|---|---|---|
| Z-Test | Population variance known (or a large sample, n > 30, where the sample variance is a close approximation) | Comparing mean height to national average |
| t-Test | Population variance unknown and estimated from the sample; matters most for small samples (n < 30) | Comparing test scores of two small classes |
| Chi-Squared Test | Categorical data (proportions) | Is there a relationship between gender and product preference? |
| ANOVA | Comparing means of 3+ groups | Is there a difference in sales across 4 regions? |
| F-Test | Comparing variances of two groups | Are production line variances equal? |
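As one illustration of the table, here is a rough sketch of a chi-squared test of independence on a 2x2 table using only the standard library. The function name `chi_squared_2x2` and the counts are hypothetical, and no continuity correction is applied. The p-value shortcut works only for 1 degree of freedom (the 2x2 case); larger tables need a general chi-squared distribution function.

```python
from math import erfc, sqrt


def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, chi-squared is a squared standard normal,
    # so P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows are two customer groups, columns are product A / product B.
stat, p_value = chi_squared_2x2([[30, 20], [20, 30]])
print(f"chi-squared = {stat:.2f}, p-value = {p_value:.4f}")
```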
---
One-Tailed vs Two-Tailed Tests
| Test Type | Hypothesis | When to Use |
|---|---|---|
| One-Tailed | H₁: μ > μ₀ or H₁: μ < μ₀ | You predict the direction of the effect ("Drug reduces BP") |
| Two-Tailed | H₁: μ ≠ μ₀ | You just want to know if there's any difference ("Drug changes BP") |
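The choice of tails changes the p-value for the same test statistic. A minimal sketch, assuming a z statistic has already been computed; the helper name `z_p_values` and the value z = -1.8 are hypothetical.

```python
from statistics import NormalDist


def z_p_values(z):
    """p-values for the same z statistic under each alternative hypothesis."""
    std_normal = NormalDist()
    return {
        "left": std_normal.cdf(z),                 # H1: mu < mu0
        "right": 1 - std_normal.cdf(z),            # H1: mu > mu0
        "two": 2 * (1 - std_normal.cdf(abs(z))),   # H1: mu != mu0
    }


p = z_p_values(-1.8)
for tail, value in p.items():
    print(f"{tail:>5}-tailed p-value: {value:.4f}")
```

Here the left-tailed p-value falls below 0.05 while the two-tailed one does not, so the direction of H₁ should be fixed before seeing the data rather than chosen to obtain significance.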
---
Confidence Intervals
Definition: A confidence interval (CI) is a range of values, computed from sample data, that is expected to contain the true population parameter with a stated level of confidence (e.g., 95%).
Formula (for population mean): CI = x̄ ± Z × (σ / √n)
Where:
- x̄ = Sample mean
- Z = Z-score corresponding to the desired confidence level
- σ = Population standard deviation
- n = Sample size
Common Confidence Levels:
| Confidence Level | Z-Score | Interpretation |
|---|---|---|
| 90% | 1.645 | We are 90% confident the true mean is in this range |
| 95% | 1.96 | We are 95% confident the true mean is in this range |
| 99% | 2.576 | We are 99% confident the true mean is in this range |
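The Z-scores in the table come from the inverse CDF of the standard normal distribution: for a confidence level c, the critical value leaves (1 - c) / 2 in each tail. A small sketch using the standard library (the helper name `z_critical` is an assumption for illustration):

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value z such that P(-z < Z < z) = confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
# 90%: z = 1.645
# 95%: z = 1.960
# 99%: z = 2.576
```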
---
Confidence Interval — Worked Example
A sample of 100 students has a mean exam score of 72 with σ = 10. Calculate the 95% confidence interval.
CI = 72 ± 1.96 × (10 / √100)
CI = 72 ± 1.96 × 1
CI = 72 ± 1.96
CI = [70.04, 73.96]
Interpretation: We are 95% confident that the true population mean exam score lies between 70.04 and 73.96.
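The same calculation can be checked in a few lines of Python. This is a sketch of the z-interval formula with a known σ; the function name `z_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """CI = x_bar +/- Z * (sigma / sqrt(n)), assuming sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)   # margin of error
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=72, sigma=10, n=100)
print(f"[{low:.2f}, {high:.2f}]")  # [70.04, 73.96]
```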
---
Key Relationships
| Concept | Connection to Data Science |
|---|---|
| Hypothesis Testing | Powers A/B Testing (website optimization), clinical trials, feature significance |
| p-value | Used to determine if model coefficients are statistically significant |
| Confidence Interval | Provides a range estimate instead of a single point; used in polling, surveys |
| Type I/II Errors | Critical in medical diagnostics and fraud detection |
Confidence Interval vs Hypothesis Testing
| Feature | Confidence Interval | Hypothesis Testing |
|---|---|---|
| Purpose | Estimate a range for the parameter | Test a specific claim about the parameter |
| Output | A range (e.g., [70.04, 73.96]) | A decision: reject or fail to reject H₀ |
| Information | More informative (range + direction) | Less informative (binary decision) |
| Relationship | If the (1 − α) CI does not contain the null value, reject H₀ at level α (two-tailed) | If p-value ≤ α, reject H₀ |
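The relationship in the last row can be checked numerically: for a two-tailed z-test, the CI-based rule and the p-value rule lead to the same decision. The sketch below reuses the exam-score figures (mean 72, σ = 10, n = 100); the function name `ci_and_test_decisions` and the null values 70 and 71 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_and_test_decisions(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (CI rejects H0?, p-value rejects H0?) for a two-tailed z-test."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)

    # CI rule: reject H0 when mu0 falls outside the (1 - alpha) interval
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    low, high = sample_mean - z_crit * se, sample_mean + z_crit * se
    ci_rejects = not (low <= mu0 <= high)

    # p-value rule: reject H0 when the two-tailed p-value is at most alpha
    z = (sample_mean - mu0) / se
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    test_rejects = p_value <= alpha

    return ci_rejects, test_rejects


print(ci_and_test_decisions(72, mu0=70, sigma=10, n=100))  # (True, True)
print(ci_and_test_decisions(72, mu0=71, sigma=10, n=100))  # (False, False)
```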
Summary
- Hypothesis testing is a structured method for making statistical decisions from data.
- The null hypothesis (H₀) is the default; the alternative (H₁) is what we seek evidence for.
- The p-value measures the strength of evidence against H₀.
- Type I Error (false positive) and Type II Error (false negative) represent the risks of wrong decisions.
- Confidence intervals provide a range estimate for a parameter at a given confidence level.
- A 95% CI means: if we repeated the experiment 100 times, approximately 95 of those intervals would contain the true value.
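The repeated-experiment reading of a 95% CI can be illustrated with a small simulation. This is a sketch under assumed settings: the function name `coverage_rate` and the population (normal, mean 50, σ = 10), sample size, trial count, and seed are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage_rate(true_mean=50.0, sigma=10.0, n=30, trials=2000,
                  confidence=0.95, seed=0):
    """Fraction of simulated z-intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)

    hits = 0
    for _ in range(trials):
        # Draw one sample and build its confidence interval
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        x_bar = fmean(sample)
        if x_bar - margin <= true_mean <= x_bar + margin:
            hits += 1
    return hits / trials


print(f"Share of intervals containing the true mean: {coverage_rate():.3f}")
```

The printed share should land close to 0.95, with the exact figure depending on the seed and the number of trials.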