Aim
To generate two related random number series with NumPy (Y built as roughly 0.7 × X plus noise), place them in a DataFrame, and measure their linear relationship with the Pearson correlation coefficient via df["X"].corr(df["Y"]).
CO Mapping: CO1, CO2, CO4, CO5
Theory
The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two variables. It is the covariance of X and Y divided by the product of their standard deviations:
r = cov(X, Y) / (σ_X · σ_Y)
That normalisation forces r into the fixed range −1 to +1, making it unit-free and comparable across datasets (unlike raw covariance — see Practical 13). Reading the value: r = +1 means all points lie exactly on an upward-sloping line, r = −1 on a downward one, r = 0 means no linear association. Rules of thumb: |r| above 0.7 is usually called strong, 0.3–0.7 moderate, below 0.3 weak.
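The formula can be checked numerically. The sketch below is only an illustration and is not part of the practical's program: the arrays a and b are made-up values, and pearson_r is a hypothetical helper name. It computes r as covariance divided by the product of the standard deviations, compares the result with NumPy's np.corrcoef, and rescales one variable to show that r is unit-free.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r as cov(X, Y) / (sigma_X * sigma_Y), using sample (ddof=1) estimates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]      # off-diagonal entry is cov(X, Y)
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Made-up illustrative data (not the practical's dataset)
a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

r_manual = pearson_r(a, b)                   # formula from the text
r_numpy = np.corrcoef(a, b)[0, 1]            # NumPy's built-in Pearson r
r_rescaled = pearson_r(a * 100, b)           # same data, a expressed in different units

print(round(r_manual, 4), round(r_numpy, 4), round(r_rescaled, 4))
```

For this data all three values agree at 0.8. The covariance itself changes by a factor of 100 when a is rescaled, but the division by σ_X cancels that factor, which is the sense in which r is comparable across datasets.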
This practical constructs correlation deliberately. x is 30 random integers between 10 and 99; y is defined as 0.7 * x plus uniform integer noise between −10 and +9, with the sum truncated to an integer by .astype(int). Y therefore contains a genuine linear signal (slope 0.7) partially obscured by noise, so r should come out strongly positive but not exactly 1. Because np.random.seed(42) fixes the generator, every run (and every student) gets identical numbers, a reproducibility discipline carried over from Practical 6.
Two standing warnings: correlation is not causation — here we happen to know Y is computed from X because we wrote the code, but with real-world data an r of 0.96 says nothing about direction or mechanism (a lurking third variable may drive both). And r only sees straight lines: a perfect parabola can score r ≈ 0, so always scatter-plot before trusting the number.
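The second warning is easy to demonstrate. The following minimal sketch uses made-up data (x values chosen symmetric around zero, which is an assumption of this example and not part of the practical) to show a perfect parabola whose Pearson r is zero up to floating-point error.

```python
import numpy as np

# Symmetric x values around zero, so the falling and rising halves cancel out
x = np.arange(-10, 11)          # -10, -9, ..., 10
y = x ** 2                      # y is completely determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r for y = x^2: {r:.4f}")
```

Y here is fully predictable from X, yet r reports no linear association. A scatter plot would reveal the curve immediately.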
Dataset
Generated in the snippet with np.random.seed(42) — 30 pairs; the first five rows printed by df.head() are:
| Index | X | Y |
|---|---|---|
| 0 | 61 | 47 |
| 1 | 24 | 20 |
| 2 | 81 | 60 |
| 3 | 70 | 57 |
| 4 | 30 | 22 |
Procedure
- Seed NumPy with np.random.seed(42) so the "random" data is reproducible.
- Generate x = np.random.randint(10, 100, 30): 30 integers in [10, 99].
- Build y = (0.7 * x + np.random.randint(-10, 10, 30)).astype(int): a linear function of x with additive noise in [−10, 9].
- Assemble both arrays into the DataFrame df with columns X and Y and print df.head().
- Compute corr_value = df["X"].corr(df["Y"]); pandas' .corr() defaults to the Pearson method.
- Print the coefficient rounded to 4 decimal places.
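Put together, the steps above can be sketched as the short program below. It follows the calls named in the procedure; the intermediate variable noise and the wording of the final print label are assumptions made for readability and may differ from the snippet distributed with the practical.

```python
import numpy as np
import pandas as pd

np.random.seed(42)                          # fix the global generator for reproducibility

x = np.random.randint(10, 100, 30)          # 30 integers in [10, 99] (upper bound is exclusive)
noise = np.random.randint(-10, 10, 30)      # 30 integers in [-10, 9]
y = (0.7 * x + noise).astype(int)           # astype(int) truncates the float result toward zero

df = pd.DataFrame({"X": x, "Y": y})
print(df.head())                            # first five (X, Y) pairs

corr_value = df["X"].corr(df["Y"])          # Series.corr uses method="pearson" by default
print("Correlation coefficient:", round(corr_value, 4))
```

Drawing x before the noise matters: both calls consume the same seeded stream, so swapping their order would produce different numbers.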
Interpretation of Results
With seed 42 the program prints r = 0.9589, a very strong positive linear correlation and exactly what the construction predicts. Sanity-check it against the printed head: when X is large (81) Y is large (60); when X is small (24) Y is small (20), so the pairs rise and fall together. Why not r = 1.0? The noise term adds up to about 10 units of scatter around the perfect line 0.7x (the noise lies in [−10, 9]); relative to X's wide spread (values 11–98, so the signal component spans roughly 8–69), that noise is small, which is why r stays near 1 rather than degrading toward 0.7 or lower. The lesson in reverse: had the noise range been ±50 with the same slope, r would fall sharply, because correlation strength reflects the signal-to-noise ratio, not the slope itself. The slope 0.7 sets the direction (positive); the tight noise sets the magnitude (≈ 0.96).
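The signal-to-noise point can be tried directly. The sketch below is a hypothetical experiment and not part of the practical's program: corr_with_noise is an illustrative helper name, and the wider noise range [−50, 49] is an assumed stand-in for the "±50" case. The slope stays at 0.7 in both runs.

```python
import numpy as np
import pandas as pd

def corr_with_noise(half_width, n=30, seed=42):
    """Pearson r for y = 0.7*x + integer noise in [-half_width, half_width - 1]."""
    np.random.seed(seed)                                   # same seed for each call
    x = np.random.randint(10, 100, n)
    noise = np.random.randint(-half_width, half_width, n)
    y = (0.7 * x + noise).astype(int)
    return pd.Series(x).corr(pd.Series(y))

r_tight = corr_with_noise(10)    # the practical's noise range, [-10, 9]
r_wide = corr_with_noise(50)     # hypothetical wider noise, [-50, 49]

print(f"noise in [-10, 9]:  r = {r_tight:.4f}")
print(f"noise in [-50, 49]: r = {r_wide:.4f}")
```

Only the noise width changes between the two calls, so any drop in r comes from the noise and not from the slope.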
Common Mistakes
- Reading the 0.7 in the formula as the expected correlation — 0.7 is the slope; r depends on how big the noise is relative to the signal, and here lands at 0.9589.
- Omitting the seed and then puzzling over why the printed coefficient differs between runs and from classmates' outputs.
- Declaring "X causes Y" from a high r on real data — correlation is symmetric and silent about causation; only here do we know the mechanism because we coded it.
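Two of these mistakes can be checked in a few lines. The sketch below is illustrative only: run_once is a hypothetical helper that rebuilds the X/Y frame, optionally reseeding first. It shows that seeded runs match exactly, that a run without reseeding continues the random stream, and that swapping X and Y leaves r unchanged.

```python
import numpy as np
import pandas as pd

def run_once(seed=None):
    """Build the X/Y frame once; seed=None leaves the generator state as it is."""
    if seed is not None:
        np.random.seed(seed)
    x = np.random.randint(10, 100, 30)
    y = (0.7 * x + np.random.randint(-10, 10, 30)).astype(int)
    return pd.DataFrame({"X": x, "Y": y})

df_a = run_once(seed=42)
df_b = run_once(seed=42)
print("seeded runs identical:", df_a.equals(df_b))

df_c = run_once()                # no reseed: draws continue from the current state
print("r without reseeding:", round(df_c["X"].corr(df_c["Y"]), 4))

r_xy = df_a["X"].corr(df_a["Y"])
r_yx = df_a["Y"].corr(df_a["X"])
print("r(X, Y) equals r(Y, X):", bool(np.isclose(r_xy, r_yx)))
```

Because r(X, Y) and r(Y, X) are the same number, the coefficient alone cannot indicate which variable, if either, drives the other.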
🎯 Viva Questions
- What range can the Pearson r take? −1 to +1 inclusive; the sign gives direction, the magnitude gives strength of the linear relationship.
- How is r related to covariance? r = cov(X, Y) / (σ_X σ_Y) — covariance normalised by both standard deviations.
- Why is r here 0.9589 and not exactly 1? The uniform noise in [−10, 9] scatters points around the line y = 0.7x, weakening the fit slightly.
- What would r ≈ 0 mean? No linear association — though a strong non-linear relationship (e.g. parabolic) could still exist.
- What does np.random.seed(42) achieve? Deterministic pseudo-random output, so results are reproducible across runs and machines.
- Does r = 0.96 prove X causes Y? No. Correlation never establishes causation; here we only know the mechanism because y was explicitly computed from x.