SciPy: Scientific & Statistical Computing
Definition: SciPy (Scientific Python) is an open-source library that builds on NumPy to provide additional functionality for scientific and technical computing. It includes modules for optimization, integration, interpolation, linear algebra, signal processing, and statistical testing.
import scipy
---
SciPy vs NumPy
| Feature | NumPy | SciPy |
|---|---|---|
| Focus | Array operations | Scientific computing |
| Statistics | Basic (mean, std) | Advanced (t-test, chi-square, ANOVA) |
| Linear Algebra | Core routines (solve, inverse, eigenvalues) | Extended (more decompositions, matrix functions, sparse matrices) |
| Optimization | Not available | Curve fitting, minimization |
| Integration | Trapezoidal rule only | Numerical integration (quadrature, ODE solvers) |
| Signal Processing | FFT and basic convolution | Filter design, filtering, spectral analysis |
| Relationship | Foundation | Built on top of NumPy |
---
Key SciPy Modules
| Module | Import | Purpose |
|---|---|---|
| scipy.stats | from scipy import stats | Statistical functions & tests |
| scipy.optimize | from scipy import optimize | Optimization & curve fitting |
| scipy.linalg | from scipy import linalg | Linear algebra (beyond NumPy) |
| scipy.integrate | from scipy import integrate | Numerical integration |
| scipy.interpolate | from scipy import interpolate | Interpolation |
| scipy.signal | from scipy import signal | Signal processing |
| scipy.sparse | from scipy import sparse | Sparse matrices |
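To make the table concrete, here is a minimal sketch that touches four of the modules listed above. The inputs (the matrix `A`, the vector `b`, the short signal, and the small dense array) are made-up values chosen only for illustration.

```python
import numpy as np
from scipy import integrate, linalg, signal, sparse

# scipy.integrate: integrate sin(x) from 0 to pi (the exact answer is 2)
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# scipy.linalg: solve the linear system A @ x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

# scipy.signal: find the indices of local maxima in a 1-D signal
peaks, _ = signal.find_peaks([0, 2, 0, 3, 0, 1, 0])

# scipy.sparse: store only the non-zero entries of a mostly-zero matrix
dense = np.array([[0, 0, 3], [4, 0, 0], [0, 0, 0]])
compressed = sparse.csr_matrix(dense)

print("Integral:", area)
print("Solution:", solution)
print("Peak indices:", peaks)
print("Stored non-zero entries:", compressed.nnz)
```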
---
scipy.stats — Statistical Functions
Descriptive Statistics
from scipy import stats
import numpy as np
data = [23, 45, 12, 67, 34, 89, 56, 78, 45, 34]
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", stats.mode(data))
print("Skewness:", stats.skew(data))
print("Kurtosis:", stats.kurtosis(data))
| Function | Description |
|---|---|
| stats.describe(data) | Summary: count, min/max, mean, variance, skewness, kurtosis |
| stats.skew(data) | Measure of asymmetry |
| stats.kurtosis(data) | Measure of tail heaviness (excess kurtosis by default; 0 for a normal distribution) |
| stats.zscore(data) | Z-scores for outlier detection |
| stats.mode(data) | Most frequent value (the smallest one if there is a tie) |
| stats.sem(data) | Standard error of the mean |
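As a rough illustration of two functions from the table that the code above does not call, the sketch below runs `stats.describe` and `stats.zscore` on the same `data` list. The cutoff of 3 for flagging outliers is a common rule of thumb and an assumption here, not something SciPy prescribes.

```python
from scipy import stats

data = [23, 45, 12, 67, 34, 89, 56, 78, 45, 34]

# describe() returns a named result with several summary statistics at once
summary = stats.describe(data)
print("Count:", summary.nobs)
print("Min/Max:", summary.minmax)
print("Mean:", summary.mean)
print("Sample variance:", summary.variance)

# zscore() standardizes each value: (value - mean) / standard deviation
z = stats.zscore(data)

# Assumed rule of thumb: flag values more than 3 standard deviations from the mean
outliers = [x for x, score in zip(data, z) if abs(score) > 3]
print("Outliers:", outliers)
```

With this small sample no value lies that far from the mean, so the list of outliers comes back empty.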
---
Hypothesis Testing with SciPy
1. t-Test (Compare Means)
One-Sample t-Test: Is the sample mean different from a known value?
t_stat, p_value = stats.ttest_1samp(data, popmean=50)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")
Two-Sample t-Test: Are two groups significantly different?
group_a = [85, 90, 78, 92, 88]
group_b = [70, 75, 68, 80, 72]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference!")
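By default `stats.ttest_ind` assumes both groups have the same variance. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The sketch below compares the two variants on the same groups; it is an illustration, not a recommendation for which one to use.

```python
from scipy import stats

group_a = [85, 90, 78, 92, 88]
group_b = [70, 75, 68, 80, 72]

# Student's t-test: assumes equal variances in both groups (the default)
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

Because the two groups have the same size, both variants produce the same t-statistic here; only the degrees of freedom, and therefore the p-value, differ.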
2. Chi-Square Test (Categorical Data)
Tests whether two categorical variables are independent.
observed = [[30, 10], [20, 40]]
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-Square: {chi2:.3f}, p-value: {p_value:.3f}")
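The `expected` array returned above holds the counts that would be expected if the two variables were independent. For a 2x2 table, `chi2_contingency` also applies Yates' continuity correction by default, which lowers the statistic slightly. A small sketch of both points, using the same `observed` table:

```python
import numpy as np
from scipy import stats

observed = [[30, 10], [20, 40]]

# Default call: Yates' continuity correction is applied for 2x2 tables
chi2_corrected, p_corrected, dof, expected = stats.chi2_contingency(observed)

# Same test without the continuity correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

print("Expected counts under independence:")
print(np.round(expected, 1))
print(f"Degrees of freedom: {dof}")
print(f"With correction:    chi2 = {chi2_corrected:.3f}, p = {p_corrected:.5f}")
print(f"Without correction: chi2 = {chi2_plain:.3f}, p = {p_plain:.5f}")
```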
3. ANOVA (Compare 3+ Groups)
Tests whether at least one of three or more group means differs from the others.
group1 = [85, 90, 78, 92]
group2 = [70, 75, 68, 80]
group3 = [60, 65, 58, 72]
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.3f}, p-value: {p_value:.4f}")
Statistical Tests Summary
| Test | Function | Use Case |
|---|---|---|
| One-Sample t-Test | stats.ttest_1samp() | Compare sample mean to known value |
| Two-Sample t-Test | stats.ttest_ind() | Compare means of 2 independent groups |
| Paired t-Test | stats.ttest_rel() | Compare before/after measurements |
| Chi-Square | stats.chi2_contingency() | Independence of categorical variables |
| ANOVA | stats.f_oneway() | Compare means of 3+ groups |
| Mann-Whitney U | stats.mannwhitneyu() | Non-parametric alternative to the two-sample t-test |
| Shapiro-Wilk | stats.shapiro() | Test for normality |
| Pearson Correlation | stats.pearsonr() | Linear correlation between two variables |
| Spearman Correlation | stats.spearmanr() | Monotonic (rank-based) correlation, not necessarily linear |
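The tests in the table that have no example above follow the same calling pattern. The sketch below runs them on made-up data: the `before`/`after` scores and the `x`/`y` pairs are hypothetical values chosen only to show the calls.

```python
from scipy import stats

# Hypothetical scores for the same six people, measured twice
before = [72, 68, 75, 80, 66, 71]
after = [75, 70, 79, 84, 67, 76]

# Paired t-test: compares the per-person differences
t_stat, p_paired = stats.ttest_rel(before, after)

# Mann-Whitney U: rank-based comparison, treating the two lists as independent samples
u_stat, p_mwu = stats.mannwhitneyu(before, after, alternative="two-sided")

# Shapiro-Wilk: a small p-value suggests the data are not normally distributed
w_stat, p_normal = stats.shapiro(before)

# y grows with x, but not along a straight line (y = x squared)
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]
r, p_r = stats.pearsonr(x, y)        # strength of the linear relationship
rho, p_rho = stats.spearmanr(x, y)   # strength of the monotonic relationship

print(f"Paired t-test:  t = {t_stat:.3f}, p = {p_paired:.4f}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_mwu:.4f}")
print(f"Shapiro-Wilk:   W = {w_stat:.3f}, p = {p_normal:.4f}")
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Spearman's rho is 1 for the `x`/`y` pairs because `y` always increases with `x`, while Pearson's r stays below 1 because the relationship is curved.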
---
Probability Distributions
SciPy provides access to 100+ probability distributions:
# Normal Distribution
from scipy.stats import norm
x = norm.rvs(loc=0, scale=1, size=1000) # Generate random samples
pdf = norm.pdf(0) # Probability density at x=0
cdf = norm.cdf(1.96) # Cumulative probability up to 1.96
ppf = norm.ppf(0.975) # Inverse CDF (percentile)
| Method | Description |
|---|---|
| .rvs() | Random samples |
| .pdf() | Probability Density Function |
| .cdf() | Cumulative Distribution Function |
| .ppf() | Percent Point Function (inverse CDF) |
| .mean() | Distribution mean |
| .std() | Distribution standard deviation |
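These methods can also be called on a "frozen" distribution, where `loc` and `scale` are fixed once. The sketch below assumes a hypothetical population of heights with mean 170 cm and standard deviation 10 cm; the numbers are illustrative only.

```python
from scipy.stats import norm

# Hypothetical population: mean 170 cm, standard deviation 10 cm
heights = norm(loc=170, scale=10)  # frozen distribution with fixed parameters

# Probability that a height is at most 180 cm (one standard deviation above the mean)
p_below_180 = heights.cdf(180)

# Central 95% range: the 2.5th and 97.5th percentiles
lower, upper = heights.ppf([0.025, 0.975])

# ppf is the inverse of cdf, so the round trip returns the original probability
round_trip = heights.cdf(heights.ppf(0.3))

print(f"P(height <= 180) = {p_below_180:.4f}")
print(f"95% of heights fall between {lower:.1f} and {upper:.1f} cm")
print(f"Mean = {heights.mean()}, Std = {heights.std()}")
```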
---
scipy.optimize — Curve Fitting
import numpy as np
from scipy.optimize import curve_fit
# Linear model to fit: y = a * x + b
def model(x, a, b):
    return a * x + b
x_data = np.array([1, 2, 3, 4, 5])
y_data = np.array([2.2, 4.1, 5.8, 8.3, 9.9])
params, covariance = curve_fit(model, x_data, y_data)
print(f"a = {params[0]:.2f}, b = {params[1]:.2f}")
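The second value returned by `curve_fit` is the estimated covariance matrix of the parameters. A common way to read it, sketched below under the assumption that the linear model is appropriate for this data, is to take the square root of its diagonal as one-standard-deviation errors on `a` and `b`. The prediction at `x = 6` is an added illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * x + b

x_data = np.array([1, 2, 3, 4, 5])
y_data = np.array([2.2, 4.1, 5.8, 8.3, 9.9])

params, covariance = curve_fit(model, x_data, y_data)

# Diagonal of the covariance matrix holds the variance of each fitted parameter
std_errors = np.sqrt(np.diag(covariance))
print(f"a = {params[0]:.2f} +/- {std_errors[0]:.2f}")
print(f"b = {params[1]:.2f} +/- {std_errors[1]:.2f}")

# Use the fitted parameters to predict y at a new x value
y_at_6 = model(6, *params)
print(f"Predicted y at x = 6: {y_at_6:.2f}")
```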
---
SciPy in Data Science
| Application | SciPy Module | How It's Used |
|---|---|---|
| A/B Testing | scipy.stats | t-tests to compare conversion rates |
| Feature Selection | scipy.stats | Chi-square tests for categorical features |
| Normality Testing | scipy.stats | Shapiro-Wilk test before parametric tests |
| Optimization | scipy.optimize | Minimizing loss functions |
| Interpolation | scipy.interpolate | Filling gaps in time series data |
| Sparse Data | scipy.sparse | Efficient storage for text data (TF-IDF) |
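Two rows of the table, optimization and interpolation, can be sketched in a few lines. The quadratic `loss` function and the temperature readings below are hypothetical stand-ins for a real loss function and a real time series.

```python
import numpy as np
from scipy import interpolate, optimize

# Optimization: minimize a hypothetical loss whose minimum is at w = 3
def loss(w):
    return (w[0] - 3.0) ** 2 + 1.0

result = optimize.minimize(loss, x0=[0.0])
best_w = result.x[0]
print(f"Best w: {best_w:.4f}, loss: {result.fun:.4f}")

# Interpolation: fill missing (NaN) readings in a hypothetical hourly series
hours = np.array([0, 1, 2, 3, 4, 5])
temps = np.array([15.0, np.nan, 17.0, np.nan, np.nan, 20.0])

known = ~np.isnan(temps)                                  # mask of available readings
fill = interpolate.interp1d(hours[known], temps[known])   # linear by default
filled = np.where(known, temps, fill(hours))              # keep known values, fill the gaps
print("Filled series:", filled)
```

Linear interpolation is only one choice for filling gaps; it assumes the value changes at a steady rate between known readings.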
Summary
- SciPy extends NumPy with advanced scientific computing functionality.
- scipy.stats is the most important module for data scientists: it provides hypothesis testing, distributions, and correlations.
- t-tests, chi-square, and ANOVA are essential for statistical analysis and A/B testing.
- SciPy's probability distribution functions (pdf, cdf, ppf) are used for statistical modeling.
- Optimization and curve fitting tools help in model development.