Data Collection, Sampling & Distributions
1. Data Collection Sources
Data collection is the foundation of any analysis. The quality and type of data collected determine everything downstream.
Study Deep: The Central Limit Theorem (CLT)
The CLT is the most important concept in sampling theory.
- The Idea: If you take large enough samples (a common rule of thumb is $n \geq 30$) from a population with finite variance, the distribution of the sample means will be approximately Normal (a bell curve), even if the original population is highly skewed or irregular.
- Why it matters: It lets us apply Normal-distribution methods to sample means from non-normal data, which is how we draw conclusions about an entire population from a sample.
Primary vs. Secondary Data:
| Feature | Primary Data | Secondary Data |
|---|---|---|
| Definition | Collected firsthand by the researcher for a specific purpose | Pre-existing data collected by someone else for a different purpose |
| Methods | Surveys, Interviews, Experiments, Observations, Sensors | Government Census, Kaggle Datasets, Published Reports, Company Records |
| Cost | High (time + money) | Low (often free or inexpensive) |
| Accuracy | High — tailored to your exact needs | Variable — may not perfectly fit your question |
| Timeliness | Current and up-to-date | May be outdated |
| Control | Full control over collection methodology | No control — must accept as-is |
| Example | Conducting a customer satisfaction survey | Using India Census 2011 data for demographic analysis |
2. What is Sampling?
Formal Definition: Sampling is the statistical process of selecting a representative subset (sample) from a larger group (population) to estimate characteristics of the entire population without examining every member.
- Population (N): The entire group of interest (e.g., All 1.4 billion citizens of India).
- Sample (n): The subset selected for study (e.g., 10,000 citizens).
- Sampling Frame: The list of all members from which the sample is drawn (e.g., Voter Registration List).
- Representativeness: The sample should accurately reflect the diversity and proportions of the population.
3. Types of Sampling Methods
| Method | Category | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|---|
| Simple Random | Probability | Every member has an equal chance (lottery) | Unbiased, simple | Needs complete list of population | Small, accessible populations |
| Stratified | Probability | Divide into strata (Gender, Age), then random sample from each | Ensures all subgroups represented | Requires knowledge of strata | Diverse populations |
| Cluster | Probability | Randomly select entire groups (schools, cities) | Cost-effective for large areas | Higher sampling error | Geographically spread populations |
| Systematic | Probability | Select every k-th member (e.g., every 5th person in a list) | Easy to implement | Risk of hidden patterns in order | Ordered lists |
| Convenience | Non-Probability | Survey whoever is easily available | Quick and cheap | Highly biased | Pilot studies, initial exploration |
| Quota | Non-Probability | Fill quotas (50 men, 50 women) non-randomly | Ensures category coverage | Biased within groups | Market research |
| Snowball | Non-Probability | Participants recruit others | Access to hidden populations | Biased toward social networks | Rare/sensitive populations |
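The four probability methods in the table can be sketched with Python's standard library. This is a minimal illustration, not a survey-grade implementation: the population of 100 member IDs, the two age strata, and the five "school" clusters are all hypothetical, and the function names are made up for this example.

```python
import random


def simple_random_sample(population, n, rng):
    """Every member has an equal chance of selection (no replacement)."""
    return rng.sample(population, n)


def systematic_sample(population, k, rng):
    """Pick a random start among the first k members, then every k-th member."""
    start = rng.randrange(k)
    return population[start::k]


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction at random from each stratum."""
    sample = {}
    for name, members in strata.items():
        size = round(len(members) * fraction)
        sample[name] = rng.sample(members, size)
    return sample


def cluster_sample(clusters, num_clusters, rng):
    """Randomly choose whole clusters and keep every member of each one."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {name: clusters[name] for name in chosen}


rng = random.Random(42)  # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical population: member IDs 1..100

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 5, rng)

# Hypothetical strata: 60 members under 30, 40 members aged 30 and over
strata = {"under_30": population[:60], "30_and_over": population[60:]}
stratified = stratified_sample(strata, 0.1, rng)

# Hypothetical clusters: five schools of 20 members each
clusters = {f"school_{i}": population[i * 20:(i + 1) * 20] for i in range(5)}
clustered = cluster_sample(clusters, 2, rng)

print("Simple random:", sorted(srs))
print("Systematic:", systematic)
print("Stratified sizes:", {name: len(ids) for name, ids in stratified.items()})
print("Clusters chosen:", list(clustered))
```

Note that the stratified sample keeps the 60:40 split of the population (6 and 4 members), while the cluster sample keeps every member of the two schools it selects.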
4. Sampling Distribution & Central Limit Theorem (CLT)
Sampling Distribution: Imagine you take 1,000 different samples of size n=50 from a population and calculate the mean for each. If you plot these 1,000 means, the resulting distribution is called the Sampling Distribution of the Sample Means.
Central Limit Theorem (CLT): The CLT states that, for a population with finite variance, the sampling distribution of the sample means approaches a Normal Distribution (Bell Curve) as the sample size increases, regardless of the shape of the population distribution. A common rule of thumb is n ≥ 30, although heavily skewed populations may need larger samples.
Key Properties of CLT:
- Mean of sampling distribution = Population mean ($\mu_{\bar{x}} = \mu$)
- Standard Error = $\sigma / \sqrt{n}$ (decreases as sample size increases)
- Shape approaches Normal regardless of original distribution shape
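These properties can be checked with a small simulation. The sketch below assumes a skewed exponential population with mean 2 (so its standard deviation is also 2); that population, the seed, and the variable names are illustrative choices, not part of the notes. It draws 1,000 samples of size n = 50, as in the description of the sampling distribution above.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

POP_MEAN = 2.0  # assumed exponential population: mean = 2, std dev = 2
n = 50
num_samples = 1000


def sample_mean(size):
    # random.expovariate takes the rate, which is 1 / mean
    return statistics.fmean(rng.expovariate(1 / POP_MEAN) for _ in range(size))


sample_means = [sample_mean(n) for _ in range(num_samples)]

mean_of_means = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)
theoretical_se = POP_MEAN / n ** 0.5  # sigma / sqrt(n)

# Share of sample means within 1.96 standard errors of the population mean;
# close to 0.95 if the Normal approximation is reasonable.
coverage = sum(
    abs(m - POP_MEAN) <= 1.96 * theoretical_se for m in sample_means
) / num_samples

print(f"Mean of sample means: {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"Observed SE: {observed_se:.3f}, theoretical SE: {theoretical_se:.3f}")
print(f"Share within 1.96 SE: {coverage:.3f}")
```

Even though the exponential population is strongly right-skewed, the mean of the sample means should land near 2 and their spread near 2 / √50 ≈ 0.283.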
Worked Example: A factory produces bolts with a mean length of 50mm and standard deviation of 5mm. If we take samples of n=100 bolts:
- Mean of sampling distribution = 50mm
- Standard Error = $5 / \sqrt{100} = 0.5$mm
- 95% of sample means will fall between $50 \pm 1.96(0.5)$ = 49.02mm to 50.98mm
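The bolt calculation can be reproduced in a few lines. This sketch uses `statistics.NormalDist` from the standard library to look up the 1.96 multiplier rather than hard-coding it; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu = 50.0     # population mean length (mm)
sigma = 5.0   # population standard deviation (mm)
n = 100       # sample size

standard_error = sigma / math.sqrt(n)

# z-value that leaves 2.5% in each tail of the standard Normal (about 1.96)
z = NormalDist().inv_cdf(0.975)

low = mu - z * standard_error
high = mu + z * standard_error

print(f"SE = {standard_error:.2f} mm, 95% interval = {low:.2f} mm to {high:.2f} mm")
# prints: SE = 0.50 mm, 95% interval = 49.02 mm to 50.98 mm
```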
5. Errors in Sampling
| Error Type | Definition | Cause | Solution |
|---|---|---|---|
| Sampling Error | Difference between sample statistic and true population parameter | Random chance (inherent in all sampling) | Increase sample size (n) |
| Non-Sampling Error | Systematic errors unrelated to sampling randomness | Bad survey design, data entry mistakes, non-response bias | Better methodology, training, follow-ups |
| Selection Bias | Sample is not representative of the population | Using convenience sampling, excluding certain groups | Use probability sampling methods |
| Response Bias | Respondents give inaccurate answers | Leading questions, social desirability | Neutral wording, anonymous surveys |
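The difference between sampling error and selection bias can be illustrated with a simulation. The sketch below assumes a hypothetical population in which 3,000 easy-to-reach members score around 60 and 7,000 hard-to-reach members score around 40; a convenience sample is modelled as drawing only from the easy-to-reach group. All numbers and names are made up for illustration.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: two groups with different typical scores
easy = [rng.gauss(60, 10) for _ in range(3000)]   # easy to reach
hard = [rng.gauss(40, 10) for _ in range(7000)]   # hard to reach
population = easy + hard
true_mean = statistics.fmean(population)


def draw_random(n):
    """Probability sampling: every member can be selected."""
    return rng.sample(population, n)


def draw_convenience(n):
    """Convenience sampling: only the easy-to-reach group is surveyed."""
    return rng.sample(easy, n)


def average_abs_error(draw, n, trials=200):
    """Average |sample mean - true mean| over repeated samples of size n."""
    errors = [abs(statistics.fmean(draw(n)) - true_mean) for _ in range(trials)]
    return statistics.fmean(errors)


results = {}
for n in (25, 100, 400):
    results[n] = (
        average_abs_error(draw_random, n),
        average_abs_error(draw_convenience, n),
    )
    print(f"n={n:>3}  random: {results[n][0]:.2f}  convenience: {results[n][1]:.2f}")
```

In this setup the error of the random sample shrinks as n grows, while the convenience sample stays far from the true mean at every sample size. That is why the table lists a larger n as the remedy for sampling error but a change of method as the remedy for selection bias.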
6. Worked Problem: Normal Distribution + Z-Score (Exam Style)
Problem (University Level): Heights of students at a university are normally distributed with mean μ = 5.5 feet and standard deviation σ = 0.5 feet. What proportion of students are between 5.81 feet and 6.1 feet tall? (Given: P(Z < 0.62) = 0.7324 and P(Z < 1.20) = 0.8849)
Step 1: Convert to Z-scores:
- Z₁ = (5.81 - 5.5) / 0.5 = 0.62
- Z₂ = (6.1 - 5.5) / 0.5 = 1.20
Step 2: Find area between Z₁ and Z₂:
- P(5.81 < X < 6.1) = P(Z < 1.20) - P(Z < 0.62)
- = 0.8849 - 0.7324 = 0.1525
Answer: About 15.25% of students have heights between 5.81 and 6.1 feet.
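The same answer can be checked without a Z-table using `statistics.NormalDist`. This is a quick verification sketch; because it uses unrounded probabilities instead of the four-decimal table values, the result may differ from 0.1525 in the fourth decimal place.

```python
from statistics import NormalDist

heights = NormalDist(mu=5.5, sigma=0.5)  # heights in feet

# Step 1: Z-scores for the two bounds
z1 = (5.81 - heights.mean) / heights.stdev
z2 = (6.1 - heights.mean) / heights.stdev

# Step 2: area between the bounds = P(X < 6.1) - P(X < 5.81)
proportion = heights.cdf(6.1) - heights.cdf(5.81)

print(f"Z1 = {z1:.2f}, Z2 = {z2:.2f}")
print(f"Proportion between 5.81 and 6.1 feet: {proportion:.3f}")
```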
7. Stratified Sampling — Exam Example
Problem: A college has 200 Engineering, 150 Science, and 50 Arts students. We want a stratified sample of 80 students. How many from each stream?
- Total Population N = 400
- Engineering: (200/400) × 80 = 40 students
- Science: (150/400) × 80 = 30 students
- Arts: (50/400) × 80 = 10 students
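Proportional allocation can be written as a small helper. The function name is hypothetical, and the largest-remainder step for handing out leftover units is an added assumption to cover cases where the shares are not whole numbers; it is not needed for this problem, where every share divides evenly.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes."""
    total = sum(strata_sizes.values())

    # Whole-number share and remainder for each stratum: (N_h * n) / N
    shares = {
        name: divmod(size * sample_size, total)
        for name, size in strata_sizes.items()
    }
    allocation = {name: quotient for name, (quotient, _) in shares.items()}

    # Assumption: give any leftover units to the strata with the
    # largest remainders so the allocation sums to sample_size.
    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(shares, key=lambda name: shares[name][1], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1

    return allocation


college = {"Engineering": 200, "Science": 150, "Arts": 50}
allocation = proportional_allocation(college, 80)
print(allocation)
# prints: {'Engineering': 40, 'Science': 30, 'Arts': 10}
```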