Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

1.5 Data Collection & Sampling

Lesson 5 of 32 in the free Data Visualisation and Analytics notes on Siksha Sarovar, written by Rohit Jangra.

Data Collection, Sampling & Distributions

Data collection is the foundation of any analysis. The quality and type of data collected determine what every later step (cleaning, modelling, visualisation) can reliably show.

Study Deep: The Central Limit Theorem (CLT)

The CLT is one of the most important concepts in sampling theory.

  • The Idea: If you take large enough samples (a common rule of thumb is $n \geq 30$) from a population with finite variance, the distribution of the sample means will be approximately Normal (a bell curve), even if the original population is highly skewed or irregular.
  • Why it matters: It lets analysts apply Normal-based methods (confidence intervals, z-tests) to sample means from non-Normal data, which is how we draw conclusions about an entire population from a sample.

1. Data Collection Sources

| Feature | Primary Data | Secondary Data |
| --- | --- | --- |
| Definition | Collected firsthand by the researcher for a specific purpose | Pre-existing data collected by someone else for a different purpose |
| Methods | Surveys, Interviews, Experiments, Observations, Sensors | Government Census, Kaggle Datasets, Published Reports, Company Records |
| Cost | High (time + money) | Low (often free or inexpensive) |
| Accuracy | High (tailored to your exact needs) | Variable (may not perfectly fit your question) |
| Timeliness | Current and up-to-date | May be outdated |
| Control | Full control over collection methodology | No control (data must be accepted as-is) |
| Example | Conducting a customer satisfaction survey | Using India Census 2011 data for demographic analysis |

2. What is Sampling?

Formal Definition: Sampling is the statistical process of selecting a representative subset (sample) from a larger group (population) to estimate characteristics of the entire population without examining every member.

  • Population (N): The entire group of interest (e.g., All 1.4 billion citizens of India).
  • Sample (n): The subset selected for study (e.g., 10,000 citizens).
  • Sampling Frame: The list of all members from which the sample is drawn (e.g., Voter Registration List).
  • Representativeness: The sample should accurately reflect the diversity and proportions of the population.
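
To make these terms concrete, here is a minimal sketch in Python. The population is synthetic (randomly generated ages) and the sizes N = 100,000 and n = 1,000 are assumptions chosen only for illustration; they do not come from the notes.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of N = 100,000 people (synthetic data).
population = [rng.randint(18, 80) for _ in range(100_000)]

# Simple random sample of n = 1,000 members, drawn without replacement.
sample = rng.sample(population, k=1_000)

population_mean = statistics.mean(population)  # the parameter (usually unknown)
sample_mean = statistics.mean(sample)          # the statistic that estimates it

print(f"N = {len(population)}, n = {len(sample)}")
print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```

In practice the population mean is not available, which is the whole reason for sampling; it is computed here only so the estimate can be compared against it.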

3. Types of Sampling Methods

| Method | Category | How It Works | Pros | Cons | Best For |
| --- | --- | --- | --- | --- | --- |
| Simple Random | Probability | Every member has an equal chance (lottery) | Unbiased, simple | Needs complete list of population | Small, accessible populations |
| Stratified | Probability | Divide into strata (Gender, Age), then random sample from each | Ensures all subgroups represented | Requires knowledge of strata | Diverse populations |
| Cluster | Probability | Randomly select entire groups (schools, cities) | Cost-effective for large areas | Higher sampling error | Geographically spread populations |
| Systematic | Probability | Select every k-th member (e.g., every 5th person in a list) | Easy to implement | Risk of hidden patterns in order | Ordered lists |
| Convenience | Non-Probability | Survey whoever is easily available | Quick and cheap | Highly biased | Pilot studies, initial exploration |
| Quota | Non-Probability | Fill quotas (50 men, 50 women) non-randomly | Ensures category coverage | Biased within groups | Market research |
| Snowball | Non-Probability | Participants recruit others | Access to hidden populations | Biased toward social networks | Rare/sensitive populations |
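
The four probability methods can be sketched in a few lines of Python. Everything below is illustrative: the sampling frame of 100 students, the `stream` and `school` fields, and the function names are hypothetical, not part of the notes.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 100 students tagged with a stream and a school.
frame = [
    {"id": i, "stream": "Science" if i % 4 == 0 else "Arts", "school": i % 5}
    for i in range(100)
]

def simple_random(frame, n):
    """Every member has the same chance of selection."""
    return rng.sample(frame, k=n)

def systematic(frame, k):
    """Pick a random start, then take every k-th member."""
    start = rng.randrange(k)
    return frame[start::k]

def stratified(frame, key, fraction):
    """Group members by `key`, then draw the same fraction from each stratum."""
    strata = {}
    for row in frame:
        strata.setdefault(row[key], []).append(row)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k=round(len(members) * fraction)))
    return sample

def cluster(frame, key, n_clusters):
    """Randomly choose whole groups and keep every member of each chosen group."""
    clusters = sorted({row[key] for row in frame})
    chosen = set(rng.sample(clusters, k=n_clusters))
    return [row for row in frame if row[key] in chosen]

srs_sample = simple_random(frame, 20)
systematic_sample = systematic(frame, 5)
stratified_sample = stratified(frame, "stream", 0.2)
cluster_sample = cluster(frame, "school", 2)
```

In this made-up frame the school label repeats every 5 rows, so `systematic(frame, 5)` draws all 20 students from a single school. That is the "hidden patterns in order" risk listed for systematic sampling.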

4. Sampling Distribution & Central Limit Theorem (CLT)

Sampling Distribution: Imagine you take 1,000 different samples of size n=50 from a population and calculate the mean for each. If you plot these 1,000 means, the resulting distribution is called the Sampling Distribution of the Sample Means.

Central Limit Theorem (CLT): The CLT states that, for a population with finite variance, the sampling distribution of the sample means approaches a Normal Distribution (Bell Curve) as the sample size increases, regardless of the shape of the population distribution. A common rule of thumb is n ≥ 30, although heavily skewed populations may need larger samples.

Key Properties of CLT:

  • Mean of sampling distribution = Population mean (μ_x̄ = μ)
  • Standard Error = σ / √n (decreases as sample size increases)
  • Shape approaches Normal regardless of original distribution shape
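
These properties can be checked with a small simulation. The sketch below assumes a skewed exponential population with mean 1 and standard deviation 1 (an illustrative choice, not from the notes) and mirrors the setup above: 1,000 samples of size n = 50.

```python
import math
import random
from statistics import fmean, stdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable

n = 50               # size of each sample
num_samples = 1_000  # number of samples drawn

# Assumed population model: exponential, strongly right-skewed, mu = 1, sigma = 1.
mu, sigma = 1.0, 1.0

# Draw 1,000 samples of size 50 and record the mean of each one.
sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = fmean(sample_means)     # should be close to mu
observed_se = stdev(sample_means)       # should be close to sigma / sqrt(n)
theoretical_se = sigma / math.sqrt(n)

# Share of sample means within mu +/- 1.96 standard errors (close to 0.95 if roughly Normal).
within = sum(abs(m - mu) <= 1.96 * theoretical_se for m in sample_means) / num_samples

print(f"mean of sample means = {mean_of_means:.3f} (population mean = {mu})")
print(f"observed SE = {observed_se:.3f}, sigma/sqrt(n) = {theoretical_se:.3f}")
print(f"share within 1.96 SE = {within:.3f}")
```

Plotting a histogram of `sample_means` would show the bell shape directly; the numeric checks here avoid a plotting dependency.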

Worked Example: A factory produces bolts with a mean length of 50mm and standard deviation of 5mm. If we take samples of n=100 bolts:

  • Mean of sampling distribution = 50mm
  • Standard Error = 5 / √100 = 0.5mm
  • About 95% of sample means will fall within 50 ± 1.96(0.5), i.e. between 49.02mm and 50.98mm
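
The same arithmetic as a short Python sketch, using the values from the worked example; the variable names are illustrative.

```python
import math

mu = 50.0     # population mean length in mm
sigma = 5.0   # population standard deviation in mm
n = 100       # bolts per sample

standard_error = sigma / math.sqrt(n)  # 5 / 10 = 0.5 mm

z = 1.96  # z-value that encloses the central 95% of a Normal distribution
lower = mu - z * standard_error
upper = mu + z * standard_error

print(f"standard error = {standard_error} mm")
print(f"95% of sample means lie between {lower:.2f} mm and {upper:.2f} mm")
```

Quadrupling the sample size to n = 400 would halve the standard error to 0.25mm, because the standard error shrinks with the square root of n.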

5. Errors in Sampling

| Error Type | Definition | Cause | Solution |
| --- | --- | --- | --- |
| Sampling Error | Difference between sample statistic and true population parameter | Random chance (inherent in all sampling) | Increase sample size (n) |
| Non-Sampling Error | Systematic errors unrelated to sampling randomness | Bad survey design, data entry mistakes, non-response bias | Better methodology, training, follow-ups |
| Selection Bias | Sample is not representative of the population | Using convenience sampling, excluding certain groups | Use probability sampling methods |
| Response Bias | Respondents give inaccurate answers | Leading questions, social desirability | Neutral wording, anonymous surveys |
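
The difference between sampling error and selection bias can be seen in a small simulation. This is a sketch under stated assumptions: the population is synthetic (Normal with mean 50 and standard deviation 10), and the biased frame, which leaves out every value below 45, is a hypothetical stand-in for "excluding certain groups".

```python
import random
from statistics import fmean

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 synthetic values, mean near 50, SD near 10.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = fmean(population)

def average_abs_error(frame, n, repeats=200):
    """Average |sample mean - true mean| over repeated random samples of size n."""
    errors = [abs(fmean(rng.sample(frame, k=n)) - true_mean) for _ in range(repeats)]
    return fmean(errors)

sizes = (10, 100, 1000)

# Sampling error: random samples from the full population.
sampling_error = {n: average_abs_error(population, n) for n in sizes}

# Selection bias: the frame silently excludes everyone below 45.
biased_frame = [x for x in population if x >= 45]
biased_error = {n: average_abs_error(biased_frame, n) for n in sizes}

for n in sizes:
    print(f"n = {n:>4}: random frame error = {sampling_error[n]:.2f}, "
          f"biased frame error = {biased_error[n]:.2f}")
```

The error from the full frame falls as n grows, while the error from the biased frame stays large at every sample size. A larger sample reduces sampling error but does not repair a frame that misses part of the population.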

6. Worked Problem: Normal Distribution + Z-Score (Exam Style)

Problem (University Level): Heights of students at a university are normally distributed with mean μ = 5.5 feet and standard deviation σ = 0.5 feet. What proportion of students are between 5.81 feet and 6.1 feet tall? (Given: P(z < 0.62) = 0.7324 and P(z < 1.2) = 0.8849)

Step 1: Convert to Z-scores:

  • Z₁ = (5.81 - 5.5) / 0.5 = 0.62
  • Z₂ = (6.1 - 5.5) / 0.5 = 1.20

Step 2: Find area between Z₁ and Z₂:

  • P(5.81 < X < 6.1) = P(Z < 1.20) - P(Z < 0.62)
  • = 0.8849 - 0.7324 = 0.1525

Answer: About 15.25% of students have heights between 5.81 and 6.1 feet.
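
The result can be cross-checked without a z-table using `statistics.NormalDist` from the Python standard library (Python 3.8+). This is only a verification sketch; in the exam the table values given in the problem should be used.

```python
from statistics import NormalDist

mu, sigma = 5.5, 0.5          # mean and standard deviation of heights, in feet
low, high = 5.81, 6.1         # interval of interest

# Step 1: convert the bounds to z-scores.
z1 = (low - mu) / sigma       # about 0.62
z2 = (high - mu) / sigma      # about 1.20

# Step 2: area between the bounds = CDF(upper) - CDF(lower).
heights = NormalDist(mu=mu, sigma=sigma)
proportion = heights.cdf(high) - heights.cdf(low)

print(f"z1 = {z1:.2f}, z2 = {z2:.2f}")
print(f"P({low} < X < {high}) = {proportion:.4f}")
```

The computed proportion agrees with the table-based answer of 0.1525 to about three decimal places; any small difference comes from the four-decimal rounding of the table values.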

7. Stratified Sampling — Exam Example

Problem: A college has 200 Engineering, 150 Science, and 50 Arts students. We want a stratified sample of 80 students. How many from each stream?

  • Total Population N = 400
  • Engineering: (200/400) × 80 = 40 students
  • Science: (150/400) × 80 = 30 students
  • Arts: (50/400) × 80 = 10 students
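
The calculation above is proportional allocation: each stratum gets the same share of the sample as it has of the population. A minimal sketch, with a hypothetical helper name `proportional_allocation`:

```python
def proportional_allocation(strata_sizes, sample_size):
    """Give each stratum a share of the sample equal to its share of the population."""
    total = sum(strata_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in strata_sizes.items()
    }

college = {"Engineering": 200, "Science": 150, "Arts": 50}
allocation = proportional_allocation(college, sample_size=80)

for stream, count in allocation.items():
    print(f"{stream}: {count} students")
print(f"Total sampled: {sum(allocation.values())}")
```

Here every share is a whole number, so the counts add up to exactly 80. When the shares are fractional, simple rounding can leave the total one or two off the target, and the counts then need a manual adjustment.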