Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

5. Data Engineering & Preprocessing

Lesson 5 of 21 in the free Machine Learning notes on Siksha Sarovar, written by Rohit Jangra.

What is Data Preprocessing?

Real-world data is rarely clean. It is often incomplete (missing values or missing attributes), inconsistent (conflicting formats, units or codes), and noisy (errors and outliers). Data preprocessing is the set of techniques that transforms such raw data into a clean, consistent format that analysis tools and machine learning models can use.

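Fun Fact: An often-quoted rule of thumb says data scientists spend about 80% of their time collecting and cleaning data and only about 20% analyzing it. The exact split varies from project to project.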

Key Preprocessing Steps

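Step           | Problem Addressed                                       | Technique
---------------+---------------------------------------------------------+---------------------------------------------------------
Cleaning       | Missing values, noise                                   | Imputation (mean/median), removing rows, smoothing
Integration    | Data spread across multiple sources                     | Merging datasets, joining SQL tables
Transformation | Features on different scales (e.g., Age vs Salary)      | Min-max normalization (to 0-1), standardization (Z-score)
Reduction      | Too many variables                                      | PCA (Principal Component Analysis), feature selection
Encoding       | Most models need numeric input, not categorical text    | One-hot encoding, label encoding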
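
A few of the techniques in the table can be sketched in plain Python. This is a minimal illustration, not a production recipe: the records, column names and helper functions below are hypothetical and chosen only for the example. In practice, data libraries such as pandas or scikit-learn are normally used for these steps.

```python
from statistics import mean, pstdev

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "salary": 30000, "city": "Delhi"},
    {"age": None, "salary": 50000, "city": "Mumbai"},
    {"age": 35, "salary": 70000, "city": "Delhi"},
]

def impute_mean(values):
    """Cleaning: replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max(values):
    """Transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Transformation: subtract the mean, divide by the population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant column: every value equals the mean
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Encoding: one 0/1 column per distinct category, in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

ages = impute_mean([r["age"] for r in rows])
salaries = [r["salary"] for r in rows]
cities = one_hot([r["city"] for r in rows])

print(ages)               # the missing age is filled with the mean of 25 and 35
print(min_max(salaries))  # smallest salary maps to 0.0, largest to 1.0
print(z_score(salaries))  # values centred on 0
print(cities)             # columns are ordered: Delhi, Mumbai
```

Min-max scaling keeps the original shape of the distribution but is sensitive to outliers, while Z-score standardization centres each feature on zero, which is one reason it is often preferred when features have very different spreads.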

Feature Engineering

Feature engineering is the practice of creating new features from existing data so that a model can pick up patterns more easily and perform better.

  • Example: Extracting "Day of Week" from a "Timestamp".
  • Example: Creating "BMI" from "Height" and "Weight".
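
Both examples above can be sketched in a few lines of Python. The record layout, the field names and the units (height in metres, weight in kilograms) are assumptions made for illustration; the source does not specify them.

```python
from datetime import datetime

# Weekday names indexed by datetime.weekday(), where Monday is 0.
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")

def day_of_week(timestamp: str) -> str:
    """Derive the weekday name from an ISO 8601 timestamp string."""
    return DAYS[datetime.fromisoformat(timestamp).weekday()]

def bmi(height_m: float, weight_kg: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Hypothetical raw record with three original features.
record = {"timestamp": "2024-01-15T09:30:00", "height_m": 1.75, "weight_kg": 70.0}

# Add two engineered features alongside the originals.
record["day_of_week"] = day_of_week(record["timestamp"])
record["bmi"] = round(bmi(record["height_m"], record["weight_kg"]), 1)

print(record["day_of_week"])  # Monday
print(record["bmi"])          # 22.9
```

The new columns add no new information, but they express existing information in a form a model can use directly: a weekday category instead of a raw timestamp, and a single ratio instead of two separate measurements.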

Data Pipelines

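A data pipeline automates the flow of data from a source to a destination. The two common patterns differ in where the transformation happens. In ETL (Extract, Transform, Load), data is cleaned and reshaped before it is loaded into the destination. In ELT (Extract, Load, Transform), raw data is loaded first and then transformed inside the destination system, typically with SQL.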
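
The difference between the two patterns can be sketched with Python's built-in sqlite3 module, using an in-memory database as a stand-in for the destination. The source records, table names and function names are hypothetical, and real pipelines usually rely on dedicated orchestration tools and data warehouses rather than a script like this.

```python
import sqlite3

# Hypothetical raw source records: names need trimming, amounts arrive as text.
SOURCE = [("  alice ", "100"), ("BOB", "250"), ("carol  ", "75")]

def run_etl(conn):
    """ETL: transform in application code, then load only the clean rows."""
    clean = [(name.strip().title(), int(amount)) for name, amount in SOURCE]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", clean)

def run_elt(conn):
    """ELT: load the raw rows first, then transform inside the database with SQL."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", SOURCE)
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT UPPER(SUBSTR(TRIM(name), 1, 1)) || LOWER(SUBSTR(TRIM(name), 2)) AS name,
               CAST(amount AS INTEGER) AS amount
        FROM sales_raw
        """
    )

conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)

etl_rows = conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
elt_rows = conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
conn.close()

print(etl_rows)  # clean rows produced before loading
print(elt_rows)  # same clean rows, produced after loading
```

Both routes end with the same clean table. The ELT route also keeps the untouched sales_raw table in the destination, so the transformation can be rerun or changed later without extracting the data again.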