Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.


Unit 3: Data Pre-processing

Lesson 20 of 32 in the free Data Warehousing and Data Mining (Elective-II) notes on Siksha Sarovar, written by Rohit Jangra.

Unit III: Data Mining

1. Data Pre-processing

Real-world data is "dirty": it is often incomplete, noisy, and inconsistent. Data pre-processing is the critical step of cleaning and transforming raw data into an understandable and usable format. It is often said that about 80% of a data miner's time is spent on pre-processing.

1.1 Need for Pre-processing

  • Incomplete: Missing attribute values, lacking certain attributes of interest, or containing only aggregate data.
  • Noisy: Containing errors, outliers, or values that deviate from the expected (e.g., Salary: -500).
  • Inconsistent: Containing discrepancies in codes or names (e.g., "Color" vs "Colour", "M" vs "Male").

1.2 Data Cleaning

Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies.

A. Handling Missing Values

  1. Ignore the Tuple: Usually done when the class label is missing or if the tuple contains significant missing data. Not effective if the dataset is small or many tuples have missing values.
  2. Fill in the Value Manually: Time-consuming and not feasible for large datasets.
  3. Use a Global Constant: Fill all missing values with a label like "Unknown" or -1. Simple, but the mining program might mistakenly treat the shared constant as an interesting pattern.
  4. Use Attribute Mean: Fill missing values with the mean of that attribute for all samples.
  5. Use Attribute Mean for the Same Class: E.g., if a customer's income is missing, fill it with the average income of customers belonging to the same credit rating class.
  6. Use the Most Probable Value: Use inference-based methods (regression, Bayesian formalism, or decision trees) to predict the most likely value.
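Strategies 4 and 5 above can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the `customers` records, the attribute names `income` and `credit`, and both helper functions are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing income.
customers = [
    {"credit": "good", "income": 50000},
    {"credit": "good", "income": 70000},
    {"credit": "good", "income": None},
    {"credit": "fair", "income": 30000},
    {"credit": "fair", "income": None},
]


def fill_with_mean(rows, attr):
    """Strategy 4: replace missing values with the overall attribute mean."""
    known = [r[attr] for r in rows if r[attr] is not None]
    overall = mean(known)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Strategy 5: replace missing values with the mean of the same class.

    Assumes every class has at least one known value for attr.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(v) for c, v in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


global_filled = fill_with_mean(customers, "income")
class_filled = fill_with_class_mean(customers, "income", "credit")
```

With this sample, the overall mean fills both gaps with 50000, while the class-wise mean fills the "good" customer with 60000 and the "fair" customer with 30000, which shows why the class-wise variant usually preserves more structure.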

B. Handling Noisy Data

Noise is a random error or variance in a measured variable. Techniques to smooth data include:

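  1. Binning: Sorted data values are partitioned into "bins", and each value is then replaced by a summary of its bin (such as the bin mean, the bin median, or the nearest bin boundary).
  2. Regression: Fitting the data to a function (for example, a straight line) so that the fitted values smooth out noise.
  3. Clustering: Grouping similar values into clusters; values that fall outside every cluster can be treated as outliers.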
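As an illustration of binning, the sketch below uses one common variant: equal-depth bins smoothed by bin means. The function name `smooth_by_bin_means`, the bin size, and the sample `prices` are assumptions chosen for the example, not part of the notes.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Hypothetical price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
```

Here the three bins are [4, 8, 15], [21, 21, 24] and [25, 28, 34], so every value becomes 9.0, 22.0 or 29.0 respectively. Smoothing by bin medians or bin boundaries follows the same pattern with a different summary per bin.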

1.3 Data Integration

Data integration involves combining data from multiple sources into a coherent data store.
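A rough sketch of what integration can look like in practice, assuming two hypothetical sources that name the customer key differently (`cust_id` vs `customer_id`) and that one source encodes gender as "M"/"F". All names and records here are invented for illustration.

```python
# Two hypothetical sources describing overlapping customers.
crm_rows = [
    {"cust_id": 1, "gender": "M"},
    {"cust_id": 2, "gender": "F"},
]
billing_rows = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 3, "total": 90.0},
]

# Assumed mapping used to resolve inconsistent codes.
GENDER_CODES = {"M": "Male", "F": "Female"}


def integrate(crm, billing):
    """Merge both sources into one record per customer id."""
    merged = {}
    for row in crm:
        merged[row["cust_id"]] = {
            "id": row["cust_id"],
            "gender": GENDER_CODES.get(row["gender"], row["gender"]),
            "total": None,
        }
    for row in billing:
        # Customers seen only in billing get a record with unknown gender.
        record = merged.setdefault(
            row["customer_id"],
            {"id": row["customer_id"], "gender": None, "total": None},
        )
        record["total"] = row["total"]
    return [merged[key] for key in sorted(merged)]


unified = integrate(crm_rows, billing_rows)
```

Customers present in only one source end up with `None` in the attributes the other source would have supplied, which feeds directly back into the missing-value handling described earlier.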

1.4 Data Transformation

Data transformation converts data into forms appropriate for mining.

  • Normalization: Scaling attribute values to fall within a specified range (typically 0.0 to 1.0).
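One widely used way to do this is min-max normalization, which maps the smallest value to the bottom of the target range and the largest to the top. The sketch below assumes that technique; the function name, its parameters, and the sample values are illustrative only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]


# Hypothetical attribute values.
scores = [10, 20, 30, 50]
normalized = min_max_normalize(scores)
```

For this input the result is [0.0, 0.25, 0.5, 1.0]. Because the minimum and maximum drive the scaling, a single extreme outlier can squeeze the remaining values into a narrow band, so noisy data is usually cleaned first.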

1.5 Data Reduction

Data reduction obtains a reduced representation of the dataset that is much smaller in volume but maintains the integrity of the original data.

Strategy                 | Description                                       | Techniques
Dimensionality Reduction | Reducing the number of random variables/features. | PCA, Feature Selection.
Numerosity Reduction     | Reducing the volume of data.                      | Histograms, Clustering, Sampling.
Data Compression         | Reducing the size of data files.                  | Lossless vs. Lossy.
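Two of the numerosity-reduction techniques named above, sampling and histograms, can be sketched with the standard library alone. The function names, the fixed seed, and the sample values are assumptions made for this example.

```python
import random


def simple_random_sample(rows, k, seed=0):
    """Draw k rows without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


def equal_width_histogram(values, n_buckets):
    """Summarize values as counts over n_buckets equally wide ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    if width == 0:
        # All values are identical; they all fall in the first bucket.
        return [len(values)] + [0] * (n_buckets - 1)
    counts = [0] * n_buckets
    for v in values:
        # The maximum value is placed in the last bucket.
        index = min(int((v - lo) / width), n_buckets - 1)
        counts[index] += 1
    return counts


# Hypothetical attribute values.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20]
sample = simple_random_sample(prices, 4)
bucket_counts = equal_width_histogram(prices, 4)
```

The histogram replaces twelve raw values with four bucket counts, and the sample keeps only four of the original rows. Both trade detail for volume, so the right bucket count or sample size depends on how much precision the mining task needs.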