Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Privacy Policy | Terms of Service | Contact Siksha Sarovar | About Siksha Sarovar

v4.0.9 · PWA
Siksha Sarovar logo
Siksha Sarovar
Your Learning Universe

Siksha Sarovar is a free e-learning platform for coding courses, BCA university notes and competitive exam preparation. Optional Google sign-in saves your learning progress across devices.

Initializing knowledge base…
Compiling modules 0%

Unit 1 Overview

Lesson 1 of 32 in the free Data Warehousing and Data Mining(Elective-II) notes on Siksha Sarovar, written by Rohit Jangra.

Unit I: Basics Concepts of Data Ware Housing

Study Deep: The ETL Process (Extract, Transform, Load)

The backbone of any Data Warehouse is the ETL process. It is the "factory" that turns raw data into clean, analytical information.

  1. Extraction: Pulling data from multiple, often incompatible, source systems.
  2. Transformation: The most complex step. It includes:
  • Cleaning: Fixing errors (e.g., "M" to "Male").
  • Filtering: Removing irrelevant data.
  • Joining: Combining data from different sources.
  • Aggregation: Summarizing data (e.g., daily sales to monthly).
  1. Loading: Physically moving the cleaned data into the Data Warehouse or Data Mart.

Need for data warehouse, definition, Database Vs data warehouse, Data Mart, Data warehouse architecture, Star, snowflake and galaxy schemas for multidimensional databases, Defining various schemas, fact and dimension data, Concept hierarchies, Metadata repository, back end tools and utilities.

1. The Need for Data Warehousing

1.1 The Evolution of Data Management

In the early days of computing, organizations focused on automating operational processes (like payroll, inventory, and order processing). These systems are known as Operational Systems or OLTP (Online Transaction Processing) systems. While excellent for day-to-day operations, they failed to provide strategic decision-making support.

1.2 Limitations of Operational Systems

Operational systems are designed for speed and accuracy in recording transactions, not for analysis. The specific limitations that create the need for a data warehouse include:

  1. Data is Fragmented: Data is often scattered across different departments (Sales, HR, Finance) in different formats (Excel, Mainframes, SQL databases). It is difficult to get a "single version of the truth."
  2. Schema Design Mismatch: Operational databases use Normalized Schemas (3NF) to reduce redundancy. This splits data into many tables. A simple query like "Total sales by region for last year" might require joining 20 tables, which is computationally expensive and slows down the operational system.
  3. Historical Data Issues: Operational systems usually only keep current data (e.g., "Current Order Status"). Once an order is shipped and closed, the record might be archived or deleted. Decision-makers need historical trends (e.g., "Sales growth over 5 years").
  4. Performance Conflict: Running complex analytical queries on an operational database consumes resources (CPU, I/O), which slows down the transaction processing. If the CEO runs a heavy report, the cashier at the counter might experience a lag in scanning items.
  5. Lack of Integration: Different systems use different codes. For example, "Male" might be stored as 'M' in one system and '1' in another. A data warehouse integrates these inconsistencies.

1.3 The Solution: Data Warehousing

To solve these problems, organizations needed a separate environment designed specifically for analysis. This environment is the Data Warehouse. It acts as a central repository where data is extracted from various sources, cleaned, transformed, and stored for analytical purposes.