Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

1.6 Big Data Technologies: The Ecosystem

Lesson 7 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

The Infrastructure of Big Data

Big Data workloads exceed the storage and compute capacity of a single computer, so they run on a cluster: a collection of interconnected computers (nodes) that work together as one system.

Apache Hadoop: The Foundation

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

The Three Pillars of Hadoop:

  1. HDFS (Hadoop Distributed File System):
  • Concept: It breaks large files into fixed-size "blocks" (128 MB by default) and distributes them across many machines.
  • Redundancy: It automatically keeps copies of each block (three by default) on different machines. If one machine fails, the data is still available on another.
  2. MapReduce (The Processor):
  • Map: Transforms each input record into key-value pairs (e.g., emit (word, 1) for every word in a document). The framework then sorts and groups these pairs by key.
  • Reduce: Aggregates the values for each key (e.g., sum the counts for each word across all documents).
  3. YARN (Yet Another Resource Negotiator):
  • The "Operating System" of Hadoop. It allocates cluster resources (CPU and memory) to jobs and schedules their tasks across machines so that no single machine is overloaded.
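
The word-count example used for Map and Reduce above can be sketched in plain Python. This is a single-process illustration of the data flow only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this sketch. In a real cluster the map and reduce calls would run on many machines, and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical sample input: two tiny "documents".
docs = {"d1": "big data big cluster", "d2": "data cluster data"}
result = word_count(docs)
print(result)
```

Because each `map_phase` call looks at one document and each `reduce_phase` call looks at one key, both steps can be spread across machines without the workers needing to share state.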

1.6.2 The "Small Files" Problem in HDFS

HDFS is designed for large files. Every file, directory, and block in HDFS is represented as an object in the Namenode's memory, taking up about 150 bytes.

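  • The Problem: Every file costs at least one file object and one block object, however small it is. Storing 1 million 1 KB files instead of one 1 GB file therefore consumes far more Namenode RAM for the same amount of data, and reading many tiny files adds seek and task-startup overhead that hurts I/O performance.
  • The Solution: Bundle small files together using Hadoop Archives (HAR) or SequenceFiles.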
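
To make the memory cost concrete, here is a rough back-of-the-envelope estimate in Python. It uses the 150-bytes-per-object figure quoted above and assumes a 128 MB block size; the function name `namenode_bytes` is made up for this sketch, and directory objects are ignored for simplicity.

```python
import math

BYTES_PER_OBJECT = 150          # rule-of-thumb figure quoted in the text
BLOCK_SIZE = 128 * 1024 ** 2    # assumed block size: 128 MB


def namenode_bytes(num_files, file_size_bytes, block_size=BLOCK_SIZE):
    """Estimate Namenode memory used by file and block objects."""
    blocks_per_file = max(1, math.ceil(file_size_bytes / block_size))
    objects = num_files * (1 + blocks_per_file)   # one file object + its blocks
    return objects * BYTES_PER_OBJECT


small = namenode_bytes(1_000_000, 1024)   # one million 1 KB files
large = namenode_bytes(1, 1024 ** 3)      # one 1 GB file

print(f"1,000,000 x 1 KB files: {small:,} bytes")
print(f"1 x 1 GB file:          {large:,} bytes")
```

Under these assumptions the small files need about 300 MB of Namenode memory (2 million objects), while the single large file needs about 1.35 KB (1 file object plus 8 block objects), even though both hold roughly the same amount of data.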

1.6.3 HDFS Federation

In massive clusters, a single Namenode becomes a bottleneck. HDFS Federation solves this by using multiple independent Namenodes.

  • Namespace Volumes: Each Namenode manages its own part of the file system (e.g., one for /user, one for /data).
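  • Block Pool: Each Namenode has its own block pool, the set of blocks belonging to its namespace. Every Datanode stores blocks from all block pools, but the Namenodes do not communicate with each other, so the failure of one does not affect the others.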
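
Since the Namenodes are independent, something on the client side has to decide which Namenode owns a given path. The sketch below shows one way that lookup could work, using a longest-prefix match over a mount table. The table contents, host names, and the function `resolve_namenode` are hypothetical; real deployments typically rely on a client-side mount table such as ViewFs rather than hand-written code like this.

```python
# Hypothetical mount table: namespace prefix -> Namenode address.
MOUNT_TABLE = {
    "/user": "namenode-1.example.com",
    "/data": "namenode-2.example.com",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the Namenode that owns `path`, using the longest matching prefix."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/database" does not match "/data".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace volume covers {path}")
    return mount_table[best]


print(resolve_namenode("/user/alice/notes.txt"))
print(resolve_namenode("/data/logs/2024"))
```

Adding capacity then means adding another entry to the table and another Namenode, without touching the existing ones.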

The Modern Technology Stack

| Technology | Role | Key Feature |
|---|---|---|
| Apache Spark | Processing | Up to 100x faster than MapReduce because it works "In-Memory." |
| Apache Kafka | Ingestion | Handles trillions of events per day in real-time streams. |
| NoSQL (MongoDB) | Database | Stores unstructured data without needing a rigid schema. |
| Cloud (AWS/GCP) | Infrastructure | Provides "Elastic" hardware on demand. |

Open Source and Big Data

The Big Data world is dominated by Open Source.

  • Why? Because the field moves too fast for any single company to own.
  • Apache Software Foundation: The home for most key projects (Hadoop, Spark, Hive, Cassandra, etc.).
  • Community Drive: Thousands of developers globally contribute code, ensuring the tools remain cutting-edge and free to use.

1.6.4 The Commercial Landscape: Big Data Vendors

Since Apache Hadoop is complex to set up, several companies created pre-packaged "Distributions" that include security, management tools, and support.

| Vendor | Distribution | Specialized Feature |
|---|---|---|
| Cloudera | CDH (Cloudera Distribution including Hadoop) | Focused on enterprise security and "Cloudera Manager." |
| Hortonworks | HDP (Hortonworks Data Platform) | Famous for being 100% open source without proprietary extensions. |
| MapR | MapR Converged Data Platform | Used a custom C++ based file system (MapR-FS) instead of HDFS for speed. |
| Cloud Providers | AWS EMR / Azure HDInsight | Managed Hadoop services that scale on-demand. |

1.6.5 Case Study: Walmart's Retail Intelligence

Walmart uses Big Data to manage a supply chain of over 11,000 stores.

  • The Problem: How to ensure that snow shovels are on the shelves before a blizzard hits, without overstocking and wasting money?
  • The Solution: Walmart's Hadoop clusters analyze 200 billion rows of data daily, from weather forecasts to past local purchase history, to predict demand at a hyper-local level.
  • Impact: Product stocking became more closely aligned with social media trends, leading to a 10-15% increase in online sales conversion.

1.6.6 The Evolution: From Mainframes to Hadoop

Big Data didn't replace traditional computing; it evolved from it to solve specific volume problems.

| Feature | Mainframe / Legacy SAN | Hadoop (Big Data) |
|---|---|---|
| Storage Model | Centralized, high-end storage. | Distributed, commodity hardware. |
| Cost | Expensive ($$$ per Gigabyte). | Cheap ($ per Terabyte). |
| Scalability | Vertical (Scaling Up). | Horizontal (Scaling Out). |
| Failure Handling | Redundant hardware (RAID). | Redundant software (Replication). |
| Data Locality | Data moves to the code. | Code moves to the data. |

  • Reduced Bandwidth: Sending only the "Alert" instead of 24/7 video streams.

1.6.8 Data Storage Architecture Evolution

Modern enterprises are moving beyond simple Hadoop clusters toward integrated architectures.

| Architecture | Storage | Schema | Key Use Case |
|---|---|---|---|
| Data Warehouse | Structured (RDBMS) | Schema-on-Write | Business Intelligence (BI) and Reporting. |
| Data Lake | Raw (HDFS/S3) | Schema-on-Read | Data Science and Machine Learning. |
| Data Lakehouse | Structured on Raw | Optimized Metadata | Real-time analytics on unstructured data. |
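
The difference between Schema-on-Write and Schema-on-Read in the table above comes down to when validation happens. The following sketch illustrates that idea with plain Python lists standing in for storage. The schema, the field names, and all function names are assumptions made for the example; they do not correspond to any particular warehouse or lake product.

```python
import json

SCHEMA = {"order_id": int, "amount": float}   # hypothetical schema


def validate(record, schema=SCHEMA):
    """Raise ValueError unless the record is a dict with correctly typed fields."""
    if not isinstance(record, dict):
        raise ValueError("record is not an object")
    for field, expected in schema.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    return record


def warehouse_write(table, record):
    """Schema-on-write: reject bad records before they are stored."""
    table.append(validate(record))


def lake_write(store, raw_line):
    """Schema-on-read: store the raw text untouched."""
    store.append(raw_line)


def lake_read(store):
    """Apply the schema only at read time; skip lines that do not fit."""
    rows = []
    for line in store:
        try:
            rows.append(validate(json.loads(line)))
        except ValueError:   # json.JSONDecodeError is a subclass of ValueError
            continue
    return rows


warehouse = []
warehouse_write(warehouse, {"order_id": 1, "amount": 9.5})

lake = []
lake_write(lake, '{"order_id": 2, "amount": 20.0}')
lake_write(lake, "corrupted line")   # accepted at write time
clean_rows = lake_read(lake)
print(len(lake), clean_rows)
```

The warehouse pays the validation cost up front and guarantees clean tables; the lake accepts anything cheaply and leaves each reader to decide which schema to apply.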

1.6.9 Big Data Governance & Security

Processing data is easy; securing it is hard.

  • Apache Atlas: Provides metadata management and data lineage (tracking where data came from and how it was transformed along the way).
  • Apache Ranger: A centralized security framework to manage fine-grained access control across the whole Hadoop ecosystem.
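
The idea of fine-grained, centrally managed access control can be illustrated with a small policy check. This is a conceptual sketch only and does not use or imitate the Apache Ranger API: the `Policy` class, the sample policies, and `is_allowed` are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    resource_prefix: str      # e.g. a path prefix in the file system
    groups: frozenset         # groups the policy applies to
    permissions: frozenset    # actions granted, e.g. {"read", "write"}


# Hypothetical policies kept in one central list.
POLICIES = [
    Policy("/data/sales", frozenset({"analysts"}), frozenset({"read"})),
    Policy("/data/sales", frozenset({"etl"}), frozenset({"read", "write"})),
]


def is_allowed(user_groups, resource, action, policies=POLICIES):
    """Allow only if some policy covers the resource, one of the groups, and the action."""
    for p in policies:
        covers = (resource == p.resource_prefix
                  or resource.startswith(p.resource_prefix + "/"))
        if covers and p.groups & set(user_groups) and action in p.permissions:
            return True
    return False   # deny by default


print(is_allowed({"analysts"}, "/data/sales/2024.csv", "read"))
print(is_allowed({"analysts"}, "/data/sales/2024.csv", "write"))
```

Keeping every rule in one place and denying by default means a new dataset is inaccessible until someone explicitly grants access to it.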