The Infrastructure of Big Data
You cannot process Big Data using a single computer. You need a Cluster—a collection of interconnected computers working together.
Apache Hadoop: The Foundation
Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
The Three Pillars of Hadoop:
- HDFS (Hadoop Distributed File System):
  - Concept: It breaks large files into smaller "blocks" and distributes them across many machines.
  - Redundancy: It automatically creates copies of each block (three by default). If one machine fails, the data is still safe on another.
- MapReduce (The Processor):
  - Map: Transforms each input record into key-value pairs (e.g., emit a count for each word in each document). The framework then sorts and groups the pairs by key.
  - Reduce: Aggregates the values for each key (e.g., sum the counts from all documents).
- YARN (Yet Another Resource Negotiator):
  - The "Operating System" of Hadoop. It allocates cluster resources (CPU and memory) to applications and schedules their tasks, so that no single machine is overloaded.
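The Map and Reduce steps described above can be illustrated with a minimal, single-machine sketch in plain Python. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the `shuffle` step stands in for the grouping that the framework performs between the two phases on a real cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data moves to clusters"])
print(counts["big"], counts["data"], counts["moves"])  # prints: 2 2 1
```

On a cluster, each call to `map_phase` would run on the machine that holds the document's block, and each key's `reduce_phase` could run on a different machine; the logic per record stays the same.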
1.6.2 The "Small Files" Problem in HDFS
HDFS is designed for large files. Every file, directory, and block in HDFS is represented as an object in the Namenode's memory, taking up about 150 bytes.
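- The Problem: If you store 1 million 1KB files instead of one 1GB file, the Namenode must track about two million objects (one file object and one block object per file, roughly 300 MB of RAM) instead of a handful. Jobs also slow down, because they spend more time opening files than reading data.
- The Solution: Bundle small files together using Hadoop Archives (HAR) or SequenceFiles.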
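The memory cost can be estimated with a short back-of-the-envelope calculation. The sketch below uses the 150-byte rule of thumb from the text and assumes a 128 MiB block size (the HDFS default in Hadoop 2 and later); it ignores directory objects, and `namenode_bytes` is a hypothetical helper, not part of any Hadoop tool.

```python
import math

BYTES_PER_OBJECT = 150        # rule-of-thumb figure from the text
BLOCK_SIZE = 128 * 1024**2    # assumed block size: 128 MiB


def namenode_bytes(file_count, file_size):
    """Estimate Namenode memory for file_count files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = file_count * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


small = namenode_bytes(1_000_000, 1024)  # one million 1 KB files
large = namenode_bytes(1, 1024**3)       # one 1 GB file (8 blocks)

print(f"{small:,} bytes vs {large:,} bytes")  # prints: 300,000,000 bytes vs 1,350 bytes
```

Under these assumptions the same gigabyte of data costs the Namenode roughly 220,000 times more memory when it is stored as small files.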
1.6.3 HDFS Federation
In massive clusters, a single Namenode becomes a bottleneck. HDFS Federation solves this by using multiple independent Namenodes.
- Namespace Volumes: Each Namenode manages its own part of the file system (e.g., one for /user, one for /data).
- Block Pool: Each Namenode owns a block pool, the set of blocks that belong to its namespace. Every Datanode stores blocks from all block pools, and the Namenodes do not coordinate with each other.
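Because each Namenode owns only part of the namespace, something has to decide which Namenode a given path belongs to. The sketch below shows one way this routing could work, similar in spirit to a client-side mount table such as ViewFs. The host names and the `resolve_namenode` helper are hypothetical and chosen only to match the /user and /data example above.

```python
# Hypothetical mount table: path prefix -> Namenode responsible for it.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/data": "namenode-2",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the Namenode that owns the longest mount prefix matching path."""
    matches = [
        prefix
        for prefix in mount_table
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        raise LookupError(f"no Namenode is mounted for {path}")
    return mount_table[max(matches, key=len)]


print(resolve_namenode("/user/alice/report.csv"))  # prints: namenode-1
print(resolve_namenode("/data/sales/2024"))        # prints: namenode-2
```

Matching on whole path components (`prefix + "/"`) keeps a path such as /userdata from being routed to the /user namespace by accident.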
The Modern Technology Stack
| Technology | Role | Key Feature |
|---|---|---|
| Apache Spark | Processing | Up to 100x faster than MapReduce on some workloads because it keeps intermediate data "In-Memory." |
| Apache Kafka | Ingestion | Handles real-time event streams; the largest deployments process trillions of events per day. |
| NoSQL (MongoDB) | Database | Stores semi-structured and unstructured data as flexible documents, without needing a rigid schema. |
| Cloud (AWS/GCP) | Infrastructure | Provides "Elastic" hardware on demand. |
Open Source and Big Data
The Big Data world is dominated by Open Source.
- Why? Because the field moves too fast for any single company to own.
- Apache Software Foundation: The home for most key projects (Hadoop, Spark, Hive, Cassandra, etc.).
- Community Drive: Thousands of developers globally contribute code, ensuring the tools remain cutting-edge and free to use.
1.6.4 The Commercial Landscape: Big Data Vendors
Since Apache Hadoop is complex to set up, several companies created pre-packaged "Distributions" that include security, management tools, and support.
| Vendor | Distribution | Specialized Feature |
|---|---|---|
| Cloudera | CDH (Cloudera Distribution including Hadoop) | Focused on enterprise security and "Cloudera Manager." |
| Hortonworks | HDP (Hortonworks Data Platform) | Famous for being 100% open source without proprietary extensions. |
| MapR | MapR Converged Data Platform | Used a custom C++ based file system (MapR-FS) instead of HDFS for speed. |
| Cloud Providers | AWS EMR / Azure HDInsight | Managed Hadoop services that scale on-demand. |
1.6.5 Case Study: Walmart's Retail Intelligence
Walmart uses Big Data to manage a supply chain of over 11,000 stores.
- The Problem: How to ensure that snow shovels are on the shelves before a blizzard hits, without overstocking and wasting money?
- The Solution: By analyzing 200 billion rows of data daily—from weather forecasts to past local purchase history—Walmart uses Hadoop clusters to predict demand at a hyper-local level.
- Impact: Walmart aligned product stocking more closely with social media trends, with a reported 10-15% increase in online sales conversion.
1.6.6 The Evolution: From Mainframes to Hadoop
Big Data didn't replace traditional computing; it evolved from it to solve specific volume problems.
| Feature | Mainframe / Legacy SAN | Hadoop (Big Data) |
|---|---|---|
| Storage Model | Centralized, high-end storage. | Distributed, commodity hardware. |
| Cost | Expensive ($$$ per Gigabyte). | Cheap ($ per Terabyte). |
| Scalability | Vertical (Scaling Up). | Horizontal (Scaling Out). |
| Failure Handling | Redundant hardware (RAID). | Redundant software (Replication). |
| Data Locality | Data moves to the code. | Code moves to the data. |
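The "Code moves to the data" row can be illustrated with a simplified scheduling sketch: when a task needs a block, prefer a free machine that already stores a replica of it, and fall back to a remote machine only if none is free. The block map, node names, and `pick_node` function are hypothetical, and real schedulers also weigh factors such as rack placement that are left out here.

```python
# Hypothetical replica map: block id -> nodes that hold a copy of it.
BLOCK_LOCATIONS = {
    "block-A": ["node1", "node2", "node3"],
    "block-B": ["node2", "node4", "node5"],
}


def pick_node(block_id, free_nodes, locations=BLOCK_LOCATIONS):
    """Prefer a free node that already stores the block; otherwise use any free node."""
    if not free_nodes:
        raise RuntimeError("no free nodes available")
    for node in locations[block_id]:
        if node in free_nodes:
            return node, "local"      # the task runs where the data already is
    return min(free_nodes), "remote"  # the block must be read over the network


print(pick_node("block-A", {"node3", "node4"}))  # prints: ('node3', 'local')
print(pick_node("block-B", {"node1"}))           # prints: ('node1', 'remote')
```

Replication helps here as well as with failures: with several copies of each block, the scheduler has several chances to find a free machine that holds the data locally.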
- Reduced Bandwidth: Sending only the "Alert" instead of 24/7 video streams.
1.6.8 Data Storage Architecture Evolution
Modern enterprises are moving beyond simple Hadoop clusters toward integrated architectures.
| Architecture | Storage | Schema | Key Use Case |
|---|---|---|---|
| Data Warehouse | Structured (RDBMS) | Schema-on-Write | Business Intelligence (BI) and Reporting. |
| Data Lake | Raw (HDFS/S3) | Schema-on-Read | Data Science and Machine Learning. |
| Data Lakehouse | Structured tables on Raw storage | Schema enforced through a metadata layer | BI and Machine Learning on the same copy of the data. |
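The difference between Schema-on-Write and Schema-on-Read in the table above comes down to when validation happens. The sketch below contrasts the two in plain Python, under the assumption of a small hypothetical sales schema; the function names are illustrative and do not correspond to any particular warehouse or lake product.

```python
import json

# Hypothetical schema for a sales record: field name -> required type.
SCHEMA = {"store_id": int, "item": str, "units": int}


def validate(record, schema=SCHEMA):
    """Raise ValueError unless record has exactly the schema's fields and types."""
    if not isinstance(record, dict) or set(record) != set(schema):
        raise ValueError("record does not match the schema fields")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return record


def warehouse_write(table, record):
    """Schema-on-write: reject a bad record before it is stored."""
    table.append(validate(record))


def lake_write(lake, raw_line):
    """Schema-on-read: store the raw text untouched."""
    lake.append(raw_line)


def lake_read(lake):
    """Apply the schema only at read time, skipping lines that do not fit."""
    rows = []
    for raw_line in lake:
        try:
            rows.append(validate(json.loads(raw_line)))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
    return rows


lake = []
lake_write(lake, '{"store_id": 7, "item": "snow shovel", "units": 40}')
lake_write(lake, "free-form log line that is not JSON")
print(len(lake), len(lake_read(lake)))  # prints: 2 1
```

The lake accepts everything and defers the cost of bad data to each reader, while the warehouse pays that cost once at load time. A lakehouse aims to add warehouse-style enforcement on top of lake-style storage.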
1.6.9 Big Data Governance & Security
Processing data is easy; securing it is hard.
- Apache Atlas: Provides metadata management and data lineage (tracking where data came from and how it was transformed along the way).
- Apache Ranger: A centralized security framework to manage fine-grained access control across the whole Hadoop ecosystem.
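To make "fine-grained access control" concrete, the sketch below evaluates simple allow policies keyed on a resource path, a set of groups, and a set of permissions. This is an illustration of the general idea only and is not Apache Ranger's API or policy format: the `Policy` class, the example policies, and `is_allowed` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    resource_prefix: str    # e.g. an HDFS path prefix
    groups: frozenset       # groups the policy applies to
    permissions: frozenset  # e.g. {"read", "write"}


# Hypothetical policies for illustration.
POLICIES = [
    Policy("/data/sales", frozenset({"analysts"}), frozenset({"read"})),
    Policy("/data/sales", frozenset({"etl"}), frozenset({"read", "write"})),
]


def is_allowed(user_groups, resource, permission, policies=POLICIES):
    """Deny by default; allow only if a policy covers the resource, a group, and the permission."""
    for policy in policies:
        covers = (
            resource == policy.resource_prefix
            or resource.startswith(policy.resource_prefix + "/")
        )
        if covers and policy.groups & set(user_groups) and permission in policy.permissions:
            return True
    return False


print(is_allowed({"analysts"}, "/data/sales/2024.csv", "read"))   # prints: True
print(is_allowed({"analysts"}, "/data/sales/2024.csv", "write"))  # prints: False
```

Keeping the policies in one central list, rather than scattered across individual tools, is the property the text attributes to a centralized security framework.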