4.6.1 Output Formats: How Hadoop Writes Data
The OutputFormat defines how the final key-value pairs from the Reducers are written to HDFS.
- TextOutputFormat: The default. Writes each pair as a line in a text file (Key <tab> Value).
- SequenceFileOutputFormat: Writes in binary format. Best if the output is going to be the input of another Hadoop job.
- LazyOutputFormat: Only creates an output file if at least one record is actually written. This prevents thousands of empty `part-r-0000` files.
- MultipleOutputs: Allows a single Reducer to write to different files based on the data (e.g., "Success" records to one file, "Error" records to another).
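To make the last two ideas concrete, here is a minimal plain-Java sketch. It does not use the Hadoop API: the class `LazyNamedOutputs` and its methods are hypothetical names invented for this illustration, and in-memory `StringWriter` objects stand in for HDFS files. It shows records being routed to named outputs, with each "file" created only when its first record arrives.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration only; not the Hadoop MultipleOutputs or LazyOutputFormat API. */
class LazyNamedOutputs {
    // One in-memory "file" per named output, created on first write only.
    private final Map<String, StringWriter> files = new LinkedHashMap<>();

    /** Writes one record as "key<TAB>value\n", the layout TextOutputFormat uses. */
    void write(String outputName, String key, String value) {
        files.computeIfAbsent(outputName, n -> new StringWriter())
             .write(key + "\t" + value + "\n");
    }

    /** Routes a record by its status, the way a Reducer might pick a named output. */
    void route(String key, String status) {
        write(status.equals("OK") ? "success" : "error", key, status);
    }

    Map<String, StringWriter> files() {
        return files;
    }

    public static void main(String[] args) {
        LazyNamedOutputs out = new LazyNamedOutputs();
        out.route("order-1", "OK");
        out.route("order-2", "OK");
        // No "error" record was written, so no "error" file was ever created.
        System.out.println(out.files().keySet()); // prints [success]
    }
}
```

In a real job the same effect comes from the framework classes named in the list above; this sketch only mirrors their behaviour.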
4.6.2 Preventing Data Skew
Data Skew occurs when one Reducer receives most of the data (say, 90%) while the others sit nearly idle. The job cannot finish until that overloaded Reducer does; this is the "Long Tail" problem.
- Cause: Using a key that isn't evenly distributed (e.g., partition by "Country" where 90% of users are from "India").
- Solution: Use a more granular key, add a random salt to hot keys in the Mapper, or write a custom Partitioner that spreads hot keys across several Reducers.
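The cause can be seen in a few lines of plain Java. The sketch below applies the same formula as Hadoop's default HashPartitioner, `(hashCode & Integer.MAX_VALUE) % numReducers`, to ordinary Java strings. The class name `SkewDemo` and the sample country data are assumptions made for this illustration, and a real job would hash `Text` keys rather than `String`, so the exact reducer numbers would differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of hash partitioning; no Hadoop dependency. */
class SkewDemo {
    /** Same formula as Hadoop's default HashPartitioner, applied to a Java String key. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Counts how many records each reducer would receive. */
    static int[] load(List<String> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String k : keys) {
            counts[partition(k, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 90; i++) keys.add("India");
        for (int i = 0; i < 5; i++) { keys.add("Brazil"); keys.add("Japan"); }
        // Every "India" record hashes to the same reducer, so one count is at least 90.
        System.out.println(Arrays.toString(load(keys, 4)));
    }
}
```

Because identical keys always produce identical hashes, adding more Reducers does not help: the hot key still lands on exactly one of them.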
4.6.3 Counter Implementation
Hadoop Counters provide a way to gather statistics across the whole cluster.
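In a real job, a task increments a counter with a call such as `context.getCounter(MyCounter.MALFORMED).increment(1)`, and the framework sums the per-task values into job-wide totals. The plain-Java sketch below imitates that flow without any Hadoop dependency. The enum `RecordCounter`, the class `CounterSketch`, and the "exactly two comma-separated fields" validity rule are all assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of per-task counters summed into job totals. */
class CounterSketch {
    enum RecordCounter { VALID, MALFORMED }

    /** One task: counts valid and malformed lines in its own input split. */
    static Map<RecordCounter, Long> runTask(List<String> lines) {
        Map<RecordCounter, Long> counters = new EnumMap<>(RecordCounter.class);
        for (String line : lines) {
            // Assumed rule for this sketch: a valid line has exactly two comma-separated fields.
            RecordCounter c = line.split(",", -1).length == 2
                    ? RecordCounter.VALID
                    : RecordCounter.MALFORMED;
            counters.merge(c, 1L, Long::sum);
        }
        return counters;
    }

    /** The framework's role: sum the per-task counters into job-wide totals. */
    static Map<RecordCounter, Long> aggregate(List<Map<RecordCounter, Long>> perTask) {
        Map<RecordCounter, Long> totals = new EnumMap<>(RecordCounter.class);
        for (Map<RecordCounter, Long> task : perTask) {
            task.forEach((k, v) -> totals.merge(k, v, Long::sum));
        }
        return totals;
    }
}
```

Counters are best kept to a small number of job-level statistics (such as bad-record counts); they are not a substitute for the job's actual output.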
4.6.4 Advanced Joining Techniques
Joining two large datasets is the most common complex task in Big Data.
1. Map-Side Join (The Broadcast Join)
Used when one dataset is small enough to fit in memory.
- Mechanism: The small dataset is shipped to every mapper in the cluster (e.g., via Hadoop's DistributedCache). Each mapper loads it into an in-memory HashMap. As the mapper reads the large dataset, it looks up each join key in that HashMap.
- Pros: No Shuffle, no Sort. When it applies, it is usually the fastest join.
- Requirement: The small dataset must fit in each mapper's memory. (A different map-side technique, the merge join, instead requires both inputs to be pre-partitioned and sorted by the join key so that matching records can be processed locally.)
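The mechanism above can be sketched in plain Java. This is not Hadoop code: `MapSideJoinSketch` is a hypothetical class whose `setup` and `map` methods only mimic the roles of a Mapper's `setup()` and `map()`, and the `userId,userName` and `userId,orderDetails` line layouts are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of a broadcast (map-side) join; no Hadoop dependency. */
class MapSideJoinSketch {
    private final Map<String, String> userNameById = new HashMap<>();

    /** Plays the role of Mapper.setup(): load the small table into memory once per task. */
    void setup(List<String> smallTableLines) {
        for (String line : smallTableLines) {
            String[] f = line.split(",", 2);   // assumed layout: userId,userName
            userNameById.put(f[0], f[1]);
        }
    }

    /** Plays the role of map(): one large-table record in, zero or one joined record out. */
    void map(String orderLine, List<String> output) {
        String[] f = orderLine.split(",", 2);  // assumed layout: userId,orderDetails
        String name = userNameById.get(f[0]);
        if (name != null) {                    // inner join: skip orders with no matching user
            output.add(f[0] + "\t" + name + "\t" + f[1]);
        }
    }
}
```

Each record of the large dataset is joined with a single hash lookup, which is why no shuffle or sort is needed.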
2. Reduce-Side Join (The Standard Join)
Used when both datasets are massive.
- Mechanism: Mappers read both datasets and tag each record with its source (e.g., Tag 1 for UserData, Tag 2 for OrderData). They emit the join key as the key.
- Shuffle: Hadoop ensures all records with the same join key (from both sources) end up on the same Reducer.
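- Reducer: The Reducer receives a list of records for a key and performs the join logic across the different tags.
- Cons: The Shuffle and Sort move both datasets across the network, which makes this join expensive.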
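The three steps above (tag, shuffle, join) can be imitated in a single process. In this plain-Java sketch a sorted `TreeMap` stands in for Hadoop's shuffle, and the class name `ReduceSideJoinSketch`, the tags `U` and `O`, and the `joinKey,payload` line layout are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of a reduce-side join; no Hadoop dependency. */
class ReduceSideJoinSketch {
    static final String USER = "U";   // tag for UserData
    static final String ORDER = "O";  // tag for OrderData

    /** Map phase: emit (joinKey, tag + ":" + payload) into the shuffle structure. */
    static void map(String tag, String line, Map<String, List<String>> shuffle) {
        String[] f = line.split(",", 2);  // assumed layout: joinKey,payload
        shuffle.computeIfAbsent(f[0], k -> new ArrayList<>()).add(tag + ":" + f[1]);
    }

    /** Reduce phase: for one key, pair every user record with every order record. */
    static List<String> reduce(String key, List<String> values) {
        List<String> users = new ArrayList<>();
        List<String> orders = new ArrayList<>();
        for (String v : values) {
            // Strip the one-character tag and the ':' separator before storing the payload.
            (v.startsWith(USER + ":") ? users : orders).add(v.substring(2));
        }
        List<String> joined = new ArrayList<>();
        for (String u : users) {
            for (String o : orders) {
                joined.add(key + "\t" + u + "\t" + o);
            }
        }
        return joined;
    }

    /** Runs map, shuffle (a sorted map stands in for it), and reduce in one process. */
    static List<String> run(List<String> userLines, List<String> orderLines) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String l : userLines) map(USER, l, shuffle);
        for (String l : orderLines) map(ORDER, l, shuffle);
        List<String> out = new ArrayList<>();
        shuffle.forEach((k, vs) -> out.addAll(reduce(k, vs)));
        return out;
    }
}
```

This version performs an inner join: a key that appears in only one dataset produces no output. Note that the reducer buffers all records for a key in memory, which is exactly where a hot key causes trouble.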
4.6.5 Troubleshooting Data Skew (Hot Keys)
If you see one reducer taking 10x longer than others, you have Data Skew.
- The Salted Key Strategy: Append a random integer (e.g., 0-9) to the key on the Map side. This forces Hadoop to distribute that one "hot" key across up to 10 different reducers. In a second job, you remove the salt and aggregate the partial results.
- Custom Partitioning: You can write a Java class that extends `Partitioner` to manually decide which keys go to which reducers, effectively balancing the load.
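The salted key strategy can be sketched as two small stages in plain Java. Nothing here is Hadoop API: `SaltedCountSketch`, the `#` separator, and the choice of a simple count as the aggregation are assumptions made for this illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Plain-Java illustration of two-stage aggregation with salted keys. */
class SaltedCountSketch {
    static final int SALT_BUCKETS = 10;

    /** Job 1, map side: turn "India" into one of "India#0" .. "India#9". */
    static String salt(String key, Random rnd) {
        return key + "#" + rnd.nextInt(SALT_BUCKETS);
    }

    /** Job 2, map side: strip the salt so partial results share one key again. */
    static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }

    /** Job 1: count records per salted key (each salted key can go to a different reducer). */
    static Map<String, Long> countSalted(List<String> keys, Random rnd) {
        Map<String, Long> partial = new HashMap<>();
        for (String k : keys) {
            partial.merge(salt(k, rnd), 1L, Long::sum);
        }
        return partial;
    }

    /** Job 2: remove the salt and add up the partial counts. */
    static Map<String, Long> mergePartials(Map<String, Long> partial) {
        Map<String, Long> totals = new HashMap<>();
        partial.forEach((k, v) -> totals.merge(unsalt(k), v, Long::sum));
        return totals;
    }
}
```

This two-stage approach works directly for aggregations that can be combined from partial results, such as counts and sums. For a join, the records from the other dataset must also be replicated once per salt value so that every salted key can still find its match.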
---
Unit IV Checklist:
- [x] Explain the shift from Classic MapReduce (JobTracker) to YARN (ResourceManager).
- [x] Describe the three pillars of YARN: RM, NM, and AM.
- [x] Define "Speculative Execution" and why it is used.
- [x] List the steps of the Shuffle and Sort process in order.
- [x] Differentiate between TextInputFormat and SequenceFileInputFormat.
- [x] Explain how to handle "Data Skew" using a custom Partitioner.
---
4.6.6 Summary: MapReduce vs Apache Spark
While MapReduce was the pioneer, many organizations are moving to Spark.
| Feature | MapReduce | Apache Spark |
|---|---|---|
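| Speed | Slower (writes intermediate results to disk between phases). | Often much faster, especially for iterative workloads (keeps intermediate data in memory; the Spark project cites up to 100x). |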
| Model | Only Map and Reduce. | General DAG (Transformations & Actions). |
| Complexity | Harder to write (Verbose Java). | Easier (Scala, Python, Java). |
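| Iteration | Each job rereads its input from disk, so iterative algorithms are slow. | Caches data in memory across iterations, which suits iterative algorithms (ML). |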
4.6.7 MapReduce Best Practices
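- Avoid Small Files: Every file consumes NameNode memory, and each small file typically gets its own mapper, which slows the job down.
- Tune `mapreduce.task.io.sort.mb` (named `io.sort.mb` in Hadoop 1.x): Larger sort buffers reduce the number of spills to disk.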
- Use Compression: Snappy or LZO reduces network I/O during shuffle.
- Balance the Reducers: Use custom partitioners to prevent data skew.
4.6.9 Troubleshooting MapReduce Pipelines
When a job fails on a 1000-node cluster, finding the "Needle in the Haystack" is key:
- ApplicationMaster Logs: The first place to look. It tracks task attempts and container allocations.
- The Web UI (Port 19888): The "JobHistory Server" allows you to see exactly which node failed and view the specific `stderr` and `syslog` for that task.
- Data Skew Analysis: Check if the "Elapsed Time" for one reducer is significantly higher than others.
- Memory Tuning: If you see `java.lang.OutOfMemoryError`, you may need to increase `mapreduce.map.memory.mb` or `mapreduce.reduce.memory.mb`, along with the matching JVM heap setting (`mapreduce.map.java.opts` or `mapreduce.reduce.java.opts`).
4.6.10 MapReduce Framework Comparison
| Feature | Hadoop MapReduce | Apache Tez | Apache Spark |
|---|---|---|---|
| Iterative Jobs | Poor | Good | Excellent |
4.6.11 The MapReduce Ecosystem: Libraries
You don't always have to write raw Java. Several libraries build on top of MapReduce:
- Apache Mahout: A library for scalable Machine Learning (Clustering, Classification) whose original algorithms run on MapReduce.
- Apache Giraph: An iterative graph processing system built on top of Hadoop (modeled after Google's Pregel).
- Apache Phoenix: Adds a SQL layer on top of HBase (NoSQL). Queries run through HBase scans and coprocessors; MapReduce is used mainly for bulk loading and for building indexes.
4.6.12 MapReduce Quick Tips for Developers
- Start Small: Test your mappers and reducers on a small local dataset before deploying to the cluster.
- Monitor the Sort: If your shuffle phase is the bottleneck, check if your keys are too large or if you have data skew.
- Cleanup: Always use the `cleanup()` method to close database connections or file handles.
4.7 Real-World Case Study: Hadoop at Twitter
Twitter uses Hadoop to process trillions of events and hundreds of petabytes of data for:
- Search Indexing: Using MapReduce to build inverted indexes of every tweet.
- User Recommendations: Analyzing the "Social Graph" to suggest who to follow.
- Ad Targeting: Calculating real-time engagement metrics for advertisers.
4.8 Further Reading & Certifications
To continue your Big Data journey, consider exploring:
- Cloudera Certified Associate (CCA): Focuses on core Hadoop and Spark skills.
- Google Professional Data Engineer: Focuses on Cloud-native Big Data (BigQuery, Dataflow).
- AWS Certified Data Analytics: Specializes in EMR, Redshift, and Kinesis.
FINAL COURSE SUMMARY: BIG DATA - 1
Through these four units, we have traversed the landscape of modern Big Data. We started with the theoretical underpinnings and the 5 Vs, moved into the world of NoSQL data models, studied the architecture of HDFS, and finally took a deep dive into the execution engine of MapReduce and YARN. This foundation prepares you for advanced topics like real-time processing with Spark and streaming analytics with Kafka.