Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Privacy Policy | Terms of Service | Contact Siksha Sarovar | About Siksha Sarovar

v4.0.9 · PWA
Siksha Sarovar logo
Siksha Sarovar
Your Learning Universe

Siksha Sarovar is a free e-learning platform for coding courses, BCA university notes and competitive exam preparation. Optional Google sign-in saves your learning progress across devices.

Initializing knowledge base…
Compiling modules 0%

4.6 Output Formats & Advanced Job Optimization

Lesson 29 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

4.6.1 Output Formats: How Hadoop Writes Data

The OutputFormat defines how the final key-value pairs from the Reducers are written to HDFS.

  1. TextOutputFormat: The default. Writes each pair as a line in a text file (Key <tab> Value).
  2. SequenceFileOutputFormat: Writes in binary format. Best if the output is going to be the input of another Hadoop job.
  3. LazyOutputFormat: Only creates an output file if at least one record is actually written. This prevents thousands of empty part-r-0000 files.
  4. MultipleOutputs: Allows a single Reducer to write to different files based on the data (e.g., "Success" records to one file, "Error" records to another).

4.6.2 Preventing Data Skew

Data Skew occurs when one Reducer gets 90% of the data while the others are idle. This is the "Long Tail" problem.

  • Cause: Using a key that isn't evenly distributed (e.g., partition by "Country" where 90% of users are from "India").
  • Solution: Use a more granular key or a custom Partitioner that adds a random salt to hot keys.

4.6.3 Counter Implementation

Hadoop Counters provide a way to gather statistics across the whole cluster.

4.6.4 Advanced Joining Techniques

Joining two large datasets is the most common complex task in Big Data.

1. Map-Side Join (The Broadcast Join)

Used when one dataset is small enough to fit in memory.

  • Mechanism: The small dataset is sent to every mapper in the cluster. Each mapper loads it into a Hashmap (e.g., using Hadoop's DistributedCache). As the mapper reads the large dataset, it looks up the keys in the memory hashmap.
  • Pros: No Shuffle, no Sort. It is the fastest possible join.
  • Requirement: Data must be stored in a way that allows it to be processed locally (e.g., pre-partitioned and sorted by join key).

2. Reduce-Side Join (The Standard Join)

Used when both datasets are massive.

  • Mechanism: Mappers read both datasets and tag each record with its source (e.g., Tag 1 for UserData, Tag 2 for OrderData). They emit the join key as the key.
  • Shuffle: Hadoop ensures all records with the same join key (from both sources) end up on the same Reducer.
  • Reducer: The Reducer receives a list of records for a key and performs the join logic across the different tags.

4.6.5 Troubleshooting Data Skew (Hot Keys)

If you see one reducer taking 10x longer than others, you have Data Skew.

  • The Salted Key Strategy: Append a random integer (e.g., 0-9) to the key on the Map side. This forces Hadoop to distribute that one "hot" key across 10 different reducers. On the second job, you remove the salt and aggregate the partial results.
  • Custom Partitioning: You can write a Java class that implements Partitioner to manually decide which keys go to which reducers, effectively balancing the load.
  • Pros: No Shuffle phase! Extremely fast.

2. Reduce-Side Join

The standard way to join two large datasets.

  • How: Data from both sets is marked with a "Tag" and sent to the Reducers using the Join Key. The Reducer then groups and joins the records.
  • Cons: Expensive Shuffle and Sort.

4.6.5 Troubleshooting Data Skew

If one reducer is taking 2 hours while others take 2 minutes, you have Data Skew.

  • The Salted Key Pattern: Appending a random number (1 to 10) to the join key to force Hadoop to distribute the "hot" key across 10 different reducers.
  • Custom Partitioning: Overriding the default HashPartitioner to ensure specific logic for distribution.

--- Unit IV Checklist:

  • [x] Explain the shift from Classic MapReduce (JobTracker) to YARN (ResourceManager).
  • [x] Describe the three pillars of YARN: RM, NM, and AM.
  • [x] Define "Speculative Execution" and why it is used.
  • [x] List the steps of the Shuffle and Sort process in order.
  • [x] Differentiate between TextInputFormat and SequenceFileInputFormat.
  • [x] Explain how to handle "Data Skew" using a custom Partitioner.

---

4.6.6 Summary: MapReduce vs Apache Spark

While MapReduce was the pioneer, many organizations are moving to Spark.

FeatureMapReduceApache Spark
SpeedSlower (Writes to disk between phases).Up to 100x faster (Processes in-memory).
ModelOnly Map and Reduce.General DAG (Transformations & Actions).
ComplexityHarder to write (Verbose Java).Easier (Scala, Python, Java).
StateStateless.Supports iterative algorithms (ML).

4.6.7 MapReduce Best Practices

  1. Avoid Small Files: They kill Namenode memory and slow down mappers.
  2. Tune io.sort.mb: Larger buffers reduce the number of spills to disk.
  3. Use Compression: Snappy or LZO reduces network I/O during shuffle.
  4. Balance the Reducers: Use custom partitioners to prevent data skew.

4.6.9 Troubleshooting MapReduce Pipelines

When a job fails on a 1000-node cluster, finding the "Needle in the Haystack" is key:

  1. ApplicationMaster Logs: The first place to look. It tracks task attempts and container allocations.
  2. The Web UI (Port 19888): The "JobHistory Server" allows you to see exactly which node failed and view the specific stderr and syslog for that task.
  3. Data Skew Analysis: Check if the "Elapsed Time" for one reducer is significantly higher than others.
  4. Memory Tuning: If you see java.lang.OutOfMemoryError, you may need to increase mapreduce.map.memory.mb or mapreduce.reduce.memory.mb.

4.6.10 MapReduce Framework Comparison

FeatureHadoop MapReduceApache TezApache Spark
Iterative JobsPoorGoodExcellent

4.6.11 The MapReduce Ecosystem: Libraries

You don't always have to write raw Java. Several libraries build on top of MapReduce:

  1. Apache Mahout: A library for scalable Machine Learning (Clustering, Classification) that runs on MapReduce.
  2. Apache Giraph: An iterative graph processing system built on top of Hadoop (modeled after Google's Pregel).
  3. Apache Phoenix: Adds a SQL layer on top of HBase (NoSQL) which uses MapReduce for heavy aggregates.

4.6.12 MapReduce Quick Tips for Developers

  • Start Small: Test your mappers and reducers on a small local dataset before deploying to the cluster.
  • Monitor the Sort: If your shuffle phase is the bottleneck, check if your keys are too large or if you have data skew.
  • Cleanup: Always use the cleanup() method to close database connections or file handles.

4.7 Real-World Case Study: Hadoop at Twitter

Twitter uses Hadoop to process trillions of events and hundreds of petabytes of data for:

  • Search Indexing: Using MapReduce to build inverted indexes of every tweet.
  • User Recommendations: Analyzing the "Social Graph" to suggest who to follow.
  • Ad Targeting: Calculating real-time engagement metrics for advertisers.

4.8 Further Reading & Certifications

To continue your Big Data journey, consider exploring:

  1. Cloudera Certified Associate (CCA): Focuses on core Hadoop and Spark skills.
  2. Google Professional Data Engineer: Focuses on Cloud-native Big Data (BigQuery, Dataflow).
  3. AWS Certified Data Analytics: Specializes in EMR, Redshift, and Kinesis.

FINAL COURSE SUMMARY: BIG DATA - 1

Through these four units, we have traversed the entire landscape of modern Big Data. We started with the theoretical underpinnings and the 5 Vs, moved into the complex world of NoSQL data models, mastered the architecture of HDFS, and finally deep-dived into the execution engine of MapReduce and YARN. This foundation prepares you for advanced topics like Real-time processing with Spark and Streaming analytics with Kafka.