Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

4.4 The Heart of MapReduce: Shuffle and Sort

Lesson 27 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

4.4.1 Understanding the Shuffle and Sort

Shuffle and Sort is the stage in which the output of the Mappers is partitioned, sorted, and moved to the Reducers. It is often the most expensive part of a job because it combines heavy network traffic, disk I/O, and CPU time for sorting.

1. The Map Side: The Ring Buffer

Every Map task writes its output to a circular in-memory buffer. By default this buffer is 100 MB, controlled by io.sort.mb (named mapreduce.task.io.sort.mb in Hadoop 2 and later).
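To see why the buffer size matters, the following rough sketch estimates how many spill files one Map task would write. It assumes the default 80% spill threshold described in the list below and ignores the bookkeeping metadata that also lives in the buffer, so treat the numbers as an approximation only. The function name and the sample sizes are made up for illustration.

```python
import math

def estimate_spill_count(map_output_mb, sort_mb=100, spill_percent=0.80):
    """Rough estimate of spill files for one map task (illustrative only)."""
    # Each spill flushes roughly sort_mb * spill_percent of data
    usable_mb = sort_mb * spill_percent
    return math.ceil(map_output_mb / usable_mb)

# Hypothetical map task producing 250 MB of output
default_spills = estimate_spill_count(250)              # 100 MB buffer
larger_spills = estimate_spill_count(250, sort_mb=200)  # 200 MB buffer
print(default_spills, larger_spills)
```

Fewer spill files mean less work in the final merge, which is the reasoning behind enlarging the buffer for large map outputs.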

  • The Spill Process: When the buffer reaches a threshold (80% by default, i.e. 80 MB of a 100 MB buffer), a background thread starts "spilling" the data to the local disk while the Map task keeps writing into the remaining space.
  • Sorting & Partitioning: Before a spill is written to disk, its records are partitioned (to decide which Reducer receives each one) and then sorted by key within each partition using an in-memory Quicksort.
  • Combiner: If a Combiner is specified, it runs on the sorted data in memory just before each spill is written, which shrinks both the spill files and the data later sent over the network.
  • Merging: A single Map task can produce many small spill files. Before it finishes, it merges them into a single partitioned and sorted output file.
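The map-side steps above can be sketched as a small simulation. This is a toy model, not Hadoop code: the buffer limit is counted in records rather than bytes, the partitioner is a simple stand-in for a hash partitioner, and all function names are invented for illustration.

```python
import heapq
from itertools import groupby

def partition(key, num_reducers):
    # Toy, deterministic stand-in for a hash partitioner (assumption)
    return sum(key.encode("utf-8")) % num_reducers

def spill(buffer, num_reducers, combiner=None):
    """Partition, sort, and optionally combine one buffer's worth of records."""
    # Sort by (partition, key), the order in which a spill file is laid out
    records = sorted(buffer, key=lambda kv: (partition(kv[0], num_reducers), kv[0]))
    if combiner is None:
        return [(partition(k, num_reducers), k, v) for k, v in records]
    out = []
    # Equal keys are adjacent after sorting, so groupby sees each key once
    for key, group in groupby(records, key=lambda kv: kv[0]):
        out.append((partition(key, num_reducers), key, combiner(v for _, v in group)))
    return out

def run_map_side(records, buffer_limit, num_reducers, combiner=None):
    """Simulate buffering, spilling, and the final merge for one map task."""
    spills, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= buffer_limit:  # spill threshold reached
            spills.append(spill(buffer, num_reducers, combiner))
            buffer = []
    if buffer:  # flush whatever is left when the map task ends
        spills.append(spill(buffer, num_reducers, combiner))
    # Every spill is already sorted, so a k-way merge keeps the order
    merged = list(heapq.merge(*spills))
    return spills, merged

# Word-count style input: (word, 1) pairs
pairs = [(w, 1) for w in ["b", "a", "b", "c", "a", "b"]]
spills, merged = run_map_side(pairs, buffer_limit=3, num_reducers=2, combiner=sum)
print(spills)
print(merged)
```

In this sketch the same key can still appear more than once in the merged output, because the Combiner only saw one spill at a time. The Reducer is what finally brings all values for a key together.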

2. The Reduce Side: Copy and Merge

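On the Reducer side, the "Shuffle" consists of a Copy phase and a Merge phase. Secondary Sorting, listed third, is an optional technique built on top of that sorted merge: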

  1. Copy Phase: As soon as a Map task finishes, the Reducers start pulling their respective partitions from that Mapper's local disk over HTTP. They do not wait for every Map task to complete.
  2. Merge Phase: As files arrive, the Reducer merges them together while maintaining the sort order. This happens in several "rounds": first in memory, then on disk.
  3. Secondary Sorting: Sometimes you need the values for a given key to arrive sorted. This is achieved with a Composite Key (natural key + value), a Partitioner that looks only at the natural key so that related records reach the same Reducer, and a custom Grouping Comparator that tells Hadoop that composite keys sharing a natural key belong to the same reduce() call, even though the full keys differ.
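The merge and secondary-sort ideas above can be illustrated with a short simulation. It is a sketch under stated assumptions, not Hadoop's implementation: each record is a composite key of (year, temperature), the sample data is made up, and it is assumed that a partitioner has already routed every record for a given year to this one Reducer.

```python
import heapq
from itertools import groupby

def sort_key(record):
    # Plays the role of the sort comparator on the composite key:
    # year ascending, then temperature descending
    year, temp = record
    return (year, -temp)

def grouping_key(record):
    # Plays the role of the grouping comparator: only the natural key counts
    return record[0]

def reduce_side(map_outputs):
    """Merge sorted map outputs, then group them by natural key."""
    # Merge phase: k-way merge of inputs that are already sorted
    merged = heapq.merge(*map_outputs, key=sort_key)
    # One group corresponds to one reduce() call
    for year, group in groupby(merged, key=grouping_key):
        yield year, [temp for _, temp in group]

# Hypothetical sorted outputs fetched from two map tasks
map_output_a = [(1950, 22), (1950, 0), (1951, 10)]
map_output_b = [(1950, 15), (1951, 31), (1951, 4)]

result = list(reduce_side([map_output_a, map_output_b]))
print(result)
```

Because the values reach each group already ordered, the first value in every group is the maximum temperature for that year, and no in-memory sort is needed inside the reduce step.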

Stage       | Resource Used          | Key Optimization
------------|------------------------|-------------------------------------------
Map Writing | Memory (100 MB buffer) | Increase io.sort.mb if records are large.
Spilling    | Local Disk I/O         | Use SSDs for local storage.
Shuffle     | Network Bandwidth      | Use Compression (e.g., LZO or Snappy).
Sort/Merge  | CPU / RAM              | Adjust memory allocated to the Reducer.
Reduce      | CPU / HDFS Write       | Optimize the logic of the reduce() function.
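As one way to apply some of these optimizations, the sketch below builds -D options that can be passed to a job whose driver uses Hadoop's generic option parsing (for example through ToolRunner). The property names are the Hadoop 2 and later names; the helper function and the default values are illustrative assumptions, not recommendations.

```python
def shuffle_tuning_args(sort_mb=200,
                        codec="org.apache.hadoop.io.compress.SnappyCodec",
                        reduce_mb=2048):
    """Build -D options for shuffle tuning (values are examples only)."""
    props = {
        # Larger map-side sort buffer means fewer spills
        "mapreduce.task.io.sort.mb": sort_mb,
        # Compress map output to cut network traffic during the shuffle
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec": codec,
        # Memory available to each reduce task for merging
        "mapreduce.reduce.memory.mb": reduce_mb,
    }
    args = []
    for name, value in props.items():
        args += ["-D", f"{name}={value}"]
    return args

args = shuffle_tuning_args()
print(" ".join(args))
```

The resulting options would be placed after the main class name on the hadoop jar command line. Suitable values depend on the cluster and the job, so they are best chosen by measuring spill counts and shuffle times in the job counters.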