Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

2.6 The Map-Reduce Computational Model

Lesson 15 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

2.6.1 The Philosophy of Map-Reduce

Map-Reduce is a programming model designed to process vast amounts of data in parallel by splitting the task across a cluster.

2.6.2 The Three Main Phases

  1. Map Phase:
  • Takes each input record and emits zero or more Key-Value pairs.
  • Example: Filtering out invalid records, so that they emit nothing.
  2. Shuffle/Sort Phase:
  • The framework (Hadoop) sorts the pairs by key and moves all values with the same key to the same machine.
  3. Reduce Phase:
  • Aggregates the values of each key into a result.
  • Example: Summing up all the scores for a specific User ID.
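The three phases can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample records are assumptions made for this example, and a real cluster would run each phase on many machines.

```python
from collections import defaultdict

def map_phase(records):
    """Map: turn each raw record into a (key, value) pair, dropping invalid ones."""
    for user_id, score in records:
        if user_id is None or score is None:
            continue  # filtering: an invalid record emits nothing
        yield user_id, score

def shuffle_phase(pairs):
    """Shuffle/Sort: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: aggregate the values of each key (here, a sum of scores)."""
    return {key: sum(values) for key, values in grouped}

# Hypothetical (user_id, score) records; two of them are invalid.
records = [("u1", 10), ("u2", 5), (None, 99), ("u1", 7), ("u2", None)]
totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)  # {'u1': 17, 'u2': 5}
```

Note that the reducer only ever sees one key together with all of its values, which is what lets different keys be reduced on different machines.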

2.6.3 Partitioning and Combining

  • Partitioning: Deciding which "Reducer" machine gets which key. Usually done using a hash function.
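  • Combining: A "Mini-Reducer" that runs on the Mapper machine. It pre-aggregates local results (for example, summing counts) before they are sent across the network, which cuts shuffle traffic. This is safe only when the reduce operation is associative and commutative, such as a sum or a maximum.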
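Both ideas can be sketched in a few lines of Python. The names `partition` and `combine`, the choice of CRC32 as the hash, and the reducer count of 3 are assumptions for illustration only; they are not the Hadoop API. A stable hash is used because Python's built-in `hash()` for strings changes between processes.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Pick a reducer index for a key using a stable hash (CRC32, an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def combine(pairs):
    """Combiner: pre-sum (key, count) pairs on the mapper side before the shuffle."""
    local = Counter()
    for key, count in pairs:
        local[key] += count
    return sorted(local.items())

# Hypothetical output of one mapper: four pairs shrink to two after combining.
mapper_output = [("cat", 1), ("dog", 1), ("cat", 1), ("cat", 1)]
combined = combine(mapper_output)
print(combined)  # [('cat', 3), ('dog', 1)]

# Route each combined pair to one of 3 hypothetical reducers.
buckets = {r: [] for r in range(3)}
for key, count in combined:
    buckets[partition(key, 3)].append((key, count))
```

Because the same key always hashes to the same index, every mapper sends its "cat" pairs to the same reducer without any coordination between mappers.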

2.6.4 Composing Map-Reduce Calculations

In complex Big Data pipelines, one Map-Reduce job is usually not enough. We use Composition:

  • Chaining: The output of Job A becomes the input of Job B.
  • Workflows: Tools like Apache Oozie or Airflow manage complex graphs of Map-Reduce jobs where some jobs run in parallel and others wait for results.

Example: Analyzing Twitter Trends

  • Job 1: Remove stop words (the, is, and) and map hashtag -> 1.
  • Job 2: Count the frequency of each hashtag.
  • Job 3: Sort the hashtags by frequency to find the Top 10.
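The three jobs above can be chained in a small in-memory sketch, where the output list of one job is passed as the input of the next. The helper `run_job`, the mapper and reducer names, and the sample tweets are all hypothetical; a real pipeline would write each stage to distributed storage and would be scheduled by a workflow tool.

```python
from collections import defaultdict

STOP_WORDS = {"the", "is", "and"}  # the stop words listed in the example

def run_job(records, mapper, reducer):
    """Run one in-memory map -> shuffle -> reduce pass and return its output pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # the shuffle delivers keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

def job1_mapper(tweet):
    """Job 1: drop stop words and emit (hashtag, 1)."""
    for token in tweet.lower().split():
        if token not in STOP_WORDS and token.startswith("#"):
            yield token, 1

def identity_mapper(pair):
    yield pair

def identity_reducer(key, values):
    for value in values:
        yield key, value

def sum_reducer(key, values):
    """Job 2: count the frequency of each hashtag."""
    yield key, sum(values)

def job3_mapper(pair):
    """Job 3: use (-count, tag) as the key so sorted key order is descending frequency."""
    tag, count = pair
    yield (-count, tag), None

def job3_reducer(key, values):
    neg_count, tag = key
    yield tag, -neg_count

tweets = [
    "The launch is live #python and #bigdata",
    "#python is fun",
    "Learning #hadoop and #python",
    "#bigdata is everywhere",
]
stage1 = run_job(tweets, job1_mapper, identity_reducer)
stage2 = run_job(stage1, identity_mapper, sum_reducer)
stage3 = run_job(stage2, job3_mapper, job3_reducer)
top10 = stage3[:10]
print(top10)  # [('#python', 3), ('#bigdata', 2), ('#hadoop', 1)]
```

Job 3 relies on the shuffle's sort instead of sorting inside a reducer, which is why the count is moved into the key.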

2.6.5 Advanced Map-Reduce Patterns

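  1. In-Mapper Combining: Instead of emitting every single "1" and leaving the summing to a combiner, the Mapper keeps a local hash map in memory and emits only the final tally for its input block. This sends fewer pairs through the combiner and the network, at the cost of holding the tally in the Mapper's memory.
  2. Secondary Sorting: A technique that makes the Reducer receive not just sorted keys, but also sorted values for each key. This is done by creating a "Composite Key" that includes the value as part of the sorting logic, while partitioning and grouping still use only the original key so that all of its values reach the same Reducer call.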
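Both patterns can be illustrated with a short single-process sketch. The function names and the sample data are assumptions for this example; in Hadoop the secondary sort would be configured through a custom partitioner and comparators, which are only imitated here with `sorted` and `groupby`.

```python
from collections import defaultdict
from itertools import groupby

def in_mapper_combining(lines):
    """Tally words in a local dict and emit one (word, count) pair per word for the block."""
    tally = defaultdict(int)
    for line in lines:
        for word in line.split():
            tally[word] += 1  # nothing is emitted yet
    return sorted(tally.items())  # emit the final tally once, at the end of the block

def secondary_sort(records):
    """Sort by the composite key (user, timestamp), then group by the natural key (user)."""
    composite_sorted = sorted(records, key=lambda r: (r[0], r[1]))
    return {
        user: [(ts, event) for _, ts, event in group]
        for user, group in groupby(composite_sorted, key=lambda r: r[0])
    }

# Hypothetical input block of two lines: five words become two emitted pairs.
block = ["a b a", "b a"]
print(in_mapper_combining(block))  # [('a', 3), ('b', 2)]

# Hypothetical (user, timestamp, event) records arriving out of order.
events = [("u2", 5, "logout"), ("u1", 9, "click"), ("u1", 2, "login"), ("u2", 1, "login")]
print(secondary_sort(events))
# {'u1': [(2, 'login'), (9, 'click')], 'u2': [(1, 'login'), (5, 'logout')]}
```

In the second function each user's events arrive already ordered by timestamp, so the reducer logic does not need to buffer and sort the values itself.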

Unit II Checklist:

  • [x] Explain the "Impedance Mismatch" problem.
  • [x] Differentiate between Key-Value, Document, and Graph models.
  • [x] Explain Sharding and Replication (Master-Slave vs Peer-to-Peer).
  • [x] Define the CAP Theorem and Eventual Consistency (BASE).
  • [x] Describe the Map, Shuffle, and Reduce phases of computation.