Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

4.5 MapReduce Types and Input Formats

Lesson 28 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

4.5.1 The Types: Key-Value Pairs

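In Hadoop, every Mapper and Reducer is parameterized by its input and output key-value types, and the Mapper's output types (K2, V2) must match the Reducer's input types: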

  • Mapper: (K1, V1) -> list(K2, V2)
  • Reducer: (K2, list(V2)) -> list(K3, V3)
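The type flow above can be sketched without Hadoop at all. The following is a minimal plain-Java illustration, assuming a word-count job; the class name TypeFlowSketch and its methods are hypothetical stand-ins, not Hadoop classes, and the run method only imitates what the framework does between map and reduce.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Plain-Java sketch of the MapReduce type flow; none of these names are Hadoop classes.
class TypeFlowSketch {

    // Mapper: (K1, V1) -> list(K2, V2). Here K1 = byte offset, V1 = line, K2 = word, V2 = 1.
    static void map(Long offset, String line, BiConsumer<String, Integer> emit) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emit.accept(word.toLowerCase(), 1);
            }
        }
    }

    // Reducer: (K2, list(V2)) -> list(K3, V3). Here K3 = word, V3 = total count.
    static void reduce(String word, List<Integer> counts, BiConsumer<String, Integer> emit) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        emit.accept(word, sum);
    }

    // Stand-in for the framework: run map, group values by key (the shuffle), run reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            map(offset, line, (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, vs) -> reduce(k, vs, result::put));
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=2, ideas=1, wins=1}
        System.out.println(run(List.of("big data big ideas", "data wins")));
    }
}
```

In a real job the same roles would be played by Writable types such as LongWritable, Text and IntWritable instead of Long, String and Integer.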

Common Hadoop Writable Types:

  • IntWritable, LongWritable
  • Text (Hadoop's mutable, UTF-8 encoded counterpart to Java's String)
  • FloatWritable, DoubleWritable
  • NullWritable (A zero-byte placeholder, used when you only care about the key or only about the value)
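These types exist because Hadoop serializes keys and values into a compact binary form through two methods, write and readFields. The sketch below imitates that contract with a local interface so it runs on the JDK alone; SketchWritable and PageHits are hypothetical names chosen for illustration, and a real custom type would implement Hadoop's own Writable interface instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in that mirrors the two methods of Hadoop's Writable interface.
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical composite value: a page URL and its hit count.
class PageHits implements SketchWritable {
    String url = "";
    long hits;

    PageHits() { }

    PageHits(String url, long hits) {
        this.url = url;
        this.hits = hits;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);   // 2-byte length prefix followed by the encoded string
        out.writeLong(hits); // always 8 bytes
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();  // fields must be read in the order they were written
        hits = in.readLong();
    }

    // Serialize any SketchWritable to a byte array.
    static byte[] toBytes(SketchWritable w) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            w.write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild a PageHits by filling an empty instance, the way Hadoop reuses objects.
    static PageHits fromBytes(byte[] bytes) {
        try {
            PageHits p = new PageHits();
            p.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageHits copy = fromBytes(toBytes(new PageHits("/home", 42L)));
        System.out.println(copy.url + " " + copy.hits);
    }
}
```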

4.5.2 Input Formats: How Hadoop Reads Data

The InputFormat is the class that divides the input files into InputSplits and supplies the RecordReader that turns each split into key-value pairs for the Mapper.

  1. TextInputFormat: The default. It treats each line of each file as a new value (a Text). The key is the byte offset of the line from the start of the file (a LongWritable).
  2. CombineFileInputFormat: Designed to solve the "Small Files Problem." It bundles many small files into a single InputSplit so that one Mapper can process multiple files, reducing the total number of Mappers. It is an abstract class; CombineTextInputFormat is a ready-made subclass for line-based text.
  3. MultipleInputs: A helper class rather than an InputFormat itself. It lets you specify a different InputFormat and Mapper for each input path. This is useful for joining two different types of data (e.g., JSON logs and CSV user data), provided all Mappers emit the same output key and value types.
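Two of the ideas above can be shown in a few lines of plain Java. This is a rough sketch, not Hadoop code: lineRecords imitates the (byte offset, line) pairs that TextInputFormat produces, and bundle imitates the bundling idea behind CombineFileInputFormat with a simple greedy rule. The class name, method names and the size cap are assumptions, and the real class also considers which nodes and racks hold the data, which is ignored here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InputFormatSketch {

    // TextInputFormat-style records: key = byte offset of the line, value = the line.
    static Map<Long, String> lineRecords(String fileContent) {
        byte[] data = fileContent.getBytes(StandardCharsets.UTF_8);
        Map<Long, String> records = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            // Close a record at each newline, and at end of file if text remains.
            if (atEnd ? start < i : data[i] == '\n') {
                records.put((long) start,
                        new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return records;
    }

    // Small-files idea: greedily bundle files into one split until a size cap is reached.
    static List<List<String>> bundle(Map<String, Long> fileSizes, long maxSplitBytes) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (Map.Entry<String, Long> f : fileSizes.entrySet()) {
            if (!current.isEmpty() && currentBytes + f.getValue() > maxSplitBytes) {
                splits.add(current);          // cap would be exceeded: close this split
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(f.getKey());
            currentBytes += f.getValue();
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // prints {0=id,name, 8=1,Asha, 15=2,Ravi}
        System.out.println(lineRecords("id,name\n1,Asha\n2,Ravi\n"));
    }
}
```

With five files of 40, 50, 30, 90 and 10 bytes and a cap of 100 bytes, bundle returns three splits instead of five, which is the whole point: fewer splits means fewer Mappers to start.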

4.5.3 The Concept of InputSplits

An InputSplit is a logical chunk of data that one Mapper processes.

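  • Split vs Block: A block is a physical piece of data on disk. An InputSplit is a logical reference to a range of the input (for a file: the path, a start offset, a length and the preferred hosts). Blocks are cut at fixed byte boundaries, so a record (like a long line of text) can span two blocks. In that case the RecordReader reads past the end of its split to finish the record, and the "extra" bytes may have to be pulled over the network from another DataNode.
  • The Data Locality Goal: By default the split size equals the HDFS block size, so each InputSplit usually lines up with one block and the scheduler can try to run the Mapper on a machine that already stores that block.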
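The boundary rule is easier to see in code. The sketch below is a simplified plain-Java imitation of how a line-based reader can handle splits, under these assumptions: the whole file is a byte array in memory, splits are fixed-size byte ranges, and SplitSketch with its methods is a hypothetical name. Each reader skips the first line of its split unless the split starts at byte 0, and keeps reading a line as long as that line starts at or before the end of the split, so every line is read exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitSketch {

    // Cut a file of fileLength bytes into {start, length} ranges of at most splitSize bytes.
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }

    // Index of the first byte after the next newline at or after 'from'.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    // Read the lines that belong to the split [start, start + length).
    static List<String> readSplit(byte[] data, long start, long length) {
        long end = start + length;
        int pos = (int) start;
        if (start != 0) {
            // The first (possibly partial) line belongs to the previous split's reader.
            pos = nextLineStart(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at or before 'end' is ours, even if it runs past 'end'.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        for (long[] s : computeSplits(data.length, 8)) {
            System.out.println("split@" + s[0] + " -> " + readSplit(data, s[0], s[1]));
        }
    }
}
```

With 8-byte splits over this 20-byte input, the first reader runs past byte 8 to finish "bravo", the second skips the tail of "bravo" and reads "charlie", and the third finds nothing left to read.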

4.5.4 Custom RecordReaders

If your data isn't line-based (e.g., a PDF, an image, or a multi-line XML element), you need a custom RecordReader, usually returned by a custom InputFormat.

  • Logic: It defines where a record starts and where it ends.
  • Progress: It reports to Hadoop what percentage of the split has been processed (0.0 to 1.0).
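Both responsibilities can be sketched in plain Java. The class below is a hypothetical illustration, not a Hadoop class: it borrows the method names nextKeyValue, getCurrentKey, getCurrentValue and getProgress from Hadoop's RecordReader, but works on an in-memory String that stands in for one split. It treats everything from a start tag to the matching end tag as one record, and it ignores records that cross split boundaries.

```java
// Hypothetical tag-delimited reader; the String stands in for the contents of one split.
class TagRecordReaderSketch {
    private final String data;
    private final String startTag;
    private final String endTag;
    private int pos = 0;              // how much of the split has been consumed
    private long currentKey = -1;     // character offset where the current record starts
    private String currentValue = null;

    TagRecordReaderSketch(String data, String startTag, String endTag) {
        this.data = data;
        this.startTag = startTag;
        this.endTag = endTag;
    }

    // Logic: a record starts at startTag and ends after endTag.
    // Returns false once no complete record is left in the split.
    boolean nextKeyValue() {
        int start = data.indexOf(startTag, pos);
        int end = (start < 0) ? -1 : data.indexOf(endTag, start + startTag.length());
        if (end < 0) {
            pos = data.length();      // nothing more to read: mark the split as done
            return false;
        }
        end += endTag.length();
        currentKey = start;
        currentValue = data.substring(start, end);
        pos = end;
        return true;
    }

    long getCurrentKey() {
        return currentKey;
    }

    String getCurrentValue() {
        return currentValue;
    }

    // Progress: fraction of the split consumed so far, from 0.0 to 1.0.
    float getProgress() {
        return data.isEmpty() ? 1.0f : (float) pos / data.length();
    }

    public static void main(String[] args) {
        String xml = "<row>\n  <id>1</id>\n</row>\n<row>\n  <id>2</id>\n</row>\n";
        TagRecordReaderSketch reader = new TagRecordReaderSketch(xml, "<row>", "</row>");
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " progress=" + reader.getProgress());
        }
    }
}
```

A production reader would track byte positions in the underlying stream rather than character positions in a String, and would need a rule for records that straddle two splits.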