Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

3.6 Serialization, Avro & Data Structures

Lesson 22 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

3.6.1 What is Serialization?

Serialization is the process of turning an object in memory (like a Java object) into a stream of bytes that can be sent over the network or saved to disk. Deserialization is the reverse: rebuilding the object from those bytes.
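The round trip can be sketched in a few lines. This is a minimal illustration in Python (not Java) using only the standard `struct` module; the record layout (a 2-byte length, a UTF-8 name, then a 4-byte age) and the function names `serialize` and `deserialize` are assumptions made for the example, not a Hadoop format.

```python
import struct

def serialize(name, age):
    """Pack a (name, age) record into bytes: 2-byte length, UTF-8 name, 4-byte age."""
    encoded = name.encode("utf-8")
    # ">" means big-endian with no padding between fields.
    return struct.pack(f">H{len(encoded)}sI", len(encoded), encoded, age)

def deserialize(data):
    """Rebuild the (name, age) record from bytes produced by serialize()."""
    (length,) = struct.unpack_from(">H", data, 0)
    name_bytes, age = struct.unpack_from(f">{length}sI", data, 2)
    return name_bytes.decode("utf-8"), age

payload = serialize("Asha", 21)   # bytes that could be written to disk or a socket
restored = deserialize(payload)   # the same record, rebuilt from those bytes
print(len(payload), restored)
```

The in-memory record and the bytes are two representations of the same data; only the bytes can cross a network or be stored in a file.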

Why doesn't Hadoop use standard Java Serialization?

Java's built-in serialization is bulky (it writes class names and other metadata into the stream) and is tied to Java, so exchanging data with programs in other languages (e.g., Java to Python) is difficult.
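The overhead argument can be illustrated by analogy. Python's `pickle` module plays a role similar to Java's built-in serialization: it is general-purpose and embeds the type name in the stream. The sketch below is an analogy only, not a measurement of Java; the compact layout is a hypothetical one, and `types.SimpleNamespace` stands in for a user-defined class.

```python
import pickle
import struct
from types import SimpleNamespace

# A generic record object standing in for a user-defined class.
user = SimpleNamespace(name="Asha", age=21)

# General-purpose object serialization: the stream carries the type name
# and the attribute names alongside the values.
pickled = pickle.dumps(user)

# Hand-written compact encoding (hypothetical layout):
# 2-byte length, UTF-8 name, 4-byte age. No type or field names are stored.
name_bytes = user.name.encode("utf-8")
compact = struct.pack(f">H{len(name_bytes)}sI", len(name_bytes), name_bytes, user.age)

print(len(pickled), len(compact))
print(b"SimpleNamespace" in pickled)  # the type name is embedded in the bytes
```

Multiplied across billions of records, per-object metadata of this kind is the cost that purpose-built formats try to avoid.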

3.6.2 Apache Avro

Avro is a language-neutral data serialization system that was created as part of the Hadoop project.

  • Rich Schema: Uses JSON to define the data structure.
  • Binary Format: Compact and fast.
  • Schema Evolution: You can add or remove fields in a new version of the schema. As long as those fields have default values, new code can still read old records (and old code can read new records).
  • Language Neutral: A Java program can write Avro data that a Python or C++ program can read natively.
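The points above can be sketched without the Avro library. The example below hand-rolls a small part of Avro's binary encoding (zigzag variable-length integers and length-prefixed UTF-8 strings) and a simplified version of schema resolution by field name. The two schemas, the `User` record, and all function names are hypothetical; real code would use an Avro library, which also supports many more types.

```python
import json

# Writer schema (version 1) and reader schema (version 2, which adds a field
# with a default). Both are hypothetical, written in Avro's JSON schema syntax.
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "name", "type": "string"},
  {"name": "age",  "type": "int"}
]}""")
READER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "name", "type": "string"},
  {"name": "age",  "type": "int"},
  {"name": "city", "type": "string", "default": "unknown"}
]}""")

def encode_long(n):
    """Zigzag + variable-length encoding, as Avro uses for int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(data, pos):
    """Inverse of encode_long; returns (value, next position)."""
    shift, z = 0, 0
    while True:
        b = data[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos

def encode_record(schema, record):
    """Concatenate field values in schema order; no field names are written."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            raw = value.encode("utf-8")
            out += encode_long(len(raw)) + raw
        else:  # "int" in this sketch
            out += encode_long(value)
    return bytes(out)

def decode_record(writer_schema, reader_schema, data):
    """Decode with the writer schema, then fill reader-only fields from defaults."""
    pos, written = 0, {}
    for field in writer_schema["fields"]:
        if field["type"] == "string":
            length, pos = decode_long(data, pos)
            written[field["name"]] = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            written[field["name"]], pos = decode_long(data, pos)
    return {f["name"]: written.get(f["name"], f.get("default"))
            for f in reader_schema["fields"]}

payload = encode_record(WRITER_SCHEMA, {"name": "Asha", "age": 21})
print(payload.hex())  # 08417368612a
print(decode_record(WRITER_SCHEMA, READER_SCHEMA, payload))
```

The record takes 6 bytes because the schema, not the data, carries the field names and types. The old record is still readable under the newer schema because the added `city` field declares a default.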

3.6.3 Serialization Comparison Table

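Feature          | Java Writable | Avro              | Protocol Buffers (Google)
-----------------|---------------|-------------------|--------------------------
Language Support | Java-only     | Multi-language    | Multi-language
Schema Storage   | Not required  | Included in file  | Separate .proto file
Speed            | Medium        | Very Fast         | Fast
Compactness      | Bulky         | Extremely Compact | Compact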

3.6.4 File-Based Data Structures

To handle massive datasets efficiently, Hadoop uses specialized file formats.

  1. SequenceFiles:
  • Binary format for storing Key-Value pairs.
  • Support compression at both the record and block levels.
  • The "old" workhorse of Hadoop.
  2. MapFiles:
  • A sorted SequenceFile with an index added to it, allowing faster lookup of specific keys.
  3. Columnar Formats (Parquet/ORC):
  • Store data by columns instead of rows.
  • Analogy: If you only need to average the "Ages" of a million users, a columnar format reads only the "Age" column and skips the Names, Addresses, etc. For analytical queries that touch only a few columns, this can be many times faster than reading every row in full.
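The row-versus-column analogy can be made concrete with a toy in-memory model. This is only a sketch: the three user records are made up, and counting "values touched" is a stand-in for the disk reads that real formats such as Parquet or ORC save.

```python
# Three hypothetical user records.
users = [
    {"name": "Asha",   "city": "Rohtak", "age": 21},
    {"name": "Vikram", "city": "Hisar",  "age": 34},
    {"name": "Meena",  "city": "Karnal", "age": 29},
]

# Row layout: all values of one record are stored together.
row_store = [(u["name"], u["city"], u["age"]) for u in users]

# Columnar layout: all values of one column are stored together.
column_store = {
    "name": [u["name"] for u in users],
    "city": [u["city"] for u in users],
    "age":  [u["age"] for u in users],
}

def average_age_rows(rows):
    """Average the ages from row storage; every value in every row is read."""
    values_touched = 0
    total = 0
    for row in rows:
        values_touched += len(row)  # name and city are read even though unused
        total += row[2]
    return total / len(rows), values_touched

def average_age_columns(columns):
    """Average the ages from columnar storage; only the age column is read."""
    ages = columns["age"]
    return sum(ages) / len(ages), len(ages)

print(average_age_rows(row_store))        # (28.0, 9)
print(average_age_columns(column_store))  # (28.0, 3)
```

Both layouts give the same answer, but the columnar version touches one third of the values here; the gap widens as records gain more columns.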

Unit III Checklist:

  • [x] Explain why "Data Locality" is important in Big Data.
  • [x] Differentiate between Hadoop Streaming and Hadoop Pipes.
  • [x] Describe the roles of Namenode and Datanode in HDFS.
  • [x] Explain the Read and Write data flow in HDFS.
  • [x] Identify the benefits of splittable compression codecs.
  • [x] Understand why Apache Avro is preferred over Java Serialization.