Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

3.3 Design of HDFS & Core Concepts

Lesson 19 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

3.3.1 The HDFS Design Philosophy

The Hadoop Distributed File System (HDFS) is designed to store very large files across machines in a large cluster. It prioritizes Throughput over Latency.

  • Large Data Sets: Files are typically in the range of gigabytes to terabytes.
  • Streaming Data Access: Designed for "Write Once, Read Many times." It works best for batch processing rather than interactive user applications.
  • Hardware Failure: Assume nodes will crash. HDFS is built to be self-healing.

3.3.2 HDFS Key Concepts

1. Blocks

HDFS splits a file into large chunks, called Blocks.

  • Size: Default is typically 128 MB (much larger than the 4 KB blocks in a traditional OS).
  • Reason: To minimize the "seek time" and costs associated with metadata lookup for millions of small pieces.
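As a rough illustration of how a file maps onto blocks, the sketch below splits a byte range into fixed-size pieces. The function name split_into_blocks and the 300 MB file size are assumptions made for this example, not part of HDFS itself.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical default named above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The final block only covers the bytes that remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A hypothetical 300 MB file: two full blocks plus one 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset}, length={length // (1024 * 1024)} MB")
```

Note that the last block is shorter than 128 MB: a block only occupies as much space as the data it actually holds.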

2. Namenode and Datanodes

HDFS follows a Master/Slave architecture.

Component | Responsibility
Namenode | The Master. Manages the file system namespace and metadata (where blocks are located). It is a Single Point of Failure (SPOF) in older versions.
Datanode | The Slave. Stores the actual data blocks. They periodically send "Heartbeats" to the Namenode to say "I am alive."
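The heartbeat idea can be sketched as a small bookkeeping class on the Namenode side. This is a toy model only: the class name HeartbeatMonitor, the node ids and the 30-second timeout are all assumptions for illustration and are not HDFS defaults.

```python
class HeartbeatMonitor:
    """Toy Namenode-side view of which Datanodes are alive."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # datanode id -> time of last heartbeat

    def record_heartbeat(self, datanode_id, now):
        self.last_seen[datanode_id] = now

    def dead_nodes(self, now):
        # A node counts as dead once it has been silent longer than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # illustrative timeout
monitor.record_heartbeat("dn1", now=0)
monitor.record_heartbeat("dn2", now=0)
monitor.record_heartbeat("dn1", now=25)
print(monitor.dead_nodes(now=40))  # dn2 has been silent for 40 s
```

Passing the current time in as an argument keeps the sketch deterministic; a real monitor would read a clock instead.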

3.3.3 The Namenode's Persistence: FsImage and Edit Logs

Since the Namenode stores all metadata in memory for speed, it needs a way to persist this to disk.

  1. FsImage: A complete snapshot of the file system namespace at a specific point in time.
  2. Edit Log: A log that records every change (such as creating a file or deleting a folder) made to the file system since the last FsImage snapshot.
  • The Restart Process: On start, the Namenode reads the FsImage and then "replays" all the transactions in the Edit Log to get back to the current state.
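The restart process can be sketched as "load the snapshot, then apply each logged change in order". The model below is deliberately simplified: the namespace is just a set of paths, and the operation names "create" and "delete" are assumptions for this example rather than the real edit log format.

```python
def replay(fsimage, edit_log):
    """Rebuild the current namespace from a snapshot plus logged changes."""
    namespace = set(fsimage)  # copy so the snapshot itself is untouched
    for operation, path in edit_log:
        if operation == "create":
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return namespace


fsimage = {"/data", "/data/a.csv"}  # snapshot on disk
edit_log = [
    ("create", "/data/b.csv"),
    ("delete", "/data/a.csv"),
    ("create", "/logs"),
]
current = replay(fsimage, edit_log)
print(sorted(current))
```

Because the log is applied strictly in order, the result is the same state the Namenode held in memory before it stopped.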

3.3.4 The Secondary Namenode (The Checkpointer)

Contrary to its name, the Secondary Namenode is NOT a backup for the Namenode. Its main job is to merge the FsImage and Edit Log periodically.

  • Why?: If the Edit Log grows too large, the Namenode can take a very long time to restart, because every logged transaction has to be replayed.
  • How?: The Secondary Namenode pulls the FsImage and Edit Log, merges them into a "new" FsImage, and sends it back to the Primary Namenode.
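A checkpoint can be pictured as folding the Edit Log into a fresh FsImage so that the log can start empty again. The sketch below uses a toy namespace (a set of paths); the helper names apply_edits and checkpoint and the 1000 sample edits are assumptions for illustration.

```python
def apply_edits(namespace, edits):
    """Apply (operation, path) edits to a copy of the namespace."""
    result = set(namespace)
    for operation, path in edits:
        if operation == "create":
            result.add(path)
        elif operation == "delete":
            result.discard(path)
    return result


def checkpoint(fsimage, edit_log):
    """Merge the log into a new snapshot and return it with an empty log."""
    new_fsimage = apply_edits(fsimage, edit_log)
    return new_fsimage, []


fsimage = {"/data"}
edit_log = [("create", f"/data/part-{i}") for i in range(1000)]

new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
print(len(edit_log), "edits to replay before the checkpoint")
print(len(new_edit_log), "edits to replay after the checkpoint")
print(len(new_fsimage), "paths in the merged FsImage")
```

The namespace is identical either way; the checkpoint only moves the replay work away from restart time.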

3.3.5 Why 128 MB blocks?

In a typical OS, blocks are 4 KB. If HDFS used 4 KB blocks:

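  • Metadata Explosion: To store a 100 GB file, the Namenode would need to track roughly 25 million blocks instead of about 800, and every tracked block consumes Namenode memory.
  • Seek Time: With 128 MB blocks, the time to transfer the data is much larger than the time to find (seek) the block on the disk, so about 99% of the read time is spent on transfer (assuming, for example, a 10 ms seek and a 100 MB/s transfer rate).
  • Rack Awareness: A related placement policy rather than a reason for the block size. Hadoop tries to put copies of each block on different "Racks" (collections of servers) so that even if the power supply or network switch for an entire rack fails, the data is still accessible from another rack.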
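The block-count and seek-time arguments above are simple arithmetic, which can be checked with a short sketch. The disk figures (10 ms seek, 100 MB/s transfer) are assumed values chosen for illustration, and the function names are hypothetical. With binary units the 4 KB case comes to about 26 million blocks, in line with the rough 25 million figure above.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_size, block_size):
    """Number of blocks needed, rounding the last partial block up."""
    return -(-file_size // block_size)  # ceiling division


def transfer_efficiency(block_size, seek_seconds, bytes_per_second):
    """Fraction of total read time spent transferring rather than seeking."""
    transfer_seconds = block_size / bytes_per_second
    return transfer_seconds / (seek_seconds + transfer_seconds)


file_size = 100 * GB
print(block_count(file_size, 4 * KB))    # 26214400 blocks at 4 KB
print(block_count(file_size, 128 * MB))  # 800 blocks at 128 MB

# Assumed disk: 10 ms seek, 100 MB/s sustained transfer.
for size in (4 * KB, 128 * MB):
    eff = transfer_efficiency(size, seek_seconds=0.010, bytes_per_second=100 * MB)
    print(f"{size} bytes per block: {eff:.2%} of read time is transfer")
```

Under these assumptions a 4 KB block spends almost all of its read time seeking, while a 128 MB block spends just over 99% of it transferring data.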

3.3.6 Advanced HDFS: Erasure Coding vs Replication

Standard replication (3x) has a 200% storage overhead. Modern Hadoop (3.x) also supports Erasure Coding as an alternative.

  • Concept: Instead of full copies, it uses mathematical parity (like RAID-6).
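  • Overhead: With a layout of 6 data units plus 3 parity units (Reed-Solomon), overhead drops from 200% to 50%, and the data survives the loss of any 3 units (3x replication survives the loss of 2 copies).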
  • Trade-off: Requires much more CPU to recalculate data if a node fails.
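The overhead figures and the idea of rebuilding lost data from parity can be sketched in a few lines. The recovery part uses a single XOR parity block, which is a deliberate simplification chosen for readability; it is not the Reed-Solomon scheme and tolerates only one loss. The sample byte strings are made up for this example.

```python
from functools import reduce


def storage_overhead(data_units, extra_units):
    """Extra storage as a fraction of the original data."""
    return extra_units / data_units


print(f"3x replication: {storage_overhead(1, 2):.0%}")    # 2 extra copies per block
print(f"6 data + 3 parity: {storage_overhead(6, 3):.0%}")


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


# Single-parity toy example: parity = d0 XOR d1 XOR d2.
data = [b"HDFS", b"uses", b"EC!!"]
parity = reduce(xor_bytes, data)

# Lose data[1], then rebuild it from the survivors plus parity.
recovered = reduce(xor_bytes, [data[0], data[2], parity])
print(recovered)
```

The rebuild step has to read every surviving unit and recompute the missing one, which is where the extra CPU cost in the trade-off comes from.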

3.3.7 Monitoring and Management

Administrators manage HDFS using several tools:

  • Namenode UI: A web interface (port 9870 by default in Hadoop 3.x; 50070 in 2.x) that shows cluster health, safe mode status, and dead nodes.
  • JMX Metrics: Java Management Extensions allow systems like Prometheus to monitor JVM memory usage and RPC latency.
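As a hedged sketch of what a monitoring script might do with such metrics, the example below summarizes a JMX-style JSON document. Everything about the payload is an assumption for illustration: the sample is hand-written, the first bean name is hypothetical, and the attribute names NumLiveDataNodes and NumDeadDataNodes should be checked against what a real cluster actually exposes.

```python
import json


def summarize_jmx(payload):
    """Pull a few values out of a JMX-style JSON document."""
    summary = {}
    for bean in payload.get("beans", []):
        for key in ("NumLiveDataNodes", "NumDeadDataNodes", "HeapMemoryUsage"):
            if key in bean:
                summary[key] = bean[key]
    return summary


# Hand-written sample in the shape of a JMX JSON response; not real cluster output.
sample = json.loads("""
{"beans": [
  {"name": "example:type=NamenodeState", "NumLiveDataNodes": 3, "NumDeadDataNodes": 1},
  {"name": "java.lang:type=Memory", "HeapMemoryUsage": {"used": 123456, "max": 999999}}
]}
""")
print(summarize_jmx(sample))
```

In practice the JSON would be fetched from the cluster over HTTP, and a tool such as Prometheus would scrape and store the values instead of printing them.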

3.3.8 High Availability (HA) Architecture

In Hadoop 1.x, the NameNode was a Single Point of Failure. In 2.x and 3.x, we use Active/Standby models.

  1. Shared Edits Directory: Both NameNodes have access to shared storage (like NFS) or to a quorum of JournalNodes accessed through the Quorum Journal Manager (QJM).
  2. ZK Failover Controller (ZKFC): A ZooKeeper client that monitors NameNode health and handles the automatic transition from Standby to Active if the Active NameNode fails.
  3. Fencing: Ensuring that the "old" Active NameNode is actually dead before the new one takes over, preventing "Split Brain" scenarios where two masters try to write to the cluster.
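The ordering that matters in a failover (detect failure, fence the old Active, only then promote the Standby) can be sketched as below. This is a toy model, not how ZKFC is implemented: the NameNode class, the fence and failover functions, and the node names are all hypothetical.

```python
class NameNode:
    def __init__(self, name, state):
        self.name = name
        self.state = state  # "active", "standby", or "fenced"
        self.healthy = True


def fence(node):
    """Stand-in for a real fencing action (e.g. cutting the node off)."""
    node.state = "fenced"
    return True  # a real fencing method can fail; this toy one cannot


def failover(active, standby):
    """Promote the standby only after the old active has been fenced."""
    if active.healthy:
        return active  # nothing to do
    if not fence(active):
        raise RuntimeError("fencing failed; refusing to promote standby")
    standby.state = "active"
    return standby


nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")

nn1.healthy = False  # health check fails
current = failover(nn1, nn2)
print(current.name, nn1.state, nn2.state)
```

Refusing to promote the Standby when fencing fails is the design choice that prevents Split Brain: at no point can both nodes be in the "active" state.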