Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

3.4 Data Flow and the Java Interface

Lesson 20 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

3.4.1 The HDFS Java Interface

Hadoop is written in Java, and its Java API is the most direct and complete way to interact with HDFS. The core class is the abstract class org.apache.hadoop.fs.FileSystem.

Key Methods (Conceptual):

  • open(): Opens a stream to read a file.
  • create(): Creates a new file.
  • delete(): Removes a file or directory.
  • listStatus(): Lists the files in a directory.
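
The four methods above can be exercised together in a short round trip. This is a minimal sketch, not a production client: the class name FileSystemSketch, the method roundTrip and the path /tmp/fs-sketch-demo are illustrative names, and it assumes the Hadoop client libraries are on the classpath. With no cluster configuration present, FileSystem.get(conf) falls back to the local file system, so the sketch can be tried without a running cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class FileSystemSketch {

    /** Writes a small file, lists the directory, reads the file back, then cleans up. */
    static String roundTrip(FileSystem fs, Path dir) throws IOException {
        Path file = new Path(dir, "hello.txt"); // hypothetical file name

        // create(): returns an output stream to a new file (overwrites by default)
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // listStatus(): one FileStatus entry per child of the directory
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }

        // open(): returns an input stream for reading the file
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, buffer, 4096, false);
        }

        // delete(): the second argument enables recursive deletion of a directory
        fs.delete(dir, true);

        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Picks up core-site.xml if present; otherwise uses the local file system
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println(roundTrip(fs, new Path("/tmp/fs-sketch-demo")));
    }
}
```

The same code runs unchanged against a cluster once fs.defaultFS in the configuration points at the Namenode, because the caller only ever sees the FileSystem abstraction.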

3.4.2 The HDFS Command Line Interface

While Java is the internal engine, most users interact via the shell.

  • hdfs dfs -ls /: List the root directory.
  • hdfs dfs -put localfile /hdfsdir: Upload a file from your computer to the cluster.
  • hdfs dfs -get /hdfsfile localdir: Download a file from the cluster.
  • hdfs dfs -du -h /data: Show disk usage in a human-readable format.
  • hdfs dfs -setrep -w 5 /critical_data: Set the replication factor of the path to 5 and wait (-w) until replication completes.

3.4.3 Anatomy of a File Read/Write

How does data actually move between the client and the cluster?

The Read Flow:

  1. Request: Client asks the Namenode for the block locations of a file.
  2. Metadata: Namenode returns a list of Datanodes that have the blocks, sorted by distance from the client.
  3. Direct Transfer: Client connects directly to the nearest Datanode to stream the data. This keeps the Namenode from becoming a bottleneck.
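
The three read steps can be modelled with plain Java collections. This is a toy simulation rather than Hadoop code: ReadFlowSketch, Namenode, Replica and planRead are hypothetical names, and the distances 0, 2 and 4 are illustrative values for "same node", "same rack" and "different rack".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReadFlowSketch {

    /** One replica location: a Datanode name and its network distance from the client. */
    static final class Replica {
        final String datanode;
        final int distance;

        Replica(String datanode, int distance) {
            this.datanode = datanode;
            this.distance = distance;
        }
    }

    /** Toy Namenode: holds metadata only (block id -> replica locations), never file bytes. */
    static final class Namenode {
        private final Map<String, List<Replica>> blockMap = new LinkedHashMap<>();

        void register(String blockId, Replica... replicas) {
            blockMap.put(blockId, Arrays.asList(replicas));
        }

        /** Steps 1 and 2: answer the client with each block's replicas, nearest first. */
        Map<String, List<Replica>> getBlockLocations() {
            Map<String, List<Replica>> sorted = new LinkedHashMap<>();
            for (Map.Entry<String, List<Replica>> entry : blockMap.entrySet()) {
                List<Replica> copy = new ArrayList<>(entry.getValue());
                copy.sort(Comparator.comparingInt((Replica r) -> r.distance));
                sorted.put(entry.getKey(), copy);
            }
            return sorted;
        }
    }

    /** Step 3: the client reads each block from the first (nearest) Datanode in the list. */
    static List<String> planRead(Namenode namenode) {
        List<String> plan = new ArrayList<>();
        for (Map.Entry<String, List<Replica>> entry : namenode.getBlockLocations().entrySet()) {
            plan.add(entry.getKey() + "@" + entry.getValue().get(0).datanode);
        }
        return plan;
    }

    public static void main(String[] args) {
        Namenode namenode = new Namenode();
        namenode.register("blk_1", new Replica("dn3", 4), new Replica("dn1", 0), new Replica("dn2", 2));
        namenode.register("blk_2", new Replica("dn2", 2), new Replica("dn4", 4));
        System.out.println(planRead(namenode)); // prints [blk_1@dn1, blk_2@dn2]
    }
}
```

Note that the Namenode object only ever hands out locations. The bytes would flow between the client and the chosen Datanode, which is why the Namenode does not become a bottleneck.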
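
The Write Flow:

  1. Request: Client asks the Namenode to create the file. The Namenode checks permissions and allocates Datanodes for the first block.
  2. Pipeline: Client streams the block to the first Datanode, which forwards it to the second, which forwards it to the third (with the default replication factor of 3).
  3. Ack: Once all three acknowledge, the client starts the next block.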
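
On the write side, each block travels through a chain of Datanodes, and the client moves on only after every replica has been acknowledged. The toy model below illustrates that rule. It is a simplification under stated assumptions: WritePipelineSketch, Datanode, receive and writeFile are hypothetical names, the same three nodes are reused for every block, and acknowledgements are counted per block, whereas real HDFS acknowledges smaller packets and may choose a different pipeline for each block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WritePipelineSketch {

    /** Toy Datanode: stores a block, forwards it downstream, then reports acks upstream. */
    static final class Datanode {
        final String name;
        final List<String> storedBlocks = new ArrayList<>();
        Datanode next; // next node in the pipeline, or null for the last one

        Datanode(String name) {
            this.name = name;
        }

        /** Returns how many replicas were acknowledged from this node to the end of the chain. */
        int receive(String blockId) {
            storedBlocks.add(blockId);
            int downstreamAcks = (next == null) ? 0 : next.receive(blockId);
            return 1 + downstreamAcks;
        }
    }

    /** Sends blocks one at a time; a block counts as written only when every node has acked. */
    static int writeFile(List<String> blockIds, List<Datanode> pipeline) {
        // Chain the nodes: client -> first -> second -> third
        for (int i = 0; i < pipeline.size() - 1; i++) {
            pipeline.get(i).next = pipeline.get(i + 1);
        }
        int written = 0;
        for (String blockId : blockIds) {
            int acks = pipeline.get(0).receive(blockId);
            if (acks != pipeline.size()) {
                throw new IllegalStateException("Missing ack for " + blockId);
            }
            written++; // all replicas confirmed, so the next block may start
        }
        return written;
    }

    public static void main(String[] args) {
        List<Datanode> pipeline = Arrays.asList(
                new Datanode("dn1"), new Datanode("dn2"), new Datanode("dn3"));
        int written = writeFile(Arrays.asList("blk_1", "blk_2"), pipeline);
        System.out.println(written + " blocks written to " + pipeline.size() + " replicas each");
    }
}
```

The client talks only to the first Datanode, so its upload bandwidth is spent once per block rather than once per replica.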

3.4.4 HDFS Storage Policies

Modern Hadoop (2.6 and later, including 3.x) allows for "Heterogeneous Storage": using different types of hardware for different data.

  • HOT: All replicas on Disk (Default).
  • COLD: All replicas on Archival storage (dense, low-cost storage for rarely accessed data).
  • WARM: One replica on Disk, the rest on Archival.
  • ALL_SSD: All replicas on SSD, for high-performance, low-latency applications.
  • LAZY_PERSIST: Write the first replica to RAM, then asynchronously to Disk.
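
One way to read the policy list is as a function from (policy, replication factor) to a storage type per replica. The sketch below encodes that mapping. StoragePolicySketch and placeReplicas are hypothetical names, not Hadoop classes, and the layout follows the HDFS archival storage documentation: WARM keeps one replica on DISK and the rest on ARCHIVE, while LAZY_PERSIST puts the first replica on RAM_DISK and the rest on DISK. Fallback behaviour when a storage type is full is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

class StoragePolicySketch {

    enum StorageType { RAM_DISK, SSD, DISK, ARCHIVE }

    enum Policy { HOT, COLD, WARM, ALL_SSD, LAZY_PERSIST }

    /** Returns the preferred storage type for each replica of a block, in replica order. */
    static List<StorageType> placeReplicas(Policy policy, int replication) {
        List<StorageType> placement = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            switch (policy) {
                case HOT:
                    placement.add(StorageType.DISK);
                    break;
                case COLD:
                    placement.add(StorageType.ARCHIVE);
                    break;
                case WARM:
                    // First replica stays on disk; the remaining ones go to archive
                    placement.add(i == 0 ? StorageType.DISK : StorageType.ARCHIVE);
                    break;
                case ALL_SSD:
                    placement.add(StorageType.SSD);
                    break;
                case LAZY_PERSIST:
                    // First replica lands in memory; the remaining ones go to disk
                    placement.add(i == 0 ? StorageType.RAM_DISK : StorageType.DISK);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown policy: " + policy);
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> " + placeReplicas(policy, 3));
        }
    }
}
```

With a replication factor of 3, WARM yields [DISK, ARCHIVE, ARCHIVE], which shows how the policy trades some read speed for cheaper capacity.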

3.4.5 HDFS Snapshotting

A Snapshot is a read-only point-in-time copy of the file system.

  • Efficient: It doesn't actually copy the data. It only records the metadata and prevents the blocks from being deleted or modified (Copy-on-Write).
  • Use Case: Accidentally deleted a database folder? You can recover it quickly from last night's snapshot.
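
The "metadata only" idea can be illustrated with a small in-memory model. This is a conceptual sketch, not how the Namenode is implemented: SnapshotSketch and its methods are hypothetical names, and copying the whole path table on each snapshot is a simplification (real HDFS records only the changes made after the snapshot). In a real cluster, snapshot contents are reached through the .snapshot directory of a snapshottable path.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SnapshotSketch {

    // Live namespace: path -> ids of the blocks that make up the file
    private final Map<String, List<String>> live = new TreeMap<>();

    // Snapshot name -> frozen copy of the path table (metadata only, no block data)
    private final Map<String, Map<String, List<String>>> snapshots = new LinkedHashMap<>();

    void putFile(String path, String... blockIds) {
        live.put(path, Collections.unmodifiableList(Arrays.asList(blockIds)));
    }

    /** Records which blocks each path points to; no block is copied. */
    void createSnapshot(String name) {
        snapshots.put(name, new TreeMap<>(live));
    }

    void delete(String path) {
        live.remove(path);
    }

    boolean exists(String path) {
        return live.containsKey(path);
    }

    /** Re-links a path from a snapshot back into the live namespace. */
    void restore(String snapshotName, String path) {
        Map<String, List<String>> snapshot = snapshots.get(snapshotName);
        if (snapshot == null || !snapshot.containsKey(path)) {
            throw new IllegalArgumentException("Not in snapshot: " + path);
        }
        live.put(path, snapshot.get(path));
    }

    /** Blocks still referenced by the live tree or any snapshot must not be reclaimed. */
    Set<String> retainedBlocks() {
        Set<String> retained = new TreeSet<>();
        for (List<String> blocks : live.values()) {
            retained.addAll(blocks);
        }
        for (Map<String, List<String>> snapshot : snapshots.values()) {
            for (List<String> blocks : snapshot.values()) {
                retained.addAll(blocks);
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        SnapshotSketch namespace = new SnapshotSketch();
        namespace.putFile("/db/users.dat", "blk_1", "blk_2");
        namespace.createSnapshot("nightly");
        namespace.delete("/db/users.dat");
        System.out.println(namespace.retainedBlocks()); // prints [blk_1, blk_2]
        namespace.restore("nightly", "/db/users.dat");
        System.out.println(namespace.exists("/db/users.dat")); // prints true
    }
}
```

Because the deleted file's blocks stay referenced by the snapshot, recovery is a metadata operation and its cost does not depend on how much data the folder held.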