Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

3.1 Data Format & Analyzing Data with Hadoop

Lesson 17 of 36 in the free Big Data-1 notes on Siksha Sarovar, written by Rohit Jangra.

3.1.1 The Challenge of Diverse Data Formats

In the Big Data world, data arrives in many formats, ranging from structured logs to unstructured social media feeds. Hadoop must be able to parse and analyze this "data swamp" efficiently, so the choice of storage format directly affects how fast a job can read its input.

Format Category | Description                       | Examples / Typical Use
----------------|-----------------------------------|-----------------------
Text-Based      | Human-readable but bulky.         | CSV, JSON, XML, TXT.
Sequence Files  | Binary files for Key-Value pairs. | Native Hadoop format.
Columnar        | Optimized for analytical queries. | Parquet, ORC.
Avro            | Schema-based serialization.       | Data exchange.
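
To make the difference between row-oriented text formats and a columnar layout concrete, here is a minimal sketch in plain Python. It does not reproduce the real on-disk layout of Parquet or ORC; it only illustrates the idea that a columnar layout keeps all values of one field together. The sample records and field names are invented for illustration.

```python
import csv
import io
import json

# Hypothetical sample records; the field names are made up for illustration.
records = [
    {"user": "asha", "city": "Delhi", "amount": 120},
    {"user": "ravi", "city": "Pune", "amount": 80},
    {"user": "meena", "city": "Delhi", "amount": 200},
]


def to_csv(rows):
    """Text-based, row-oriented: one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["user", "city", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(rows):
    """Text-based, row-oriented: one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


def to_columns(rows):
    """Column-oriented layout: all values of one field are stored together."""
    return {key: [row[key] for row in rows] for key in rows[0]}


csv_text = to_csv(records)
json_text = to_json_lines(records)
columns = to_columns(records)

# An analytical query such as "total amount" only needs one column here,
# instead of parsing every field of every row.
total_amount = sum(columns["amount"])

print(csv_text)
print(json_text)
print(columns)
print(total_amount)
```

In the row-oriented versions, computing the total means reading every line and skipping the fields that are not needed. In the columnar version the query reads a single list, which is the main reason columnar formats suit analytical workloads.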

3.1.2 Analyzing Data with Hadoop

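Analysis in Hadoop typically follows a "Divide and Conquer" approach using the MapReduce paradigm: the input is divided into independent pieces that map tasks process in parallel, and reduce tasks then combine the intermediate results into the final answer.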
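
The classic illustration of this paradigm is a word count. The sketch below imitates the map, shuffle and reduce steps in a single Python process; it does not use the Hadoop API, and the function names and sample lines are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster each line (or block of lines) could be mapped on a
    # different machine; here everything runs in one local loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample_lines = ["big data needs big tools", "hadoop handles big data"]
counts = word_count(sample_lines)
print(counts)
```

Because each call to `map_phase` depends only on its own line, the map work can be spread over many machines without coordination. Only the shuffle step needs to bring together values that share a key.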

  • Data Locality: One of Hadoop's core innovations. It moves the code to the data, rather than moving massive datasets to a central server. This minimizes network traffic.
  • Logical vs. Physical View: Users see a single giant file, but physically, HDFS stores it as fixed-size blocks (128 MB by default in Hadoop 2 and later) distributed across many machines, potentially thousands.
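  • Job Tracker & Task Tracker: (Classic Hadoop) In Hadoop 1, the JobTracker (master) breaks a job into map and reduce "tasks" and assigns them to TaskTrackers (workers), which run the tasks and report their progress back. In Hadoop 2 and later, YARN takes over this resource management and scheduling role.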
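
The three ideas above can be sketched together: one logical file becomes several physical blocks, the blocks are spread over nodes, and each task is scheduled on the node that already holds its block. This is a simplified model, not Hadoop's scheduler. The round-robin placement and the node names are assumptions, and real HDFS placement is replicated and rack-aware.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks that one logical file is stored as."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_blocks(num_blocks, nodes):
    """Assign each block to a node round-robin (a deliberate simplification)."""
    return {block_id: nodes[block_id % len(nodes)] for block_id in range(num_blocks)}


def schedule_tasks(placement):
    """Data locality: run each block's task on the node that stores the block."""
    tasks = {}
    for block_id, node in placement.items():
        tasks.setdefault(node, []).append(block_id)
    return tasks


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
blocks = split_into_blocks(file_size_mb=1000)
placement = place_blocks(len(blocks), nodes)
tasks = schedule_tasks(placement)

print(blocks)  # logical 1000 MB file seen as physical blocks
print(tasks)   # which block-level tasks each node runs locally
```

In this model no block ever crosses the network to be processed: only the small task description travels to the node. A real scheduler also has to handle the case where the local node is busy and the task must run elsewhere.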

3.1.3 Scaling Out (The Horizontal Path)

Scaling Out refers to adding more machines to a cluster, rather than making existing machines more powerful.

  1. Linear Scalability: Doubling the cluster size ideally halves the processing time.
  2. Commodity Hardware: Hadoop is designed to run on cheap, standard servers. It assumes hardware will fail and builds reliability into the software.
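  3. High Availability: By storing several copies of each data block on different machines (replication, three copies by default in HDFS), Hadoop ensures that the loss of a few servers doesn't stop the analysis, because tasks can read a surviving copy.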
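
The effect of replication on server failures can be checked with a small simulation. This is a sketch under stated assumptions: a hypothetical 5-node cluster, a simple rotating placement instead of HDFS's rack-aware policy, and a replication factor of 3, which matches the HDFS default.

```python
def replicate_blocks(num_blocks, nodes, replication=3):
    """Store each block on `replication` distinct nodes using a simple rotation.

    Assumes replication <= len(nodes), otherwise copies would repeat on a node.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = [f"node-{i}" for i in range(5)]  # hypothetical 5-node cluster
placement = replicate_blocks(num_blocks=10, nodes=nodes, replication=3)

# With three copies per block, losing any two nodes leaves every block readable.
after_two_failures = readable_blocks(placement, ["node-0", "node-1"])

# Losing all three nodes that hold the same block makes that block unreadable.
after_three_failures = readable_blocks(placement, ["node-0", "node-1", "node-2"])

print(len(after_two_failures), len(after_three_failures))
```

The simulation shows why replication tolerates "a few" failures rather than any number: a block is lost only when every node holding one of its copies fails. Real HDFS also re-creates missing copies on healthy nodes after a failure, which this sketch does not model.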