Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Introduction to Big Data

Lesson 6 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

The term "Big Data" refers to datasets that are so large, fast-moving, or complex that traditional data management tools cannot process or analyze them. The concept emerged because a single-server relational database (such as MySQL or PostgreSQL) or a spreadsheet tool (such as Excel) cannot keep up with the volume, velocity, and variety of modern data.

Why Traditional Tools Fail

Consider these scenarios:

  • A social media platform generates 500 million tweets per day. A single Excel worksheet holds at most 1,048,576 rows, so even one day of this data does not fit.
  • A large stock exchange can generate millions of order and trade messages per second at peak. A traditional single-server database cannot process this in real time.
  • YouTube receives 720,000 hours of video uploads daily. This data is unstructured and cannot be stored in a simple table.

Big Data technologies (like Hadoop and Spark) were invented specifically to handle these challenges.
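To make the scale in these scenarios concrete, here is a small back-of-the-envelope calculation. It uses the figures quoted above plus one outside fact, the 1,048,576-row limit of a single Excel worksheet. The function names are illustrative only.

```python
import math

EXCEL_MAX_ROWS = 1_048_576        # row limit of one Excel worksheet
TWEETS_PER_DAY = 500_000_000      # figure quoted in the text
UPLOAD_HOURS_PER_DAY = 720_000    # figure quoted in the text

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def worksheets_needed(rows: int, max_rows: int = EXCEL_MAX_ROWS) -> int:
    """Number of full-size worksheets needed to hold `rows` rows."""
    return math.ceil(rows / max_rows)


def per_second(events_per_day: int) -> float:
    """Average event rate per second for a daily total."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # One day of tweets, one tweet per row
    print("Worksheets needed:", worksheets_needed(TWEETS_PER_DAY))
    # Average tweets arriving every second
    print("Tweets per second:", round(per_second(TWEETS_PER_DAY)))
    # Hours of video uploaded every minute
    print("Upload hours per minute:", UPLOAD_HOURS_PER_DAY / MINUTES_PER_DAY)
```

One day of tweets would need several hundred full worksheets, which is why such data is split across many machines rather than opened in a single desktop tool.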

---

The 5 Vs of Big Data

Traditionally, Big Data was defined by 3 Vs (Volume, Velocity, Variety). Modern definitions have expanded this to 5 Vs to capture the full picture:

| V | Name | Description | Example |
|---|---|---|---|
| V1 | Volume | The sheer amount of data generated | Facebook stores over 300 petabytes of data |
| V2 | Velocity | The speed at which data is generated and must be processed | Stock market data streams in milliseconds |
| V3 | Variety | The different types and formats of data | Text, images, videos, sensor readings, GPS data |
| V4 | Veracity | The trustworthiness and accuracy of data | Social media posts may contain misinformation |
| V5 | Value | The usefulness of the data after processing | Raw data is useless; insights have value |
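As a rough illustration of how the first four Vs can be turned into numbers, the sketch below profiles a tiny batch of events. The event list, the field names (`t`, `kind`, `payload`), and the validity rule are all hypothetical and chosen only for this example. Value is not computed, because it depends on the question the data is meant to answer.

```python
import json

# Hypothetical batch of events, ordered by arrival time `t` in seconds.
EVENTS = [
    {"t": 0.0, "kind": "text", "payload": "hello"},
    {"t": 0.5, "kind": "gps", "payload": {"lat": 28.6, "lon": 77.2}},
    {"t": 1.0, "kind": "sensor", "payload": {"temp_c": 31.5}},
    {"t": 1.5, "kind": "sensor", "payload": {"temp_c": None}},  # missing reading
    {"t": 2.0, "kind": "text", "payload": ""},                  # empty post
]


def is_valid(event):
    """Assumed rule: text must be non-empty, dict fields must not be None."""
    payload = event["payload"]
    if isinstance(payload, str):
        return payload != ""
    return all(value is not None for value in payload.values())


def profile(events):
    """Crude indicators for Volume, Velocity, Variety and Veracity.

    Assumes `events` is a non-empty list sorted by `t`.
    """
    span = events[-1]["t"] - events[0]["t"]
    valid = [e for e in events if is_valid(e)]
    return {
        # Volume: size of the batch when serialized as JSON
        "volume_bytes": sum(len(json.dumps(e).encode("utf-8")) for e in events),
        # Velocity: events per second over the observed time span
        "velocity_per_s": len(events) / span if span > 0 else float("inf"),
        # Variety: distinct kinds of data in the batch
        "variety": sorted({e["kind"] for e in events}),
        # Veracity: share of events that pass the validity rule
        "veracity": len(valid) / len(events),
    }


if __name__ == "__main__":
    for name, value in profile(EVENTS).items():
        print(name, "=", value)
```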

---

Big Data Ecosystem & Technologies

To handle Big Data, a specialized ecosystem of technologies has been developed:

Storage Technologies:

  • HDFS (Hadoop Distributed File System): Distributes data across multiple machines for fault-tolerant storage.
  • Amazon S3: Cloud-based object storage by AWS.
  • Google Cloud Storage / Azure Blob Storage: Cloud equivalents from Google and Microsoft.
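The idea behind distributed, fault-tolerant storage can be sketched in a few lines. The example below splits a file into fixed-size blocks and assigns each block to several machines. The 128 MB block size and replication factor of 3 match HDFS defaults, but the round-robin placement and the node names are simplifying assumptions for illustration; real HDFS uses a rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Map each block index to the nodes holding a copy of it.

    Simplified round-robin placement (an assumption, not the HDFS policy).
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    cluster = ["n1", "n2", "n3", "n4"]      # hypothetical node names
    placement = place_blocks(1000, cluster)  # a 1000 MB file
    for block, replicas in placement.items():
        print(f"block {block}: {replicas}")
    print("Readable if n2 fails:", readable_after_failure(placement, "n2"))
```

Because every block lives on three different machines, losing any single machine leaves the whole file readable.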

Processing Frameworks:

  • Apache Hadoop: The foundational Big Data framework using MapReduce for batch processing.
  • Apache Spark: Up to 100x faster than Hadoop MapReduce on some workloads because it keeps intermediate data in memory instead of writing it to disk. Supports batch, streaming, ML, and graph processing.
  • Apache Flink: Real-time stream processing.
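The MapReduce model mentioned above can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop code: the function names are illustrative, and a real cluster would run the map and reduce phases in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data is everywhere"]
    print(word_count(docs))
```

Each document can be mapped independently and each key can be reduced independently, which is what lets the framework spread the work across a cluster.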

Query Engines:

  • Apache Hive: SQL-like querying on Hadoop data.
  • Google BigQuery: Serverless, highly scalable data warehouse.
  • Presto: Distributed SQL query engine.
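The kind of ad-hoc aggregation these engines run can be shown with an ordinary SQL query. The sketch below uses Python's built-in `sqlite3` module purely as a small stand-in; it is not Hive, BigQuery, or Presto, and the `page_views` table and its columns are hypothetical. The point is only that the query style stays familiar while the engine underneath changes.

```python
import sqlite3


def top_pages(rows, limit=2):
    """Return the `limit` most viewed pages as (page, views) tuples."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE page_views (page TEXT, user_id INTEGER)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT page, COUNT(*) AS views
            FROM page_views
            GROUP BY page
            ORDER BY views DESC, page ASC
            LIMIT ?
            """,
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("/home", 1), ("/home", 2), ("/courses", 1),
        ("/home", 3), ("/courses", 2), ("/about", 1),
    ]
    print(top_pages(sample))
```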

Streaming Technologies:

  • Apache Kafka: Distributed event streaming platform for real-time data feeds.
  • Apache Storm: Real-time computation system.
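Stream processing differs from batch processing in that results are produced while data is still arriving. A common pattern is the tumbling window, which counts events in fixed, non-overlapping time slices. The sketch below is plain Python, not the Kafka, Storm, or Flink API. It assumes events arrive in timestamp order, and the event keys are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_s):
    """Yield (window_start, counts) each time a window closes.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed
    to arrive in timestamp order. Windows with no events are skipped.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        start = int(timestamp // window_s) * window_s
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete, so emit it right away
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Emit the last, possibly partial, window when the stream ends
        yield current_start, dict(counts)


if __name__ == "__main__":
    stream = [
        (0.2, "click"), (1.1, "view"), (4.9, "click"),
        (5.0, "click"), (9.9, "view"), (12.3, "view"),
    ]
    for window_start, counts in tumbling_window_counts(stream, window_s=5):
        print(f"window starting at {window_start}s: {counts}")
```

Because the function is a generator, each window is emitted as soon as the next one begins, without waiting for the whole stream to finish.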

Big Data Technology Comparison

| Technology | Type | Speed | Best For |
|---|---|---|---|
| Hadoop (MapReduce) | Batch Processing | Slower (Disk-based) | Large-scale batch jobs |
| Apache Spark | Batch + Streaming | Fast (In-memory) | General-purpose analytics |
| Apache Kafka | Streaming | Real-time | Event-driven architectures |
| Apache Flink | Streaming | Real-time | Complex event processing |
| Google BigQuery | Serverless DW | Fast | Ad-hoc SQL analytics |

---

Big Data in Everyday Life

  1. Google Search: Processes an estimated 8.5 billion searches per day, using Big Data to rank results.
  2. Netflix: Analyzes viewing habits of 230+ million subscribers to power recommendations.
  3. Weather Forecasting: Satellites and sensors generate terabytes of atmospheric data daily, processed using Big Data tools.
  4. Smart Cities: IoT sensors monitor traffic, air quality, and energy usage in real-time.

Summary

  • Big Data is data that exceeds the capacity of traditional tools due to its volume, velocity, and variety.
  • The 5 Vs (Volume, Velocity, Variety, Veracity, Value) define its characteristics.
  • Specialized tools like Hadoop, Spark, and Kafka are required to process Big Data.
  • Big Data is ubiquitous in modern life, from search engines and streaming recommendations to weather forecasting and smart cities.