Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

6. Web Scraping

Lesson 6 of 21 in the free Machine Learning notes on Siksha Sarovar, written by Rohit Jangra.

What is Web Scraping?

Web scraping is the automated process of extracting information from websites. It turns loosely structured web content (HTML pages) into structured data (such as rows in a database or a CSV file) that can be analyzed or used to train machine learning models.

Legal & Ethical Note: Always check a website's robots.txt file (e.g., google.com/robots.txt) and Terms of Service before scraping. Respect rate limits and space out your requests so that you do not overload the server.
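
The robots.txt check can be automated with Python's standard library. The sketch below is only an illustration: the robots.txt content, the `my-study-bot` user-agent string and the example.com URLs are made up, and the rules are inlined so the code runs without a network connection. For a live site you would probably call `set_url()` and `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-study-bot"  # hypothetical user-agent name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
# For a real site, replace the parse() call with something like:
#   parser.set_url("https://example.com/robots.txt")
#   parser.read()


def allowed(url: str) -> bool:
    """Return True if the robots.txt rules permit fetching this URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(default_seconds: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if given."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


print(allowed("https://example.com/articles/1"))    # path not disallowed
print(allowed("https://example.com/private/data"))  # under /private/
print(polite_delay())
```

In a scraping loop, calling `time.sleep(polite_delay())` between requests is one simple way to respect the rate limit. Note that robots.txt does not replace reading the Terms of Service.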

Key Python Libraries

  1. Requests: A widely used third-party library for sending HTTP requests (GET/POST) to fetch web pages.
  2. Beautiful Soup (bs4): Excellent for parsing HTML and navigating the parse tree (finding tags like <div>, <table>).
  3. Selenium: Used for scraping dynamic (JavaScript-heavy) websites by automating a real web browser.
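
To give a feel for what "navigating the parse tree" means, here is a small Beautiful Soup sketch. The HTML snippet, class names and table contents are invented for illustration and inlined so that no request is needed. In practice the HTML string would typically come from Requests (`response.text`) or, for JavaScript-heavy pages, from Selenium (`driver.page_source`).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, inlined so the example needs no network access.
HTML = """
<html><body>
  <div class="course"><h2>Python</h2><span class="level">Beginner</span></div>
  <div class="course"><h2>Machine Learning</h2><span class="level">Advanced</span></div>
  <table id="scores">
    <tr><th>Name</th><th>Score</th></tr>
    <tr><td>Asha</td><td>91</td></tr>
    <tr><td>Ravi</td><td>84</td></tr>
  </table>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(HTML, "html.parser")

# find_all() returns every matching tag; class_ is used because
# "class" is a reserved word in Python.
courses = [
    {
        "title": div.h2.get_text(strip=True),
        "level": div.find("span", class_="level").get_text(strip=True),
    }
    for div in soup.find_all("div", class_="course")
]

# Walk the table row by row, skipping the header row (it has <th>, not <td>).
table = soup.find("table", id="scores")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td")
]

print(courses)
print(rows)
```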

The Scraping Workflow

  1. Inspect: Use browser Developer Tools (F12) to find the HTML structure of the data.
  2. Request: Fetch the page's HTML source using requests.
  3. Parse: Extract specific elements using BeautifulSoup.
  4. Store: Save the data to CSV, JSON, or a database.
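
The Request, Parse and Store steps above can be sketched as three small functions. This is a minimal illustration, not a complete scraper: the URL, the `li.book`, `.title` and `.price` selectors, the user-agent string and the `books.csv` file name are all assumptions, and a real page would need selectors found in the Inspect step. The demo at the bottom parses an inlined sample so it runs without touching the network.

```python
import csv

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Request step: download the page and return its HTML source."""
    response = requests.get(
        url,
        headers={"User-Agent": "my-study-bot"},  # hypothetical user agent
        timeout=timeout,
    )
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def parse_books(html: str) -> list:
    """Parse step: pull title and price out of each (assumed) li.book element."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for item in soup.select("li.book"):
        books.append({
            "title": item.select_one(".title").get_text(strip=True),
            "price": float(item.select_one(".price").get_text(strip=True)),
        })
    return books


def write_csv(rows: list, file_obj) -> None:
    """Store step: write the parsed rows as CSV with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical page content standing in for a real response.
SAMPLE_HTML = """
<ul>
  <li class="book"><span class="title">Clean Code</span><span class="price">450.0</span></li>
  <li class="book"><span class="title">Deep Learning</span><span class="price">899.5</span></li>
</ul>
"""

if __name__ == "__main__":
    # On a real site this would be: html = fetch_html("https://example.com/books")
    books = parse_books(SAMPLE_HTML)
    with open("books.csv", "w", newline="", encoding="utf-8") as f:
        write_csv(books, f)
    print(f"Saved {len(books)} rows to books.csv")
```

Keeping the three steps in separate functions makes it easy to test the parsing logic on saved HTML before sending any requests to a live server.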