Data Science Lifecycle

Lesson 3 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

The Data Science Lifecycle is the structured, iterative process that a data science project follows from inception to deployment. It is not strictly linear: teams often cycle back to earlier stages as they discover new information or encounter data quality issues.

Why is Understanding the Lifecycle Important?

Without a clear process, data science projects can become chaotic. The lifecycle provides a roadmap, ensuring that:

  • The right problem is being solved.
  • Data is properly prepared before modeling.
  • Results are validated and communicated effectively.
  • Solutions are deployed and maintained over time.

---

The 7 Stages of the Data Science Lifecycle

Stage 1: Business Understanding (Problem Definition)

  • Goal: Clearly define the business problem and translate it into a data science question.
  • This is arguably the most critical stage, and the one most often overlooked. A poorly defined problem leads to wasted effort and irrelevant models.

Key Activities:

  • Meeting stakeholders to understand objectives.
  • Defining success criteria (e.g., "Reduce customer churn by 15%").
  • Determining if the problem can actually be solved with data.

Example:

A telecom company experiences high customer churn. The business question becomes: "Can we predict which customers are likely to leave in the next 30 days?"
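
A success criterion such as "Reduce customer churn by 15%" is only useful once it is measurable. The sketch below is one possible reading, assuming the 15% is a relative reduction in the churn rate rather than 15 percentage points. The function names and all customer counts are hypothetical and chosen for illustration.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of customers lost during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Relative drop in churn, e.g. 0.15 means churn fell by 15%."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical figures, for illustration only.
baseline = churn_rate(customers_at_start=10_000, customers_lost=800)
after_model = churn_rate(customers_at_start=10_000, customers_lost=660)

reduction = relative_reduction(baseline, after_model)
target_met = reduction >= 0.15  # assumed success criterion: 15% relative drop

print(f"Churn fell by {reduction:.1%}; target met: {target_met}")
```

Agreeing on this definition with stakeholders up front avoids arguing later about whether the project succeeded.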

---

Stage 2: Data Acquisition (Data Mining)

  • Goal: Gather relevant data from all available sources.

Data Sources Include:

  • Internal Databases (SQL, NoSQL)
  • APIs (Twitter API, Weather API)
  • Web Scraping
  • Public Datasets (Kaggle, Government portals)
  • Sensor/IoT Data

Key Considerations:

  • Data volume, quality, and freshness.
  • Legal and ethical constraints (privacy laws like GDPR).
  • Data access permissions and security.
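
As a rough illustration of pulling data from an internal SQL database, the sketch below uses Python's built-in `sqlite3` module with an in-memory database standing in for a real source. The table name, column names and values are hypothetical. A real project would connect to its own database and would often load results into a library such as pandas.

```python
import sqlite3

# An in-memory database stands in for an internal SQL source.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, plan TEXT, monthly_charge REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "prepaid", 199.0), (2, "postpaid", 499.0), (3, "postpaid", None)],
)

# Return rows that can be addressed by column name.
conn.row_factory = sqlite3.Row

# Pull only the columns the project needs, as a list of dictionaries.
rows = [
    dict(r)
    for r in conn.execute(
        "SELECT customer_id, plan, monthly_charge FROM customers ORDER BY customer_id"
    )
]
conn.close()

print(len(rows), "rows loaded")
print(rows[2])  # the missing charge arrives as None and is dealt with during cleaning
```

Selecting only the needed columns also helps with the privacy and access constraints listed above, since less sensitive data leaves the source system.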

---

Stage 3: Data Cleaning & Preparation

  • Goal: Transform raw data into a clean, usable format.
  • This stage is often cited as consuming 60-80% of a data scientist's time.

Common Tasks:

| Task | Description | Example |
|---|---|---|
| Handling Missing Values | Deciding what to do with empty cells | Fill with mean, median, or drop the row |
| Removing Duplicates | Identifying and deleting repeated records | Two entries for the same customer |
| Fixing Data Types | Ensuring columns have correct types | Converting "123" (string) to 123 (integer) |
| Outlier Treatment | Dealing with extreme values | An age of 999 is clearly an error |
| Normalization/Scaling | Putting features on a comparable scale | Scaling salary (in thousands) and age (in years) |
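
The five tasks in the table can be sketched in plain Python as follows. The records, the plausible age range of 0 to 120, and the choice of median filling and min-max scaling are all assumptions made for illustration. In practice a library such as pandas is the usual tool for these steps.

```python
from statistics import median

# Hypothetical raw records: a duplicate, string-typed numbers,
# a missing value, and an impossible age.
raw = [
    {"id": 1, "age": "34", "salary": 52000},
    {"id": 2, "age": "29", "salary": None},
    {"id": 2, "age": "29", "salary": None},    # duplicate of id 2
    {"id": 3, "age": "999", "salary": 61000},  # outlier age
    {"id": 4, "age": "41", "salary": 75000},
]

# 1. Remove duplicates, keeping the first record seen for each id.
seen, records = set(), []
for row in raw:
    if row["id"] not in seen:
        seen.add(row["id"])
        records.append(dict(row))

# 2. Fix data types: "34" (string) becomes 34 (integer).
for row in records:
    row["age"] = int(row["age"])

# 3. Treat outliers: ages outside an assumed plausible range become missing.
for row in records:
    if not 0 <= row["age"] <= 120:
        row["age"] = None

# 4. Handle missing values by filling with the column median.
for column in ("age", "salary"):
    fill = median(r[column] for r in records if r[column] is not None)
    for row in records:
        if row[column] is None:
            row[column] = fill

# 5. Min-max scale each feature to [0, 1] so age and salary are comparable.
#    (Assumes each column has at least two distinct values.)
for column in ("age", "salary"):
    lo = min(r[column] for r in records)
    hi = max(r[column] for r in records)
    for row in records:
        row[column + "_scaled"] = (row[column] - lo) / (hi - lo)

for row in records:
    print(row)
```

The order matters here: types are fixed before the outlier check, and outliers are marked as missing before the median is computed, so the value 999 never distorts the fill value.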

---

Stage 4: Exploratory Data Analysis (EDA)

  • Goal: Understand the data through visualization and summary statistics.
  • This is the "detective work" phase where you look for patterns, anomalies, and relationships.

Key Techniques:

  • Descriptive Statistics: Mean, Median, Mode, Standard Deviation.
  • Correlation Analysis: Identifying relationships between features (e.g., Does income correlate with spending?).
  • Data Visualization: Histograms, Box plots, Scatter plots, Heatmaps.
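
A small sketch of these techniques using only the standard library is shown below. The income and spending figures are hypothetical, the `pearson` helper is written out by hand for clarity, and the text histogram is a stand-in for a proper plotting library.

```python
from statistics import mean, median, mode, stdev

# Hypothetical monthly income and spending (in thousands) for eight customers.
income = [30, 35, 35, 50, 60, 72, 85, 90]
spending = [22, 25, 27, 33, 41, 45, 52, 60]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Descriptive statistics for one feature.
summary = {
    "mean": mean(income),
    "median": median(income),
    "mode": mode(income),
    "stdev": stdev(income),  # sample standard deviation
}

# Correlation analysis: does income move together with spending?
r = pearson(income, spending)

print(summary)
print(f"income vs spending correlation: {r:.3f}")

# A crude text histogram gives a quick look at the income distribution.
for low in range(30, 100, 20):
    count = sum(low <= x < low + 20 for x in income)
    print(f"{low:>3}-{low + 19:<3} {'#' * count}")
```

A correlation close to 1 suggests the two features rise together, but it does not show that one causes the other.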

---

Stage 5: Feature Engineering & Modeling

  • Goal: Select and transform the best features, then train predictive models.

Feature Engineering:

  • Selecting the most relevant features (Feature Selection).
  • Creating new features from existing ones (e.g., deriving "Age" from "Date of Birth").
  • Encoding categorical variables (e.g., converting "Male/Female" into 0/1).
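
The two examples above, deriving "Age" from "Date of Birth" and encoding "Male/Female" as 0/1, might look like this in code. The records and the fixed reference date are hypothetical. The one-hot encoding of a `plan` column is an extra illustration that is not taken from the list above.

```python
from datetime import date


def age_in_years(date_of_birth: date, as_of: date) -> int:
    """Whole years between date_of_birth and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


# Hypothetical records; the reference date is fixed so the result is reproducible.
as_of = date(2024, 1, 1)
customers = [
    {"dob": date(1990, 6, 15), "gender": "Female", "plan": "prepaid"},
    {"dob": date(2000, 1, 1), "gender": "Male", "plan": "postpaid"},
]

gender_codes = {"Male": 0, "Female": 1}  # binary encoding
plans = ["prepaid", "postpaid"]          # categories for one-hot encoding

features = []
for c in customers:
    row = {
        "age": age_in_years(c["dob"], as_of),   # new feature derived from dob
        "gender": gender_codes[c["gender"]],
    }
    for p in plans:
        row[f"plan_{p}"] = 1 if c["plan"] == p else 0
    features.append(row)

print(features)
```

One-hot encoding is usually preferred over a single integer code when a category has more than two values, because it avoids implying an order between the categories.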

Model Training:

  • Splitting data into a training set (typically 70-80%) and a test set (20-30%).
  • Choosing appropriate algorithms (Linear Regression, Decision Trees, Neural Networks).
  • Training the model on the training set and evaluating on the test set.
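
The split, train and evaluate steps can be sketched without any external library. The example below assumes an 80/20 split and uses simple linear regression fitted by ordinary least squares. The data is synthetic, generated from a known line plus noise, so it is possible to see whether the model recovers the underlying relationship. Projects typically rely on a library such as scikit-learn for this.

```python
import random

# Hypothetical data: y is roughly 3*x + 5 with some noise.
rng = random.Random(42)
data = [(x, 3 * x + 5 + rng.uniform(-1, 1)) for x in range(50)]

# Shuffle, then hold out 20% as the test set.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Train on the training set only; the test set stays unseen until evaluation.
slope, intercept = fit_line(train)
test_mse = sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)

print(f"train={len(train)} test={len(test)}")
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.3f}")
```

Evaluating on data the model has never seen is what makes the test error an honest estimate of how the model will behave on new records.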

---

Stage 6: Model Evaluation

  • Goal: Measure how well the model performs and whether it meets the business success criteria.

Common Evaluation Metrics:

| Metric | Used For | Description |
|---|---|---|
| Accuracy | Classification | % of correct predictions |
| Precision | Classification | Of predicted positives, how many were actually positive? |
| Recall | Classification | Of actual positives, how many did we catch? |
| F1-Score | Classification | Harmonic mean of Precision and Recall |
| RMSE | Regression | Square root of the average squared prediction error; penalizes large errors more heavily |
| R-Squared | Regression | Share of variance explained by the model (usually 0 to 1; negative if the model is worse than predicting the mean) |
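
Each metric in the table reduces to a short formula. The sketch below computes them from scratch for binary labels and for numeric predictions. The function names and the sample labels are hypothetical, and libraries such as scikit-learn provide tested versions of the same metrics.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """RMSE and R-squared for numeric predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return {"rmse": sqrt(ss_res / n), "r2": 1 - ss_res / ss_tot}


# Hypothetical labels: 1 = churned, 0 = stayed.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
clf = classification_metrics(actual, predicted)
print(clf)

# Hypothetical regression targets and predictions.
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.0])
print(reg)
```

Accuracy alone can mislead when one class is rare, as with churn or fraud, which is why precision and recall are usually reported alongside it.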

---

Stage 7: Deployment & Communication

  • Goal: Put the model into production and communicate results to stakeholders.

Deployment Options:

  • Embedding in a web application or mobile app.
  • Running as an automated batch process (e.g., daily fraud detection).
  • Creating an API endpoint for real-time predictions.
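
For the API option, the core of a prediction endpoint is a function that turns a request body into a response. The sketch below shows only that core and leaves out the web framework that would normally carry the HTTP traffic. The model weights, feature names, threshold and function names are all hypothetical.

```python
import json
from typing import Tuple

# Hypothetical model parameters, as if loaded from a saved training run.
MODEL = {"intercept": -2.0, "weights": {"support_calls": 0.8, "months_active": -0.05}}
THRESHOLD = 0.0  # scores above this are flagged as likely churn


def predict(features: dict) -> dict:
    """Score one customer with a linear model and apply the threshold."""
    score = MODEL["intercept"] + sum(
        weight * float(features[name]) for name, weight in MODEL["weights"].items()
    )
    return {"score": round(score, 4), "will_churn": score > THRESHOLD}


def handle_request(body: str) -> Tuple[int, str]:
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        features = json.loads(body)
        return 200, json.dumps(predict(features))
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed JSON, missing features, or non-numeric values.
        return 400, json.dumps({"error": f"bad request: {exc!r}"})


# A valid request with both expected features.
status, response = handle_request('{"support_calls": 4, "months_active": 6}')
print(status, response)

# A request with a missing feature is rejected rather than scored.
status_bad, response_bad = handle_request('{"support_calls": 4}')
print(status_bad, response_bad)
```

The same `predict` function could be reused by a batch job that loops over a day's records, so the real-time and batch options share one scoring path.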

Communication:

  • Presenting findings via dashboards (Tableau, Power BI).
  • Writing detailed reports for non-technical stakeholders.
  • Using storytelling to make data insights digestible.

Summary Table: The Lifecycle at a Glance

| Stage | Key Question | Primary Output |
|---|---|---|
| Business Understanding | "What problem are we solving?" | Problem Statement |
| Data Acquisition | "Where do we get the data?" | Raw Dataset |
| Data Cleaning & Preparation | "Is the data usable?" | Clean Dataset |
| EDA | "What does the data tell us?" | Insights & Visualizations |
| Feature Engineering & Modeling | "What features matter?" | Engineered Features, Trained Model |
| Model Evaluation | "How accurate is our prediction?" | Performance Metrics |
| Deployment & Communication | "How do we use this in the real world?" | Live System / Report |