Scikit-learn: Machine Learning in Python
Definition: Scikit-learn (imported as sklearn) is one of the most widely used machine learning libraries in Python. It provides simple and efficient tools for data mining, data analysis, and machine learning, including classification, regression, clustering, dimensionality reduction, and model evaluation.
import sklearn
---
Why Scikit-learn?
| Feature | Benefit |
|---|---|
| Simple API | Consistent interface: fit(), predict(), transform() |
| Wide Coverage | Covers most classical ML algorithms (deep learning is largely out of scope) |
| Well-Documented | Excellent documentation with examples |
| Integration | Works with NumPy, Pandas, Matplotlib |
| Production Ready | Used in industry and academia |
---
The Scikit-learn Workflow
1. Import Data → 2. Preprocess → 3. Split (Train/Test)
→ 4. Choose Model → 5. Train (fit) → 6. Predict
→ 7. Evaluate → 8. Tune Hyperparameters
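The eight steps above can be sketched end to end as follows. This is a minimal illustration, not a recommended recipe: the Iris dataset, the KNN model, and the candidate values for `n_neighbors` are assumptions chosen only to keep the example short. Scaling is placed inside a `Pipeline` so that it is fitted on training data only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Import data (a small built-in dataset, used here only for illustration)
X, y = load_iris(return_X_y=True)

# 3. Split first so the test set stays unseen by every later step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2 + 4. Preprocessing and model chained in one pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# 8. Tune hyperparameters with cross-validation on the training set only
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7]}, cv=5)

# 5. Train (fit)
search.fit(X_train, y_train)

# 6. Predict and 7. Evaluate on the held-out test set
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(search.best_params_, f"test accuracy: {test_accuracy:.2f}")
```

In practice the tuning step runs inside the training step, which is why `GridSearchCV` wraps the pipeline before `fit()` is called.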
---
Categories of ML Algorithms in Scikit-learn
| Category | Type | Algorithms | Use Case |
|---|---|---|---|
| Supervised - Classification | Predicts categories | Logistic Regression, Decision Tree, Random Forest, SVM, KNN | Spam detection, disease diagnosis |
| Supervised - Regression | Predicts continuous values | Linear Regression, Ridge, Lasso, Decision Tree Regressor | House price prediction, sales forecast |
| Unsupervised - Clustering | Groups similar data | K-Means, DBSCAN, Hierarchical | Customer segmentation |
| Unsupervised - Dimensionality Reduction | Reduces features | PCA, t-SNE, LDA (Linear Discriminant Analysis is supervised: it needs class labels) | Visualization, feature reduction |
---
Data Preprocessing
Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
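The call above assumes that `X` and `y` already exist. A self-contained sketch with hypothetical toy data is shown here; it also passes `stratify=y`, an optional argument that keeps the class ratio the same in both subsets, which tends to matter for imbalanced data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, 20% positive class
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)    # (80, 2) (20, 2)
print(y_train.mean(), y_test.mean())  # 0.2 0.2 (class ratio preserved)
```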
Feature Scaling
| Method | Class | When to Use |
|---|---|---|
| Standardization | StandardScaler() | When features are roughly Gaussian, or the model expects zero-mean, unit-variance inputs (e.g. SVM, logistic regression) |
| Normalization | MinMaxScaler() | When you need values in [0, 1] |
| Robust Scaling | RobustScaler() | When data has outliers |
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on the training set only, then reuse the same statistics on the test set
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
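To see how the three scalers differ, the sketch below applies each one to a hypothetical single feature containing one extreme outlier. The data values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one extreme outlier (100)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(X)      # rescaled to [0, 1]
robust = RobustScaler().fit_transform(X)      # centered on median, scaled by IQR

print("standard:", standard.ravel().round(2))
print("minmax:  ", minmax.ravel().round(2))
print("robust:  ", robust.ravel().round(2))
```

With min-max scaling the outlier squeezes the four ordinary values into roughly the bottom 3% of the range. Robust scaling uses the median (3) and interquartile range (2), so the ordinary values map to -1, -0.5, 0, and 0.5 while the outlier lands at 48.5.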
Encoding Categorical Data (Categorical → Numeric)
| Method | Class | Example |
|---|---|---|
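| Label Encoding | LabelEncoder() | Female → 0, Male → 1 (classes are sorted alphabetically; intended for target labels) |
| One-Hot Encoding | OneHotEncoder() | Blue → [1,0,0], Green → [0,1,0], Red → [0,0,1] (columns follow sorted category order) |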
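A short sketch of both encoders on hypothetical values follows. Note that both sort the categories before assigning codes, so the integer or column a category receives depends on alphabetical order rather than order of appearance.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# LabelEncoder is intended for target labels (y); classes are sorted first
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(["Male", "Female", "Female", "Male"])
print(label_encoder.classes_)  # ['Female' 'Male']
print(y_encoded)               # [1 0 0 1]

# OneHotEncoder expects a 2D feature array and creates one column per category
colors = np.array([["Red"], ["Green"], ["Blue"]])
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
colors_encoded = onehot_encoder.fit_transform(colors).toarray()
print(onehot_encoder.categories_[0])  # ['Blue' 'Green' 'Red']
print(colors_encoded)                 # Red is the last column, Blue the first
```

Setting `handle_unknown="ignore"` is an optional choice made here so that categories unseen during `fit()` encode as all zeros instead of raising an error.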
---
The Universal Scikit-learn API
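Every supervised Scikit-learn estimator follows the same pattern (unsupervised models call fit(X) without y, and transformers use transform() instead of predict()):
from sklearn.linear_model import LogisticRegression  # any estimator class fits this pattern
# 1. Create model (hyperparameters go in the constructor)
model = LogisticRegression(max_iter=1000)
# 2. Train
model.fit(X_train, y_train)
# 3. Predict
y_pred = model.predict(X_test)
# 4. Evaluate (mean accuracy for classifiers, R² for regressors)
score = model.score(X_test, y_test)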
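The same create/fit pattern carries over to unsupervised estimators, with two differences: `fit()` takes only `X`, and transformers expose `transform()` rather than `predict()`. The sketch below uses synthetic data from `make_blobs`; the dataset and the parameter values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Hypothetical synthetic data: 150 points in 5 dimensions, 3 clusters
X, _ = make_blobs(n_samples=150, n_features=5, centers=3, random_state=42)

# Clusterer: fit() takes X only; predict() assigns each point to a cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
cluster_labels = kmeans.predict(X)

# Transformer: fit() learns the projection, transform() applies it
pca = PCA(n_components=2)
X_2d = pca.fit(X).transform(X)

print(cluster_labels[:10], X_2d.shape)
```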
---
Key Algorithms at a Glance
| Algorithm | Type | Pros | Cons |
|---|---|---|---|
| Linear Regression | Regression | Simple, interpretable | Assumes linearity |
| Logistic Regression | Classification | Fast, good baseline | Linear decision boundary |
| Decision Tree | Both | Easy to visualize | Overfits easily |
| Random Forest | Both | Reduces overfitting compared with a single tree, versatile | Slower, less interpretable |
| SVM | Both | Effective in high dimensions | Slow on large datasets |
| KNN | Both | No real training phase (fit() only stores the data) | Slow prediction, memory heavy |
| K-Means | Clustering | Simple, scalable | Need to specify K |
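The "overfits easily" entry for decision trees can be observed directly by comparing train and test accuracy. The sketch below uses noisy synthetic data from `make_classification`; the dataset parameters are assumptions, and exact scores will vary with data and library version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy synthetic data (flip_y randomly reassigns some labels)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Same fit/score calls for both models; store (train, test) accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```

An unrestricted tree memorizes the training set, noise included, so its train accuracy reaches 1.0 while its test accuracy is lower. A random forest typically narrows that gap by averaging many trees.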
---
Model Evaluation
Classification Metrics
| Metric | Description | When to Use |
|---|---|---|
| Accuracy | Correct predictions / Total | Balanced classes |
| Precision | TP / (TP + FP) | Minimize false positives (spam detection) |
| Recall | TP / (TP + FN) | Minimize false negatives (disease detection) |
| F1 Score | Harmonic mean of Precision & Recall | Imbalanced classes |
| Confusion Matrix | TP, TN, FP, FN table | Detailed error analysis |
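The formulas in the table can be checked on a small hand-made example. The labels below are hypothetical and chosen so that the counts are easy to verify by eye: 3 true positives, 1 false negative, 2 false positives, 4 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.70 precision=0.60 recall=0.75 f1=0.67
```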
Regression Metrics
| Metric | Description |
|---|---|
| MAE (Mean Absolute Error) | Average absolute difference |
| MSE (Mean Squared Error) | Average squared difference |
| RMSE (Root MSE) | Square root of MSE (same unit as target) |
| R² Score | Proportion of variance explained (at most 1; 0 matches always predicting the mean, and it can be negative for worse models) |
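A worked example on hypothetical values shows how the four metrics relate. The second prediction vector is deliberately poor, to illustrate that R² drops below zero when a model does worse than always predicting the mean of the targets.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions; errors are -1, 0, 1, 2
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 2) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 4) / 4
rmse = np.sqrt(mse)                        # back in the target's units
r2 = r2_score(y_true, y_pred)              # 1 - 6 / 20

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
# MAE=1.00 MSE=1.50 RMSE=1.22 R2=0.70

# Predictions in reverse order: worse than predicting the mean everywhere
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = r2_score(y_true, y_bad)           # 1 - 80 / 20
print(f"R2 (bad model)={r2_bad:.2f}")
# R2 (bad model)=-3.00
```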
---
Cross-Validation
Instead of a single train-test split, cross-validation uses multiple splits for more reliable evaluation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Mean Accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
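The snippet above assumes `model`, `X`, and `y` are already defined. A self-contained sketch is given below, using the built-in breast cancer dataset and logistic regression as illustrative assumptions. Wrapping the scaler and the model in a pipeline means the scaler is re-fitted on the training portion of each fold, which avoids leaking statistics from the validation fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fitted on the training portion of every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cv=5 produces five accuracy values, one per held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean Accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```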
Summary
- Scikit-learn provides a consistent API for all ML algorithms: fit(), predict(), score().
- Preprocessing (scaling, encoding, splitting) is critical before training.
- Classification, regression, clustering, and dimensionality reduction are all supported.
- Model evaluation metrics (accuracy, precision, recall, F1, R²) guide model selection.
- Cross-validation provides more reliable performance estimates than a single split.