Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

12. Multilayer Neural Networks & Backpropagation

Lesson 15 of 22 in the free Machine Learning II notes on Siksha Sarovar, written by Rohit Jangra.

Multilayer Perceptrons (MLPs) overcome the linearity limitation of single perceptrons by stacking layers of neurons with non-linear activations. The critical breakthrough enabling training was the backpropagation algorithm, popularized by Rumelhart, Hinton, and Williams (1986), which efficiently computes gradients via the chain rule.

Architecture

An MLP with L layers computes:

  • Layer 0: a^(0) = x (input)
  • Layer l (for l = 1, ..., L): z^(l) = W^(l) a^(l-1) + b^(l), then a^(l) = f(z^(l))
  • Output: y_hat = a^(L)

Here W^(l) is the weight matrix of layer l, b^(l) is its bias vector, and f is the activation function, applied elementwise.
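The layer equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the 2-3-1 layer sizes, the weight values, and the choice of sigmoid for f are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Return the list of activations [a^(0), ..., a^(L)] for input vector x."""
    activations = [list(x)]  # a^(0) = x
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # z^(l) = W^(l) a^(l-1) + b^(l); each row of W belongs to one neuron
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        activations.append([f(z_i) for z_i in z])  # a^(l) = f(z^(l))
    return activations

# Hypothetical 2-3-1 network with hand-picked weights (illustrative values only)
weights = [
    [[0.5, -0.5], [0.3, 0.8], [-0.2, 0.1]],  # W^(1): 3 neurons, 2 inputs each
    [[1.0, -1.0, 0.5]],                      # W^(2): 1 neuron, 3 inputs
]
biases = [[0.0, 0.1, -0.1], [0.2]]

acts = forward([1.0, 2.0], weights, biases)
y_hat = acts[-1]  # output of the last layer, a^(L)
print(y_hat)
```

Storing every activation, rather than only the output, is what makes the backward pass possible later.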

Backpropagation

The backpropagation algorithm computes gradients by applying the chain rule backward through the network:

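  1. Forward pass: compute all pre-activations z^(l) and activations a^(l), and store them
  2. Output error: delta^(L) = (y_hat - y) ⊙ f'(z^(L)), assuming the squared-error loss (1/2) ||y_hat - y||^2
  3. Backpropagate: delta^(l) = ((W^(l+1))^T delta^(l+1)) ⊙ f'(z^(l)) for l = L-1, ..., 1
  4. Gradients: dL/dW^(l) = delta^(l) (a^(l-1))^T, dL/db^(l) = delta^(l)
  5. Update: W^(l) = W^(l) - eta * dL/dW^(l), b^(l) = b^(l) - eta * dL/db^(l)

Here ⊙ is the elementwise product, eta is the learning rate, and the L in dL/dW denotes the loss (not the number of layers).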
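The five steps can be sketched in plain Python and checked against a finite-difference estimate. This is a minimal sketch under stated assumptions: sigmoid activations in every layer, the squared-error loss (1/2) ||y_hat - y||^2, a hypothetical 2-2-1 network, and illustrative values for the weights and the learning rate eta.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, Ws, bs):
    """Step 1: return activations a^(0..L); sigmoid is used in every layer."""
    acts = [list(x)]
    for W, b in zip(Ws, bs):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bi for row, bi in zip(W, b)]
        acts.append([sigmoid(zi) for zi in z])
    return acts

def loss(x, y, Ws, bs):
    y_hat = forward(x, Ws, bs)[-1]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))

def backprop(x, y, Ws, bs):
    """Return (dWs, dbs), the gradients of the squared-error loss."""
    acts = forward(x, Ws, bs)
    # For sigmoid, f'(z) = a * (1 - a), so the stored activations suffice.
    delta = [(a - t) * a * (1 - a) for a, t in zip(acts[-1], y)]  # step 2
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        a_prev = acts[l]
        dWs[l] = [[d * a for a in a_prev] for d in delta]  # step 4: outer product
        dbs[l] = list(delta)
        if l > 0:
            # step 3: (W^T delta) times f'(z) of the previous layer, elementwise
            delta = [sum(Ws[l][i][j] * delta[i] for i in range(len(delta)))
                     * a_prev[j] * (1 - a_prev[j]) for j in range(len(a_prev))]
    return dWs, dbs

# Hypothetical 2-2-1 network and a single training example
Ws = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
bs = [[0.0, 0.1], [-0.1]]
x, y = [1.0, 0.5], [1.0]
dWs, dbs = backprop(x, y, Ws, bs)

# Finite-difference check on one weight
eps, w0 = 1e-6, Ws[0][0][0]
Ws[0][0][0] = w0 + eps
loss_plus = loss(x, y, Ws, bs)
Ws[0][0][0] = w0 - eps
loss_minus = loss(x, y, Ws, bs)
Ws[0][0][0] = w0
numeric = (loss_plus - loss_minus) / (2 * eps)

# Step 5: one gradient-descent update
loss_before = loss(x, y, Ws, bs)
eta = 0.1  # learning rate (illustrative value)
Ws = [[[w - eta * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, dW)]
      for W, dW in zip(Ws, dWs)]
bs = [[b - eta * g for b, g in zip(b_l, db_l)] for b_l, db_l in zip(bs, dbs)]
loss_after = loss(x, y, Ws, bs)
```

Comparing `numeric` with `dWs[0][0][0]` is a common sanity check: if the two disagree beyond rounding error, the backward pass usually has an indexing or sign mistake.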

Loss Functions

Task                  | Loss Function             | Formula
Regression            | MSE                       | (1/n) * sum_i (y_i - y_hat_i)^2
Binary classification | Binary Cross-Entropy      | -[y * log(y_hat) + (1 - y) * log(1 - y_hat)]
Multi-class           | Categorical Cross-Entropy | -sum_i y_i * log(y_hat_i)
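The three losses translate directly into code. The sketch below uses plain Python. The `eps` clipping that guards against log(0) is an assumption added for numerical safety and is not part of the formulas above.

```python
import math

def mse(y, y_hat):
    """Mean squared error over n targets."""
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / len(y)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one example; y is 0 or 1, y_hat is a predicted probability."""
    p = min(max(y_hat, eps), 1 - eps)  # clip so log() never sees 0 (assumption)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one example; y is one-hot, y_hat is a probability vector."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y, y_hat))

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(binary_cross_entropy(1, 0.5))
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

With a one-hot target, categorical cross-entropy reduces to the negative log of the probability assigned to the correct class.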

Vanishing Gradient Problem

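In deep networks with sigmoid or tanh activations, gradients can shrink exponentially as they pass backward through the layers, so the earliest layers learn very slowly. Roughly:

  dL/dW^(1) ∝ delta^(L) * product over l of (W^(l) f'(z^(l))) -> 0 as the depth L increases

The product shrinks whenever each factor has magnitude below 1. This is common because the sigmoid's derivative never exceeds 0.25, and the derivatives of both sigmoid and tanh approach 0 when a unit saturates.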

Solutions: ReLU activations, batch normalization, residual connections (skip connections).
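A small numeric sketch makes the shrinkage concrete. It assumes a deliberately simplified chain of single-neuron layers that all share the same weight `w` and the same pre-activation `z`. Real networks vary both, so treat the numbers as an illustration of the trend only.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def gradient_scale(depth, f_prime, w=1.0, z=0.0):
    """Product of w * f'(z) over `depth` layers of a single-neuron chain."""
    scale = 1.0
    for _ in range(depth):
        scale *= w * f_prime(z)
    return scale

# Sigmoid at z = 0 has its largest possible derivative, 0.25
sigmoid_scale = gradient_scale(10, sigmoid_prime)
# ReLU with a positive pre-activation has derivative 1
relu_scale = gradient_scale(10, relu_prime, z=1.0)

print(sigmoid_scale)  # 0.25 ** 10, just under one millionth
print(relu_scale)
```

Even in the sigmoid's best case, ten layers scale the gradient by 0.25^10, while an active ReLU path leaves it unchanged.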

Common Pitfalls

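  • Symmetry breaking: initialize weights randomly (not all zeros or all the same value); otherwise every neuron in a layer receives the same gradient, and the neurons never learn different features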
  • Learning rate too high: divergence; too low: very slow convergence
  • Validation loss stops improving (or rises) while training loss keeps decreasing: classic overfitting
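The symmetry pitfall is easy to reproduce. The sketch below assumes a bias-free 2-2-1 sigmoid network trained with squared-error loss on a single made-up example. The "random" starting weights are fixed by hand so the run is repeatable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(W1, W2, x, y, eta=0.5):
    """One gradient step on a bias-free 2-2-1 sigmoid network."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y_hat = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    # Hidden deltas use the weights from before the update
    delta_h = [W2[j] * delta_out * h[j] * (1 - h[j]) for j in range(len(h))]
    new_W2 = [W2[j] - eta * delta_out * h[j] for j in range(len(h))]
    new_W1 = [[W1[j][i] - eta * delta_h[j] * x[i] for i in range(len(x))]
              for j in range(len(h))]
    return new_W1, new_W2

def train(W1, W2, steps=100):
    for _ in range(steps):
        W1, W2 = train_step(W1, W2, x=[1.0, 2.0], y=1.0)
    return W1, W2

# Constant initialization: both hidden neurons start out identical
const_W1, _ = train([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
# Distinct starting weights (hand-picked stand-ins for a random draw)
rand_W1, _ = train([[0.3, -0.1], [-0.4, 0.2]], [0.6, -0.5])

print(const_W1[0] == const_W1[1])  # hidden neurons are still identical
print(rand_W1[0] == rand_W1[1])    # hidden neurons differ
```

With constant initialization the two hidden neurons receive identical updates at every step, so the layer behaves like a single neuron no matter how long it trains.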

Exam-Ready Summary

  • Backpropagation: chain rule applied backward through the computation graph
  • Forward pass: compute activations; backward pass: compute gradients
  • Vanishing gradient: deep networks with saturating activations learn slowly
  • ReLU mitigates the vanishing gradient but can cause dying ReLU, where units get stuck outputting 0 (Leaky ReLU is a common fix)
  • Early stopping, dropout, and L2 regularization are standard overfitting remedies
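Early stopping, the first remedy in the last bullet, can be sketched as a simple rule over recorded validation losses. The function name, the `patience` parameter, and the loss values are hypothetical and chosen only for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Made-up validation curve: improves until epoch 2, then creeps upward
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)  # 5
```

In practice the weights from the best epoch (epoch 2 here) are usually the ones kept, not those from the epoch where training stopped.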