2.1 The Idea: An Assembly Line for Instructions
Without pipelining, instruction i+1 waits for instruction i to finish completely. Pipelining splits execution into k stages separated by latch registers and overlaps k instructions at once — like a laundry with separate washer, dryer and folding table. It improves throughput (instructions completed per second), not the latency of any single instruction — the most repeated conceptual trap.
The classic 5-stage RISC pipeline:
| Stage | Name | Work |
|---|---|---|
| IF | Instruction Fetch | read instruction, PC += 4 |
| ID | Instruction Decode | decode + read registers |
| EX | Execute | ALU operation / address calc |
| MEM | Memory | load/store data access |
| WB | Write Back | write result to register file |
Cycle: 1 2 3 4 5 6 7 8
Instr 1: IF ID EX MEM WB
Instr 2: IF ID EX MEM WB
Instr 3: IF ID EX MEM WB
Instr 4: IF ID EX MEM WB
After the pipe fills, one instruction completes every cycle.
2.2 Speedup Arithmetic (Guaranteed Numerical)
k stages, clock period tp, n tasks: pipelined time = (k + n − 1) · tp (k cycles to fill, then n − 1 more completions). Non-pipelined time = n · k · tp (or n · tn).
S = n.k.tp / (k + n - 1).tp = n.k / (k + n - 1)
Example: k = 4, tp = 20 ns, n = 100
Pipelined = (4 + 99) x 20 = 103 x 20 = 2060 ns
Non-pipelined = 100 x 4 x 20 = 8000 ns
Speedup S = 8000 / 2060 = 3.88
As n → ∞, S → k (here 4): maximum speedup = number of stages. Efficiency = S/k = 3.88/4 = 97%; throughput = n / time = 100 / 2060 ns ≈ 48.5 million instructions/s. Why the ideal is never reached: fill/drain overhead, latch delay added to tp, unequal stage lengths (clock is set by the slowest stage), and hazards below.
2.3 The Three Hazards
1. Structural hazard — hardware conflict. Two stages need the same resource, e.g., IF and MEM both accessing a single memory port. Cures: separate instruction/data caches (Harvard style), duplicated units, or a stall.
2. Data hazard — value not ready (RAW).
ADD R1, R2, R3 ; R1 written in WB (cycle 5)
SUB R4, R1, R5 ; R1 needed in ID/EX (cycle 3!)
Cures, in order of elegance:
- Forwarding (bypassing): route the ALU output of ADD straight into SUB's ALU input — hazard vanishes with zero stalls in most cases.
- Load-use exception: a value loaded from memory is ready only after MEM, so even with forwarding a dependent next instruction stalls one cycle — beloved trap.
- Stalls (bubbles): hardware interlock inserts NOPs until the value is ready.
- Compiler scheduling: reorder independent instructions into the gap.
3. Control hazard — where next? A branch decides its target late, so the instructions fetched behind it may be wrong. Cures:
- Stall until the branch resolves (simple, slow).
- Delayed branch: always execute the instruction in the "delay slot" (classic MIPS; compiler fills the slot).
- Branch prediction: 1-bit predictor mispredicts twice per loop (entry and exit); a 2-bit saturating counter requires two consecutive misses to flip, mispredicting once per loop — explain-the-difference is a standard 5-marker.
Penalty arithmetic: base CPI = 1, branches are 20% of instructions, misprediction penalty 3 cycles, all mispredicted → effective CPI = 1 + 0.2 × 3 = 1.6; with a predictor that is 90% accurate, CPI = 1 + 0.2 × 0.1 × 3 = 1.06.
🎯 Exam Focus
- Explain the five-stage instruction pipeline with a space–time diagram for 6 instructions.
- A 4-segment pipeline with 25 ns clock processes 200 tasks: compute pipelined time, non-pipelined time, speedup, efficiency and throughput.
- Define structural, data and control hazards with one example and one remedy each.
- What is operand forwarding? Show how it removes the RAW stall between ADD R1,R2,R3 and SUB R4,R1,R5, and why a load-use case still stalls once.
- Compare 1-bit and 2-bit branch prediction for a loop that runs 100 iterations.
- Why can pipelining never achieve speedup equal to k in practice? Give three reasons.