3.1 Why Parallelism Became Mandatory
Until ~2005, performance came from higher clock rates — until power (∝ frequency × voltage²) hit the power wall, and deeper pipelines stopped paying (limited ILP). The industry pivoted: instead of one faster core, put many cores on one die. Parallel processing is thus not an optimisation but the only remaining scaling path — the framing examiners reward.
3.2 Flynn's Taxonomy
Flynn (1966) classified architectures by the multiplicity of instruction streams and data streams:
| Class | Instruction Streams | Data Streams | Description | Examples |
|---|---|---|---|---|
| SISD | 1 | 1 | classic uniprocessor (may still pipeline) | 8086, early ARM cores |
| SIMD | 1 | many | one instruction applied to many data elements in lockstep | GPUs, vector units (SSE/AVX/NEON), array processors |
| MISD | many | 1 | many instructions on one data stream — rare/degenerate | systolic arrays; fault-tolerant voting (Space Shuttle) is the textbook example |
| MIMD | many | many | independent processors on independent data | multicore CPUs, clusters, supercomputers |
SISD: CU -> PU -> data SIMD: CU
/ | \
MIMD: CU1->PU1->data1 PU1 PU2 PU3 (same instruction,
CU2->PU2->data2 | | | different data)
CU3->PU3->data3 d1 d2 d3
Why SIMD is so power-efficient: one fetch/decode is amortised over N data elements — ideal for graphics, ML and multimedia, which is exactly where GPUs live.
3.3 Multiprocessor vs Multicore vs Multicomputer
| Aspect | Multicore | Multiprocessor (SMP) | Multicomputer (cluster) |
|---|---|---|---|
| Placement | many cores, one chip | many chips, one machine | many machines |
| Memory | shared (common address space) | shared | distributed — message passing |
| Communication | via shared caches/memory | via system bus/interconnect | via network (MPI) |
| Coupling | tightly coupled | tightly coupled | loosely coupled |
| Example | Intel Core i7, Apple M-series | multi-socket servers | Beowulf clusters, cloud data centres |
Shared-memory machines subdivide into UMA (uniform memory access — every core sees equal latency) and NUMA (each socket has local memory; remote memory is slower — OS must place data near its user).
3.4 The Cache Coherence Problem
Private per-core caches — mandatory for speed — create a correctness hazard:
X = 5 in memory.
P1 reads X (caches 5). P2 reads X (caches 5).
P1 writes X = 10.
Write-through: memory = 10, but P2's cache still says 5!
Write-back : memory ALSO still says 5.
P2 reads X -> gets stale 5. Incoherent.
Coherence demands that all processors agree on the order of writes to each location. Two write-propagation strategies: write-invalidate (before writing, broadcast an invalidation so all other copies die; later readers re-fetch — the common choice, less traffic when one core writes repeatedly) vs write-update (broadcast the new value to all sharers — good for tight producer–consumer, wasteful otherwise).
Two enforcement structures:
- Snooping: every cache watches ("snoops") a shared bus; a write appears on the bus and other caches invalidate/update their copy. Simple, but the broadcast bus saturates beyond ~8–16 cores.
- Directory-based: a directory entry per memory block records which cores hold copies; on a write, invalidations go only to listed sharers. Scales to large NUMA machines at the cost of directory storage and latency.
Protocols are named by their line states — MESI: Modified (dirty, exclusive), Exclusive (clean, only copy), Shared (clean, others may hold it), Invalid. One-line-per-state is enough at this level, plus the term false sharing (two cores fighting over different variables that share one cache line).
3.5 The Limit: Amdahl's Law
If fraction f of a program is parallelisable over p processors: S = 1 / ((1 − f) + f/p).
Example: f = 0.9, p = 4 → S = 1 / (0.1 + 0.225) = 1 / 0.325 = 3.08; with p → ∞, S → 1/0.1 = 10. The serial 10% caps speedup at 10× forever — the single most important sanity check in parallel computing, and a guaranteed numerical.
🎯 Exam Focus
- Explain Flynn's taxonomy with a labelled diagram and one real example per class.
- Why is MISD considered rare or degenerate? Which real systems are cited as MISD?
- Differentiate multicore, multiprocessor and multicomputer systems; define UMA vs NUMA.
- Illustrate the cache coherence problem with a two-processor example, and compare write-invalidate with write-update.
- Contrast snooping and directory-based coherence; why does snooping fail to scale?
- A program is 80% parallelisable. Compute the speedup on 4 and on 16 processors, and its upper bound.