Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Privacy Policy | Terms of Service | Contact Siksha Sarovar | About Siksha Sarovar

v4.0.9 · PWA
Siksha Sarovar logo
Siksha Sarovar
Your Learning Universe

Siksha Sarovar is a free e-learning platform for coding courses, BCA university notes and competitive exam preparation. Optional Google sign-in saves your learning progress across devices.

Initializing knowledge base…
Compiling modules 0%

Unit 5: Parallel Processing, Flynn's Taxonomy & Cache Coherence

Lesson 16 of 17 in the free Logical Organization of Computer-II notes on Siksha Sarovar, written by Rohit Jangra.

3.1 Why Parallelism Became Mandatory

Until ~2005, performance came from higher clock rates — until power (∝ frequency × voltage²) hit the power wall, and deeper pipelines stopped paying (limited ILP). The industry pivoted: instead of one faster core, put many cores on one die. Parallel processing is thus not an optimisation but the only remaining scaling path — the framing examiners reward.

3.2 Flynn's Taxonomy

Flynn (1966) classified architectures by the multiplicity of instruction streams and data streams:

ClassInstruction StreamsData StreamsDescriptionExamples
SISD11classic uniprocessor (may still pipeline)8086, early ARM cores
SIMD1manyone instruction applied to many data elements in lockstepGPUs, vector units (SSE/AVX/NEON), array processors
MISDmany1many instructions on one data stream — rare/degeneratesystolic arrays; fault-tolerant voting (Space Shuttle) is the textbook example
MIMDmanymanyindependent processors on independent datamulticore CPUs, clusters, supercomputers
SISD:  CU -> PU -> data            SIMD:      CU
                                          /   |   \
MIMD:  CU1->PU1->data1                 PU1   PU2   PU3   (same instruction,
       CU2->PU2->data2                  |     |     |     different data)
       CU3->PU3->data3                 d1    d2    d3

Why SIMD is so power-efficient: one fetch/decode is amortised over N data elements — ideal for graphics, ML and multimedia, which is exactly where GPUs live.

3.3 Multiprocessor vs Multicore vs Multicomputer

AspectMulticoreMultiprocessor (SMP)Multicomputer (cluster)
Placementmany cores, one chipmany chips, one machinemany machines
Memoryshared (common address space)shareddistributed — message passing
Communicationvia shared caches/memoryvia system bus/interconnectvia network (MPI)
Couplingtightly coupledtightly coupledloosely coupled
ExampleIntel Core i7, Apple M-seriesmulti-socket serversBeowulf clusters, cloud data centres

Shared-memory machines subdivide into UMA (uniform memory access — every core sees equal latency) and NUMA (each socket has local memory; remote memory is slower — OS must place data near its user).

3.4 The Cache Coherence Problem

Private per-core caches — mandatory for speed — create a correctness hazard:

X = 5 in memory.
P1 reads X (caches 5).  P2 reads X (caches 5).
P1 writes X = 10.
   Write-through: memory = 10, but P2's cache still says 5!
   Write-back  : memory ALSO still says 5.
P2 reads X -> gets stale 5.  Incoherent.

Coherence demands that all processors agree on the order of writes to each location. Two write-propagation strategies: write-invalidate (before writing, broadcast an invalidation so all other copies die; later readers re-fetch — the common choice, less traffic when one core writes repeatedly) vs write-update (broadcast the new value to all sharers — good for tight producer–consumer, wasteful otherwise).

Two enforcement structures:

  • Snooping: every cache watches ("snoops") a shared bus; a write appears on the bus and other caches invalidate/update their copy. Simple, but the broadcast bus saturates beyond ~8–16 cores.
  • Directory-based: a directory entry per memory block records which cores hold copies; on a write, invalidations go only to listed sharers. Scales to large NUMA machines at the cost of directory storage and latency.

Protocols are named by their line states — MESI: Modified (dirty, exclusive), Exclusive (clean, only copy), Shared (clean, others may hold it), Invalid. One-line-per-state is enough at this level, plus the term false sharing (two cores fighting over different variables that share one cache line).

3.5 The Limit: Amdahl's Law

If fraction f of a program is parallelisable over p processors: S = 1 / ((1 − f) + f/p).

Example: f = 0.9, p = 4 → S = 1 / (0.1 + 0.225) = 1 / 0.325 = 3.08; with p → ∞, S → 1/0.1 = 10. The serial 10% caps speedup at 10× forever — the single most important sanity check in parallel computing, and a guaranteed numerical.

🎯 Exam Focus

  1. Explain Flynn's taxonomy with a labelled diagram and one real example per class.
  2. Why is MISD considered rare or degenerate? Which real systems are cited as MISD?
  3. Differentiate multicore, multiprocessor and multicomputer systems; define UMA vs NUMA.
  4. Illustrate the cache coherence problem with a two-processor example, and compare write-invalidate with write-update.
  5. Contrast snooping and directory-based coherence; why does snooping fail to scale?
  6. A program is 80% parallelisable. Compute the speedup on 4 and on 16 processors, and its upper bound.