Elements of Parallel Computing
Before writing a single line of parallel code, you must understand the building blocks: concurrency, parallelism, granularity, speedup, and efficiency. These concepts let you reason about whether a problem is parallelizable and how much gain to expect.
Concurrency vs. Parallelism
These two terms are frequently confused:
- Concurrency is the composition of independently-executing processes or tasks. It is a structural property — the program is designed so that multiple things could run at the same time. A single-core CPU can be concurrent (interleaving tasks via context switching).
- Parallelism is the simultaneous execution of multiple computations. It is a runtime property that requires multiple physical processors or cores.
"Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once." — Rob Pike (Go language designer)
A web server handling 1,000 requests concurrently on 4 cores is both concurrent (structured for independent tasks) and parallel (4 tasks literally run at the same moment).
Granularity
Granularity describes the ratio of computation to communication in a parallel task.
| Granularity | Characteristics | Example |
|---|---|---|
| Fine-grained | Many small tasks, frequent communication | GPU pixel shaders |
| Medium-grained | Moderate task size, some sync needed | OpenMP loop parallelism |
| Coarse-grained | Few large tasks, rare communication | MPI processes on HPC cluster |
Choosing the wrong granularity wastes time: fine-grained tasks on a slow network drown in communication overhead; coarse-grained tasks on a shared-memory machine underutilize cores.
Speedup
Speedup measures how much faster a parallel program runs compared to its sequential version:
Speedup(N) = T(1) / T(N)
Where T(1) = sequential execution time, T(N) = parallel execution time on N processors.
- Linear speedup: Speedup(N) = N (ideal, rarely achieved)
- Super-linear speedup: Speedup(N) > N (possible when parallel version benefits from cache effects)
- Sub-linear speedup: Speedup(N) < N (typical, due to overhead and serial sections)
Efficiency
Efficiency normalizes speedup by the number of processors:
Efficiency(N) = Speedup(N) / N
An efficiency of 1.0 (100%) means every processor is perfectly utilized. In practice, efficiencies of 0.7–0.9 are considered good for large-scale systems.
Parallelism Lifecycle
Dependencies: The Enemy of Parallelism
Not all computations can be parallelized. Data dependencies (one task needs the result of another) create sequential bottlenecks. Techniques to reduce dependencies:
- Loop transformations (loop unrolling, tiling)
- Pipelining (overlap stages)
- Speculative execution (guess a result, correct if wrong)
Real-World Example
Apache Spark decomposes a data processing job into a DAG (Directed Acyclic Graph) of tasks. Each stage is a coarse-grained parallel operation (e.g., map over all partitions). Within each executor (JVM process), fine-grained thread-level parallelism may be used. Spark's scheduler analyzes dependencies to determine which tasks can run concurrently — exactly the principles described here.