Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Levels of Parallelism

Lesson 22 of 30 in the free Cloud Computing notes on Siksha Sarovar, written by Rohit Jangra.

Parallelism exists at multiple levels of abstraction in a computing system, from the width of a single data bus up to entire data centers. Understanding each level helps you match your parallelization strategy to the appropriate hardware feature.

The Four Levels

Level 1: Bit-Level Parallelism

The processor's word width determines how many bits are processed per operation. A 64-bit CPU can add two 64-bit integers with a single add instruction; an 8-bit CPU needs eight chained add operations (one add plus seven add-with-carry steps) for the same result.

  • Evolution: 8-bit (Intel 8080) → 16-bit (8086) → 32-bit (80386) → 64-bit (x86-64, ARM64)
  • Implication: Wider words = more data processed per cycle, no programmer effort required
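  • Relevance to cloud: All mainstream cloud VMs are 64-bit. ML accelerators often trade word width for throughput instead: Google's TPUs, for example, multiply bfloat16 values and accumulate the results in 32-bit floating point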
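
To make the word-width argument concrete, the sketch below emulates a 64-bit addition on a narrower datapath by adding one limb at a time and propagating the carry. The function name `add_with_limbs` and its parameters are illustrative, not part of any real instruction set; the code only counts narrow add operations and does not model timing.

```python
def add_with_limbs(a: int, b: int, limb_bits: int = 8, total_bits: int = 64):
    """Add two unsigned total_bits-wide integers using limb_bits-wide adds.

    Returns (sum modulo 2**total_bits, number of narrow add operations).
    Hypothetical helper for illustration only.
    """
    mask = (1 << limb_bits) - 1
    carry = 0
    result = 0
    ops = 0
    for shift in range(0, total_bits, limb_bits):
        # One narrow add: two limbs plus the carry out of the previous limb.
        limb_sum = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (limb_sum & mask) << shift
        carry = limb_sum >> limb_bits
        ops += 1
    return result, ops


x = 0x0123_4567_89AB_CDEF
y = 0x0FED_CBA9_8765_4321

total_8bit, ops_8bit = add_with_limbs(x, y, limb_bits=8)
total_64bit, ops_64bit = add_with_limbs(x, y, limb_bits=64)

assert total_8bit == total_64bit == (x + y) % (1 << 64)
print(ops_8bit, ops_64bit)  # 8 narrow adds versus 1 wide add
```

The same result comes out either way; the wider datapath simply gets there in one operation, which is why the programmer does not need to do anything to benefit.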

Level 2: Instruction-Level Parallelism (ILP)

The CPU executes multiple instructions simultaneously by exploiting independence between them. This is largely automatic (done by the hardware and compiler).

Techniques:

  • Pipelining: Divide instruction execution into stages (fetch, decode, execute, memory access, write-back). While one instruction is in the execute stage, the next is being decoded. A 5-stage pipeline can have up to 5 instructions in flight simultaneously.
  • Superscalar execution: Issue multiple instructions per clock cycle (modern CPUs issue 4–6 per cycle).
  • Out-of-order execution: Reorder instructions at runtime to avoid stalls caused by data dependencies.
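  • Branch prediction: Predict the outcome of a branch and speculatively execute the instructions that follow it, so the pipeline does not stall while the branch is resolved. A wrong prediction forces the speculative work to be discarded.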

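Hardware example: The performance cores of the Intel Core i9-13900K are deeply pipelined, out-of-order superscalar cores that can decode up to 6 instructions per cycle. Exact stage counts are not officially documented, so treat any quoted pipeline depth as an estimate.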
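
The benefit of pipelining can be estimated with a simple cycle count. The sketch below assumes an idealized pipeline in which every stage takes one cycle and there are no stalls, hazards, or mispredictions; the function names are illustrative.

```python
def sequential_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed if each instruction finishes before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then retire one per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def in_flight(cycle: int, num_instructions: int, num_stages: int) -> int:
    """Number of instructions occupying some stage during a 1-based cycle."""
    # Instruction i (0-based) is in the pipeline during cycles i+1 .. i+num_stages.
    first = max(0, cycle - num_stages)
    last = min(num_instructions - 1, cycle - 1)
    return max(0, last - first + 1)


n, k = 100, 5
print(sequential_cycles(n, k))  # 500
print(pipelined_cycles(n, k))   # 104
print(in_flight(5, n, k))       # 5 instructions in flight once the pipeline is full
```

Under these assumptions, 100 instructions on a 5-stage pipeline take 104 cycles instead of 500, a speedup of about 4.8×. Real pipelines fall short of this because dependencies and mispredicted branches introduce stalls.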

Level 3: Data-Level Parallelism (DLP)

Apply one operation to multiple data elements simultaneously using vector registers or GPU warps.

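  • CPU SIMD: Intel AVX-512 registers are 512 bits wide, so one instruction processes 16 single-precision floats or 8 doubles. In the ideal case, a loop of 1,000 scalar float operations shrinks to 63 vector operations (about a 16× speedup); real gains are usually smaller because of memory bandwidth limits and loop overhead.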
  • GPU warps: NVIDIA groups 32 GPU threads into a warp. All 32 execute the same instruction simultaneously (SIMT — Single Instruction Multiple Threads). An A100 has 108 streaming multiprocessors, each running many warps concurrently.
  • Cloud DLP: Apache Spark's RDD partitions distribute data slices across workers — each worker applies the same transformation to its partition (conceptually data-parallel at cluster scale).
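
The 1,000-to-63 figure above follows from processing 16 elements per vector operation. The sketch below models that lane-wise processing in plain Python. It only counts vector operations and does not execute real SIMD instructions; `LANES` and `vector_add` are illustrative names.

```python
LANES = 16  # 512-bit register divided by 32-bit floats


def vector_add(a, b, lanes=LANES):
    """Add two equal-length sequences one lane-wide chunk at a time.

    Returns (elementwise sums, number of vector operations issued).
    Each chunk stands in for one SIMD instruction applied to `lanes` elements.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    vector_ops = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # Same operation applied to every element of the chunk.
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        vector_ops += 1
    return out, vector_ops


a = [float(i) for i in range(1000)]
b = [2.0 * i for i in range(1000)]
sums, ops = vector_add(a, b)
print(ops)  # 63 vector operations instead of 1,000 scalar ones
```

The last chunk holds only 8 elements (1,000 = 62 × 16 + 8), which is why the count rounds up to 63. A Spark partition plays a similar role at cluster scale: the same transformation applied to each slice of the data.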

Level 4: Task-Level Parallelism (TLP)

Independent tasks execute concurrently, potentially on different processors, potentially running entirely different code paths.

Mechanism         | Scope                      | Example
------------------|----------------------------|---------------------------------------
Multi-threading   | One process, shared memory | Java thread pool serving HTTP requests
Multi-processing  | Multiple OS processes      | Python multiprocessing, Nginx workers
Distributed tasks | Multiple machines          | Kubernetes pods, AWS Lambda functions

Task-level parallelism is the primary source of scalability in cloud applications. A microservices architecture with 50 independently deployable services is task-level parallelism at the system design level.
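
As a small illustration of task-level parallelism within one process, the sketch below submits three unrelated tasks to a thread pool from Python's standard `concurrent.futures` module. The task functions (`count_words`, `sum_squares`, `reverse_text`) and `run_tasks` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text: str) -> int:
    return len(text.split())


def sum_squares(n: int) -> int:
    return sum(i * i for i in range(n))


def reverse_text(text: str) -> str:
    return text[::-1]


def run_tasks() -> dict:
    """Run three independent tasks, each following a different code path."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "words": pool.submit(count_words, "task level parallelism in the cloud"),
            "squares": pool.submit(sum_squares, 10),
            "reversed": pool.submit(reverse_text, "cloud"),
        }
        # result() blocks until the corresponding task has finished.
        return {name: future.result() for name, future in futures.items()}


print(run_tasks())  # {'words': 6, 'squares': 285, 'reversed': 'duolc'}
```

In CPython, threads are best suited to I/O-bound tasks such as serving requests. For CPU-bound work, swapping in `ProcessPoolExecutor` from the same module moves the tasks into separate OS processes, which corresponds to the multi-processing row of the table.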

Mapping to Hardware

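Level       | Hardware Feature                    | Programmer Control
------------|-------------------------------------|--------------------------------------------------
Bit         | ALU width                           | None (choose a 64-bit platform)
Instruction | Pipeline, superscalar, out-of-order | Indirect (compiler flags, loop structure)
Data        | SIMD units, GPU                     | Explicit (intrinsics, CUDA) or auto-vectorization
Task        | Cores, nodes                        | Explicit (threads, processes, MPI)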

Key Insight

Maximizing performance requires exploiting all four levels together. A deep learning workload built around a well-optimized kernel (e.g., cuBLAS matrix multiply) touches each one: wide machine words and 64-bit addressing (bit), pipelined GPU instructions (instruction), GPU warps plus AVX on the CPU host (data), and multiple GPU streams and distributed nodes (task).