7.1 Why ripple-carry is the wrong adder for fast circuits
In an N-bit ripple-carry adder, Cout of stage i must propagate through every stage from 0 to i. So the worst-case carry path passes through 2 × N gates (two per full adder for carry generation). For N = 32 bits this is 64 gate delays — about 320 ns in old TTL. A modern CPU runs its adder in well under a nanosecond — we need a fundamentally different scheme.
7.2 The "generate" and "propagate" insight
For each bit position i, define two cheap signals:
G_i = A_i AND B_i // this stage GENERATES a carry no matter what
P_i = A_i XOR B_i // this stage PROPAGATES an incoming carry
Then the carry-out of any stage is
C_{i+1} = G_i OR (P_i AND C_i)
The key trick: unrolling this recursion exposes every carry directly in terms of the primary inputs — no rippling!
C_1 = G_0 + P_0·C_0
C_2 = G_1 + P_1·G_0 + P_1·P_0·C_0
C_3 = G_2 + P_2·G_1 + P_2·P_1·G_0 + P_2·P_1·P_0·C_0
C_4 = G_3 + P_3·G_2 + P_3·P_2·G_1 + P_3·P_2·P_1·G_0 + P_3·P_2·P_1·P_0·C_0
Every carry now needs only 2 gate delays to settle — one for the AND tree, one for the OR — regardless of how many stages back the originating carry came from.
7.3 The 4-bit carry-look-ahead adder
A 4-bit CLA block has:
- Four P / G generators (one per bit) — 1 gate delay.
- A look-ahead block that computes C1, C2, C3, C4 in parallel — 2 gate delays.
- Four final XORs that combine P_i with C_i to produce S_i — 1 gate delay.
Total worst-case delay: 4 gate delays, independent of width, for one CLA block.
7.4 Scaling beyond 4 bits — hierarchical CLA
To build a 16-bit or 64-bit adder, cascade 4-bit CLA blocks. Two levels of "block generate" / "block propagate" signals let the upper carries fly across the whole word in just a few more gate delays — that is how a 64-bit adder runs in ~6-7 gate levels.
The standard 4-bit CLA chip is the 74LS181 / 74181 ALU slice, and the look-ahead carry generator chip is the 74LS182. Cascading 74182s with 74181s gives you a fast 16-bit or 32-bit ALU out of two chips per slice.
7.5 The serial adder — fewest gates, slowest
The exact opposite of CLA: use one full adder plus a 1-bit carry flip-flop, and serially shift A and B into it bit-by-bit (LSB first):
Each clock cycle: the FA reads one bit from each operand and the stored carry, writes the Sum bit back into the shift register, and latches the new Cout into the carry flip-flop. After N clocks the sum is complete.
| Property | Ripple parallel adder | Serial adder | CLA |
|---|---|---|---|
| Gates | Many (4N) | Very few (1 FA + 1 FF) | Many (more than ripple) |
| Time | 2N gate delays | N clock cycles | ~ 4 gate delays |
| Use case | Default for small N | Wristwatches, smart cards | High-speed CPUs |
7.6 What an ALU actually is
An Arithmetic and Logic Unit is a combinational block that, given two operands and a "function select" code, can produce either an arithmetic result or a logical result.
The standard textbook example is the 74181 4-bit ALU, which has four function-select inputs (S3 S2 S1 S0) plus a Mode pin (M):
- When M = 0 it is in arithmetic mode — sixteen functions like ADD, SUB, INC, DEC, NEG, …
- When M = 1 it is in logic mode — sixteen functions like AND, OR, XOR, NAND, …
That gives 32 distinct operations from one chip, all in ~22 ns. The 74181 was the heart of every minicomputer ALU through the 1970s.
74181 partial function table (a sample of the 32)
| M | S3 S2 S1 S0 | Arithmetic (M=0) | Logical (M=1) |
|---|---|---|---|
| – | 0000 | A | NOT A |
| – | 0001 | A OR B | NOT (A OR B) |
| – | 0011 | minus 1 | logical 0 |
| – | 0110 | A minus B minus 1 | A XOR B |
| – | 1001 | A plus B | A XNOR B |
| – | 1111 | A | logical 1 |
(There are 32 entries in total — the chip is essentially a 32-function ROM realised in TTL.)
7.7 Elementary ALU design — the one-bit slice approach
To design an N-bit ALU from scratch, build a 1-bit slice and then replicate it N times.
To go from 4 bits to 16 bits, cascade four slices; to go faster, replace ripple carry with CLA. That is conceptually the entire ALU of a simple microprocessor.
7.8 Worked example — choosing the right adder
Design a 32-bit unsigned adder that must complete in 5 ns. Which architecture?
- Ripple carry: 32 stages × ~0.5 ns / stage ≈ 16 ns. Too slow.
- Serial adder: 32 clock cycles. Way too slow.
- Hierarchical CLA (4-bit blocks, 2 levels): ~ 4 × 0.5 ns = 2 ns. ✅
So the only architecture that hits the budget is a CLA. This is exactly the reasoning a CPU designer applies — and is why every modern CPU uses some form of CLA or Kogge-Stone tree adder.