Memory Hierarchy Pyramid
---
SRAM vs DRAM Comparison
| Feature | SRAM | DRAM |
|---|---|---|
| Technology | Bistable flip-flop (6 transistors per cell) | Capacitor + transistor (1 transistor per cell) |
| Speed | Very fast (~1–5 ns) | Slower (~50–100 ns) |
| Power | Higher (always drawing current) | Lower (standby is low power) |
| Cost per bit | High | Low |
| Density | Low (6T per cell = larger) | High (1T per cell = denser) |
| Refresh needed | No (static) | Yes — capacitors leak; refresh every ~64 ms |
| Volatile | Yes | Yes |
| Use | CPU cache (L1, L2, L3) | Main memory (RAM DIMMs) |
---
Auxiliary Memory
| Type | Technology | Typical Capacity | Access Time | Cost/GB | Best Use |
|---|---|---|---|---|---|
| HDD | Magnetic disk (spinning) | 1–20 TB | 5–10 ms | Very Low | Mass storage, backups |
| SSD (SATA) | NAND Flash | 128 GB–4 TB | 50–100 µs | Low | OS, applications |
| SSD (NVMe) | NAND Flash via PCIe | 256 GB–8 TB | 20–50 µs | Medium | High-perf storage |
| Optical (BD) | Laser on disc | 25–100 GB | 80–120 ms | Very Low | Media distribution |
| Magnetic Tape | Magnetic coating | 10–300 TB/cartridge | seconds–minutes | Extremely Low | Long-term archival |
---
Associative Memory (Content-Addressable Memory, CAM)
Unlike regular RAM (indexed by address), CAM is indexed by content:
- You provide a search key (data) and CAM returns the address of matching entry
- All cells compared simultaneously in parallel (hardware comparators in every cell)
- Very fast search O(1) regardless of table size
- Applications: TLB (Translation Lookaside Buffer), routing tables, cache tag matching
| Feature | RAM | CAM |
|---|---|---|
| Access method | Address → Data | Data → Address |
| Search time | O(1) by address | O(1) by content |
| Hardware cost | Low (passive cells) | High (comparator per cell) |
| Density | High | Low (6–10× less dense than RAM) |
| Application | General storage | TLB, cache tags, network routers |
---
Cache Mapping Techniques
1. Direct Mapping
Block i maps to cache line: line = i mod (number of cache lines)
| Address | Tag | Line (Index) | Block Offset |
|---|---|---|---|
| 32-bit | Upper bits | Middle bits (log₂ lines) | Lower bits (log₂ block_size) |
Problem: Two heavily used blocks mapping to same line cause constant evictions (conflict misses).
2. Fully Associative
Any block can go in any cache line. Requires comparing tag against all lines simultaneously.
Advantage: No conflict misses. Disadvantage: Very expensive hardware (n comparators for n lines).
3. k-Way Set-Associative
Cache divided into S sets of k lines each. Block maps to set = (block number) mod S, but within the set, can go in any of k lines.
Compromise: k=2 or k=4 way gives most of the benefit of fully associative with modest hardware.
---
Cache Mapping Comparison
| Scheme | Block Placement | Tag Compare | Conflict Misses | Hardware Cost |
|---|---|---|---|---|
| Direct Mapped | 1 fixed line | 1 comparator | High | Very Low |
| 2-Way Set-Assoc | 1 of 2 lines in a set | 2 comparators | Reduced | Low |
| 4-Way Set-Assoc | 1 of 4 lines in a set | 4 comparators | Lower | Medium |
| Fully Associative | Any line | n comparators | Minimal | Very High |
---
Cache Performance Formula
Average Access Time (AAT) = h × Tc + (1−h) × Tm
Where: h = hit rate, Tc = cache access time, Tm = main memory access time
Worked Example:
- Hit rate h = 0.90, Cache time Tc = 5 ns, Main memory Tm = 100 ns
- AAT = 0.90 × 5 + 0.10 × 100 = 4.5 + 10 = 14.5 ns
- Compare to no cache: 100 ns → 6.9× speedup!
---
Cache Replacement Policies
| Policy | Description | Implementation | Performance | Notes |
|---|---|---|---|---|
| LRU (Least Recently Used) | Evict block not used for longest time | Age counter per line; complex | Near-optimal | Best for temporal locality |
| FIFO (First In First Out) | Evict oldest loaded block | Simple queue | Good | Can suffer Belady's anomaly |
| LFU (Least Frequently Used) | Evict block with lowest access count | Frequency counter | Good for repeated access | Can retain stale old data |
| Random | Evict a randomly selected block | Simple random selector | Surprisingly good | Simple hardware |
| Optimal (Belady) | Evict block not used for longest future time | Requires future knowledge | Best possible | Theoretical benchmark only |
---
Virtual Memory Address Translation
---
Page Replacement Algorithms (3 frames, ref string: 1,2,3,4,1,2,5,1,2,3,4,5)
| Algorithm | Page Faults | Description | Pros | Cons |
|---|---|---|---|---|
| FIFO | 9 | Evict page loaded first | Simple; no hardware needed | Belady's anomaly; ignores usage |
| LRU | 8 | Evict page not used for longest time | Near-optimal; exploits locality | Complex hardware (or software overhead) |
| Optimal (OPT) | 7 | Evict page not used for longest future time | Minimum possible faults | Not implementable (needs future knowledge) |
| Clock (2nd Chance) | ~8 | FIFO but give pages a second chance via reference bit | Good approximation of LRU | Slight overhead; large page tables |
| LFU | Variable | Evict least frequently used page | Good for stable hot set | Old pages with high historical count persist |
---
Belady's Anomaly
Normally, more page frames → fewer page faults. But FIFO can exhibit the opposite:
- Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
- 3 frames (FIFO): 9 page faults
- 4 frames (FIFO): 10 page faults ← more faults with more frames!
This is Belady's Anomaly — specific to FIFO. LRU and Optimal do NOT exhibit it (they are stack algorithms). Solution: use LRU or Clock algorithm instead of FIFO.
---
Study Deep: TLB Shootdowns in Multicore Systems
When an OS unmaps a virtual page (e.g., during munmap or context switch), it must invalidate TLB entries in ALL CPU cores that might have cached the mapping. This requires:
- CPU A sends an IPI (Inter-Processor Interrupt) to all other CPUs
- Each CPU receives the IPI and executes INVLPG (invalidate TLB page entry)
- CPU A waits for all CPUs to acknowledge the TLB flush
- CPU A resumes
This TLB shootdown is a significant performance bottleneck in:
- High-performance computing (many processes sharing memory)
- Virtualization (hypervisor managing guest VM page tables)
- Databases (frequent mmap/munmap for large datasets)
📝 Exam Tips: - SRAM: 6 transistors per cell, fast, no refresh, expensive — used for cache - DRAM: 1 transistor + capacitor per cell, slow, needs refresh — used for main memory - Cache hit rate of 90% with 5 ns cache and 100 ns RAM → AAT = 14.5 ns - Direct mapping: simple but conflict misses; Set-associative: compromise; Fully-associative: best hit rate but most hardware - Belady's anomaly: only FIFO suffers it. LRU is a stack algorithm — immune to Belady's