Small amount of memory, ranging from a few KB to a few MB. Built using SRAM. Slower than registers but faster than main memory. Used to store frequently accessed data. A transparent speedup mechanism: performance improves even though the program instructions remain unchanged.
Cache block
Minimum unit of data transferred between the cache and main memory. Tagged with a memory address; tags are searched in parallel. A cache block stores the data, a tag for the memory address, and some other metadata.
Multiple blocks are moved between levels in the cache hierarchy.
Hit ratio
Fraction of memory accesses found in the cache (hits ÷ total accesses).
Hit time
Time to access a block in cache.
Cache miss
When the required item is not found in the cache.
3 types:
- Compulsory: the first access to a block; it cannot already be in the cache
- Capacity: the cache cannot hold all the blocks the program needs, so blocks are evicted and later re-fetched
- Conflict (collision): free space remains elsewhere, but multiple blocks compete for the same set
Miss ratio
Fraction of memory accesses not found in the cache. Equal to 1 − hit ratio.
Miss penalty
Time to bring a block from memory and deliver it to the processor.
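These quantities combine into the standard average memory access time (AMAT) formula:

$$
\text{AMAT} = \text{hit time} + \text{miss ratio} \times \text{miss penalty}
$$

For example, with a 1 ns hit time, a 5% miss ratio, and a 100 ns miss penalty: AMAT = 1 + 0.05 × 100 = 6 ns.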
Cache Organization
Memory is divided into blocks (also called lines) to be stored in the cache.
Memory address split
A memory address is split into tag, index, and offset.
- Offset
Selects the byte within a block. If the block size is 64 bytes, the least significant $\log_2 64 = 6$ bits are used.
- Index
Denotes the set. If the total number of sets is $2^s$, then $s$ bits are used here.
- Tag
Identifies which memory block is stored in the set’s line. The remaining bits after offset and index are used.
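The split can be computed with shifts and masks. A minimal sketch in C, assuming a hypothetical cache with 64-byte blocks and 64 sets (6 offset bits, 6 index bits):

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64-byte blocks, 64 sets. */
#define OFFSET_BITS 6  /* log2(block size)     */
#define INDEX_BITS  6  /* log2(number of sets) */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```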
Cache Entry Table
A small table is used to record cache entries. The table stores:
- tag
- index
- valid bit (1 bit)
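In hardware the index selects the entry's position in the table, so it is not usually stored as a separate field; the sketch below follows that convention. A minimal illustration in C, with an assumed 64-byte block and illustrative field widths:

```c
#include <stdint.h>

#define BLOCK_SIZE 64  /* assumed 64-byte block */

/* Illustrative cache entry; real hardware packs these bits tightly. */
struct cache_entry {
    uint8_t  valid;            /* 1-bit valid flag        */
    uint32_t tag;              /* tag bits of the address */
    uint8_t  data[BLOCK_SIZE]; /* the cached block itself */
};
```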
Associativity
Defines how memory blocks are mapped into cache locations. Directly impacts miss rate, complexity, and speed.
Direct-Mapped
Each memory block maps to exactly one cache line. Mapping is simpler.
Simple implementation and fast lookups. High risk of conflict misses.
Used in embedded and high-speed systems where simplicity and speed outweigh flexibility.
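As a worked example, take a hypothetical direct-mapped cache with 256 lines of 64 bytes each. The mapping is:

$$
\text{line} = \left\lfloor \frac{\text{address}}{\text{block size}} \right\rfloor \bmod \text{number of lines}
$$

For address 0x12345 (74565): block address $= \lfloor 74565 / 64 \rfloor = 1165$, line $= 1165 \bmod 256 = 141$, offset $= 74565 \bmod 64 = 5$.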
Fully Associative
A memory block can go anywhere in the cache.
No fixed mapping. All cache lines are searched for a match. Eliminates conflict misses. Complex hardware; slower and more expensive to implement.
Used in small caches (like TLBs) where flexibility is more important than access speed.
N-Way Set-Associative
Balanced version between the above two. Each set contains $N$ lines. A memory block maps to exactly one set, but can be placed in any line within that set. Reduces conflict misses at moderate hardware cost. Lines within a set are searched in parallel.
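Hardware compares all $N$ tags of the selected set at once; a software model checks them in a loop. A minimal lookup sketch in C, reusing the hypothetical geometry and `cache_entry` type from the sketches above, with 4 ways:

```c
#include <stddef.h>

#define NUM_WAYS 4   /* assumed 4-way set-associative */
#define NUM_SETS 64  /* matches INDEX_BITS = 6        */

struct cache_entry cache[NUM_SETS][NUM_WAYS];

/* Returns the matching entry, or NULL on a miss. */
struct cache_entry *lookup(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    for (int way = 0; way < NUM_WAYS; way++) {
        struct cache_entry *e = &cache[index][way];
        if (e->valid && e->tag == tag)
            return e; /* hit */
    }
    return NULL;      /* miss */
}
```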
Replacement Policy
When all lines in a set are full, one must be replaced. Common policies:
- LRU (Least Recently Used) – replace the least recently accessed line.
- Random – randomly choose a line.
- FIFO – replace the oldest line in the set.
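A minimal sketch of LRU with per-way age counters in C, under the same assumed geometry (ages saturate at `UINT8_MAX`, so this is exact LRU only while no counter has saturated):

```c
#include <stdint.h>

/* Higher age = touched longer ago. */
static uint8_t age[NUM_SETS][NUM_WAYS];

/* Call on every access to `way` in `set`. */
void touch(uint32_t set, int way) {
    for (int w = 0; w < NUM_WAYS; w++)
        if (w != way && age[set][w] < UINT8_MAX)
            age[set][w]++; /* every other way grows older */
    age[set][way] = 0;     /* accessed way is now youngest */
}

/* Pick the replacement victim: the oldest way. */
int victim(uint32_t set) {
    int oldest = 0;
    for (int w = 1; w < NUM_WAYS; w++)
        if (age[set][w] > age[set][oldest])
            oldest = w;
    return oldest;
}
```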
Cache Levels
Modern processors use a multi-level cache hierarchy.
L1 Cache
The fastest and smallest cache. Located within each CPU core. Divided into L1 Instruction (I-cache) and L1 Data (D-cache). Supplies the CPU with immediate access to the most frequently used instructions and data. Size ranges from 16 KB to 128 KB. Access latency is typically 1-5 cycles.
L2 Cache
Acts as an intermediate buffer between L1 and L3 or main memory. Larger and slower than L1. Usually private to each core in modern processors. Size ranges from 256 KB to 2 MB. Access latency is typically 5-15 cycles.
L3 Cache
Larger and slower than L2. Often shared among all cores on a CPU chip. Acts as a global buffer before main memory. Implemented off-core but on the same die (on-chip shared cache). Size ranges from 4 MB to 64 MB. Access latency is typically 20-50 cycles.
Increasing Cache Performance
Cache capacity can be increased only to a certain extent. Prediction and prefetching can improve performance. A fully associative cache avoids conflict misses and improves the hit rate.
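Prefetching can also be requested from software. GCC and Clang expose `__builtin_prefetch`; a sketch of prefetching ahead during a linear scan (the lookahead distance of 8 is an assumption to tune per machine):

```c
#include <stddef.h>

/* Sum an array while hinting the cache a few elements ahead. */
long sum(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0 /* read */, 3 /* high locality */);
        total += a[i];
    }
    return total;
}
```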