Sahithyan's S3 — Computer Architecture

Cache

A small amount of memory, ranging from a few KB to a few MB. Built using SRAM. Slower than registers but faster than main memory. Used to store frequently accessed data. A transparent speedup mechanism: performance improves even though the program's instructions remain unchanged.

Cache Block

Minimum unit of data transferred between cache and main memory. Tagged with its memory address; tags are searched in parallel. A cache block stores data, the tag for the memory address, and some other metadata.

Multiple blocks are moved between levels in the cache hierarchy.

Hit Rate

Fraction of memory accesses found in the cache.

Hit Time

Time to access a block in the cache.

Cache Miss

When the required item is not found in the cache.

3 types:

  • Compulsory: first access to a block; it cannot yet be in the cache
  • Capacity: the cache cannot hold all the blocks the program needs
  • Conflict (collision): space is left, but multiple blocks compete for the same set

Miss Rate

Fraction of memory accesses not found in the cache.

Miss Penalty

Time to bring a block from memory plus deliver it to the processor.
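The quantities above combine into the standard average memory access time (AMAT) formula. A minimal sketch; the numbers in the usage line are illustrative assumptions, not values from the text:

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time (in cycles):
    AMAT = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed example: 1-cycle hits, 5% miss rate, 100-cycle miss penalty
# gives 1 + 0.05 * 100 = 6 cycles on average.
print(amat(1, 0.05, 100))
```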

Memory is divided into blocks (aka. lines) to be stored in cache.

A memory address is split into tag, index, and offset.

  • Offset
    Selects the byte (or word) within a block. If the block size is 64 bytes, the least significant 6 bits select a location within the block.
  • Index
    Denotes the set. If the total number of sets is 2^n, then n bits are used here.
  • Tag
    Identifies which memory block is stored in the set’s line. Made up of the bits remaining after the index and offset.
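The field split above can be sketched with bit operations. A minimal example, assuming a 64-byte block (6 offset bits) and 256 sets (8 index bits); both sizes are assumptions chosen for illustration:

```python
OFFSET_BITS = 6   # block size = 2^6 = 64 bytes (assumed)
INDEX_BITS = 8    # number of sets = 2^8 = 256 (assumed)

def split_address(addr: int) -> tuple[int, int, int]:
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)            # lowest 6 bits
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # next 8 bits
    tag = addr >> (OFFSET_BITS + INDEX_BITS)            # remaining bits
    return tag, index, offset

print(split_address(0x12345678))  # → (18641, 89, 56)
```

Reassembling the fields (`tag << 14 | index << 6 | offset`) recovers the original address, which is a quick sanity check on the split.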

A small table is used to record cache entries. Each entry, selected by the index, stores:

  • tag
  • valid bit (1 bit)

Mapping

Defines how memory blocks are mapped into cache locations. Directly impacts miss rate, complexity, and speed.

Direct Mapping

Each memory block maps to exactly one cache line. The simplest mapping.

Index = Block address mod Number of lines

Simple implementation. Faster. High risk of conflict misses.

Used in embedded and high-speed systems where simplicity and speed outweigh flexibility.
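The modulo mapping above can be sketched in a few lines, assuming a 256-line cache (an illustrative size). Block addresses that are 256 apart collide on the same line, which is exactly the conflict-miss risk noted above:

```python
NUM_LINES = 256  # assumed cache size in lines

def line_for(block_address: int) -> int:
    """Direct-mapped placement: Index = block address mod number of lines."""
    return block_address % NUM_LINES

# Blocks 0, 256 and 512 all map to line 0 and evict each other,
# even if every other line in the cache is empty.
print(line_for(0), line_for(256), line_for(512))  # → 0 0 0
print(line_for(300))                              # → 44
```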

Fully Associative Mapping

A memory block can go anywhere in the cache.

No fixed mapping; all cache lines are searched for a match. Minimizes conflict misses. Requires complex hardware; slower and more expensive to implement.

Used in small caches (like TLBs) where flexibility is more important than access speed.

Set-Associative Mapping

A balance between the above 2. Each set contains N lines (an N-way set-associative cache). A memory block maps to exactly one set, but can be placed in any line within that set. Reduces conflict misses with moderate hardware cost. Lines within a set are searched in parallel.

Replacement Policies

When all lines in a set are full, one must be replaced. Common policies:

  • LRU (Least Recently Used) – replace the least recently accessed line.
  • Random – randomly choose a line.
  • FIFO – replace the oldest line in the set.
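The LRU policy can be sketched for a single set. This is a minimal software illustration, not how real hardware tracks recency; the `CacheSet` class and its parameters are assumptions. An `OrderedDict` keeps lines in recency order, with the least recently used tag at the front:

```python
from collections import OrderedDict

class CacheSet:
    """One set of an N-way set-associative cache with LRU replacement."""

    def __init__(self, ways: int = 4):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> block data, in recency order

    def access(self, tag: int) -> bool:
        """Return True on hit, False on miss (the block is then loaded)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)   # mark as most recently used
            return True
        if len(self.lines) >= self.ways:  # set full: evict the LRU line
            self.lines.popitem(last=False)
        self.lines[tag] = None            # load the new block
        return False

s = CacheSet(ways=2)
print(s.access(1))  # miss
print(s.access(2))  # miss
print(s.access(1))  # hit: tag 1 becomes most recently used
print(s.access(3))  # miss: evicts tag 2, the LRU line
print(s.access(2))  # miss again, since 2 was just evicted
```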

Cache Hierarchy

Modern processors use a multi-level cache hierarchy.

L1 Cache

The fastest and smallest cache. Located within each CPU core. Divided into L1 Instruction (I-cache) and L1 Data (D-cache). Supplies the CPU with immediate access to the most frequently used instructions and data. Size ranges from 16 KB to 128 KB. Access latency is typically 1-5 cycles.

L2 Cache

Acts as an intermediate buffer between L1 and L3 or main memory. Larger and slower than L1. Usually private to each core in modern processors. Size ranges from 256 KB to 2 MB. Access latency is typically 5-15 cycles.

L3 Cache

Larger and slower. Often shared among all cores on a CPU chip. Acts as a global buffer before main memory. Implemented off-core but on the same die (on-chip shared cache). Size ranges from 4 MB to 64 MB. Access latency is typically 20-50 cycles.
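The per-level figures above combine into a multi-level AMAT: a miss at each level pays the access time of the next. All latencies and miss rates below are illustrative assumptions chosen within the quoted ranges:

```python
# Assumed per-level hit latencies (cycles) and local miss rates.
l1_hit, l1_miss = 4, 0.10
l2_hit, l2_miss = 12, 0.05
l3_hit, l3_miss = 30, 0.02
mem = 200  # assumed main-memory latency in cycles

# A miss at each level pays the (recursive) access time of the next level.
amat = l1_hit + l1_miss * (l2_hit + l2_miss * (l3_hit + l3_miss * mem))
print(amat)  # roughly 5.4 cycles on average
```

Despite the 200-cycle memory latency, the hierarchy keeps the average access close to the L1 hit time, which is the point of the multi-level design.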

Improving Cache Performance

Cache capacity can be increased, to a certain extent. Predicting and prefetching can improve performance. A fully associative cache avoids conflict misses and improves hit rate.