CPU cache

Posted on Aug 22, 2024

What is CPU cache

CPU cache is a high-speed but small memory located on or near the processor core.

Why CPU cache

Its primary purpose is to store frequently accessed data and instructions. Because the CPU can reach the cache much faster than main memory, keeping this information close to the core reduces the latency of memory accesses.

The memory hierarchy

Memory hierarchy is a structured arrangement of storage systems in a computer, organized by speed, size and cost.

It goes as follows (from fastest to slowest):

  1. Registers

    • Fastest and smallest storage located within the CPU
  2. L1i/d, L2 and L3 Cache

    • Multi-level caches that provide fast access to frequently used data, with L1 being the fastest and smallest and L3 the slowest and largest.
  3. Main memory (RAM)

    • Larger but slower memory used to store data and instructions currently in use by the system.
  4. Long-term storage (flash, spinning disk)

    • Non-volatile storage that holds data long-term; access is slower than RAM.
  5. Remote storage (Internet)

    • Usually slower than local long-term storage due to network latency and the overhead of communication between computers.

So, cache memory sits between the CPU and the main memory, serving as a fast buffer that stores copies of frequently accessed instructions and data, thus reducing the time it takes for the CPU to retrieve data.

Before CPU cache

Before the introduction of CPU caches, processors had to fetch data directly from main memory for every operation. This process was slow and inefficient due to the significant speed difference between the CPU and main memory. As CPU speeds increased, this disparity became more pronounced, leading to performance bottlenecks.

Comparison of speed

Here’s a comparison of typical access times across different levels of the memory hierarchy. Note that these values are hardware-dependent, so they are only meant to give a general idea:

Memory type             Typical access time   Approx. CPU cycles
Registers               0.2 - 1 ns            ~1 cycle
L1 cache                0.5 - 3 ns            1 - 3 cycles
L2 cache                5 - 20 ns             15 - 60 cycles
L3 cache                10 - 40 ns            30 - 120 cycles
Main memory (DRAM)      50 - 100 ns           150 - 300 cycles
Solid-state drive       10 - 100 μs           30,000 - 300,000 cycles
Hard-disk drive         5 - 20 ms             15,000,000 - 60,000,000 cycles
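
The cycle column follows directly from the access time and the clock frequency: cycles ≈ access time in ns × clock frequency in GHz. A minimal sketch of that conversion, assuming a 3 GHz clock (the table’s cycle counts roughly correspond to a clock in that range; the exact frequency is an assumption):

    #include <stdio.h>

    int main(void) {
        /* Assumed clock frequency; not fixed by the table above. */
        const double clock_ghz = 3.0;

        /* Illustrative access times in nanoseconds for a few levels. */
        const char  *level[]   = { "L1 cache", "L3 cache", "Main memory (DRAM)" };
        const double time_ns[] = { 1.0, 30.0, 100.0 };

        for (int i = 0; i < 3; i++) {
            /* cycles = time (ns) * clock rate (cycles per ns) */
            double cycles = time_ns[i] * clock_ghz;
            printf("%-20s %6.1f ns  ~%.0f cycles\n", level[i], time_ns[i], cycles);
        }
        return 0;
    }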

Cache memory provides much faster access times compared to main memory, which is why it is crucial for reducing latency and improving the performance of the CPU.

How CPU cache works

Cache lines, cache hits and misses

CPU cache operates by storing copies of data from frequently accessed main memory locations. When the CPU needs to access data, it first checks whether the data is available in the cache. If the data is found (cache hit), it is returned quickly. If the data is not found (cache miss), the CPU must fetch the data from the slower main memory.

Modern CPUs don’t just fetch the exact data needed for a particular instruction when accessing memory. Instead, they fetch an entire cache line.

A cache line is a contiguous block of memory, typically ranging from 32 to 128 bytes in size, depending on the processor’s architecture. The idea is that when one piece of data is accessed, the data adjacent to it is also likely to be accessed soon (this is known as spatial locality).

By fetching the entire cache line, the CPU increases the chances of future accesses resulting in cache hits, thereby reducing the need for further memory accesses.

For example, when accessing an element of an array, it is highly probable that nearby elements will also be accessed shortly. By loading the entire cache line, the CPU can serve future requests directly from the cache, which significantly speeds up data retrieval.
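
A minimal sketch of this effect, assuming 64-byte cache lines (a common but architecture-dependent size): with a stride of 1, each fetched line serves 16 consecutive int accesses, while with a stride of 16 ints nearly every access lands on a different line.

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)   /* 16M ints (64 MiB), far larger than typical caches */

    /* Sums every 'stride'-th element. With stride 1, a 64-byte cache line
     * serves 16 consecutive int accesses; with stride 16, every access
     * touches a new cache line. */
    static long sum_with_stride(const int *a, size_t n, size_t stride) {
        long sum = 0;
        for (size_t i = 0; i < n; i += stride)
            sum += a[i];
        return sum;
    }

    int main(void) {
        int *a = malloc(N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < N; i++) a[i] = 1;

        printf("stride 1:  %ld\n", sum_with_stride(a, N, 1));
        printf("stride 16: %ld\n", sum_with_stride(a, N, 16));

        free(a);
        return 0;
    }

Timing the two calls typically shows the strided version spending far more time per element touched, even though it performs fewer additions, because it cannot reuse the cache lines it fetches.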

This strategy enhances the overall efficiency of the cache system by reducing the number of cache misses, particularly the compulsory misses (those that occur the first time a data item is accessed).

Cache temporal locality and spatial locality

Cache performance is heavily influenced by the principles of temporal locality and spatial locality:

Temporal Locality: recently accessed data is likely to be accessed again soon. Caches take advantage of temporal locality by storing recently accessed data, anticipating that the CPU will need it again.

Spatial Locality: data located near recently accessed data is likely to be accessed soon. Caches utilize spatial locality by storing blocks of data that include the requested information and adjacent memory addresses.
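
As a sketch of spatial locality in practice: C stores two-dimensional arrays in row-major order, so traversing a matrix row by row walks memory sequentially and reuses each fetched cache line, while traversing it column by column jumps an entire row’s worth of bytes between consecutive accesses (the matrix size below is arbitrary, chosen only to exceed typical cache capacities):

    #include <stddef.h>

    #define ROWS 4096
    #define COLS 4096

    /* Row-major traversal: consecutive iterations touch adjacent memory,
     * so each cache line is fully used before moving on. */
    long sum_row_major(const int m[ROWS][COLS]) {
        long sum = 0;
        for (size_t i = 0; i < ROWS; i++)
            for (size_t j = 0; j < COLS; j++)
                sum += m[i][j];
        return sum;
    }

    /* Column-major traversal: consecutive iterations are COLS * sizeof(int)
     * bytes apart (16 KiB here), so on a large matrix nearly every access
     * misses the cache. */
    long sum_col_major(const int m[ROWS][COLS]) {
        long sum = 0;
        for (size_t j = 0; j < COLS; j++)
            for (size_t i = 0; i < ROWS; i++)
                sum += m[i][j];
        return sum;
    }

Both functions compute the same sum; only the access order differs, and on large matrices the row-major version is typically several times faster.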

The 3 C’s of cache misses

Cache misses can be categorized into three types, known as the 3 C’s:

  1. Compulsory Misses: when data is accessed for the first time and must be loaded into the cache.

  2. Capacity Misses: when the cache is too small to hold all the data required by the CPU, leading to some data being evicted and causing misses (see the sketch after this list).

  3. Conflict Misses: these occur in set-associative or direct-mapped caches when multiple data items compete for the same cache line, leading to evictions and misses.
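
A minimal sketch of the difference between compulsory and capacity misses, assuming typical cache sizes (roughly 32 KiB for L1 and a few MiB for L3; both figures are assumptions, not fixed values):

    #include <stdlib.h>

    /* Repeatedly scans a working set of 'bytes' bytes. If the working set
     * fits in a cache level, only the first pass pays compulsory misses and
     * later passes hit in the cache; if it does not fit, every pass suffers
     * capacity misses as lines are evicted before they can be reused. */
    long scan_working_set(size_t bytes, int passes) {
        size_t n = bytes / sizeof(int);
        int *a = calloc(n, sizeof *a);
        if (!a) return 0;
        long sum = 0;
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < n; i++)
                sum += a[i];
        free(a);
        return sum;
    }

    /* For example, scan_working_set(16 * 1024, 100) should stay resident in
     * a ~32 KiB L1, while scan_working_set(64 * 1024 * 1024, 100) exceeds
     * most L3 caches and keeps missing (both sizes are assumptions). */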

Replacement Algorithms

When the cache is full, an existing entry has to be evicted to make room for new data. Which entry is chosen depends on the replacement algorithm in use:

  1. FIFO: The oldest data gets replaced. Not very good, because it ignores how recently or how often the data has actually been used.

  2. LRU: The data that hasn’t been used for the longest time gets replaced. Most commonly used in multitasking environments (see the sketch after this list).

  3. LFU: The least frequently accessed data gets replaced. Can be useful in some embedded scenarios with stable, repetitive workloads.

  4. Random: A random block is replaced, which is the simplest and cheapest to implement.
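
Below is a toy model of LRU replacement within a single 4-way cache set, using a logical timestamp to track recency. Real hardware typically approximates LRU with cheaper pseudo-LRU bits, so this is only an illustration of the policy, not of an actual circuit:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4   /* assumed associativity of one cache set */

    typedef struct {
        bool     valid;
        uint64_t tag;
        uint64_t last_used;  /* logical timestamp for LRU */
    } CacheLine;

    /* Looks up 'tag' in one set; on a miss, fills an empty way or evicts
     * the least recently used one. Returns true on a hit. */
    bool access_set(CacheLine set[WAYS], uint64_t tag, uint64_t now) {
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {
                set[w].last_used = now;   /* hit: refresh recency */
                return true;
            }
        }
        /* Miss: prefer an invalid way, otherwise the least recently used. */
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) { victim = w; break; }
            if (set[w].last_used < set[victim].last_used) victim = w;
        }
        set[victim] = (CacheLine){ .valid = true, .tag = tag, .last_used = now };
        return false;
    }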

CPU Cache Address and Tag

The cache address and tag system is used to quickly determine whether the data the CPU needs is present in the cache. This process is fundamental to how caches operate, and it works differently depending on the cache organization (fully associative, direct mapped, or set-associative).
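
As a concrete illustration, here is how an address can be split into tag, set index, and byte offset for an assumed geometry of 32 KiB, 8-way set-associative, 64-byte lines, which gives 32768 / (8 ways × 64 bytes) = 64 sets (these numbers are typical of an L1 data cache but are assumptions here):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 64   /* assumed bytes per cache line */
    #define NUM_SETS  64   /* assumed: 32 KiB / (8 ways * 64 bytes) */

    int main(void) {
        uint64_t addr = 0x7ffdeadbeef0;  /* arbitrary example address */

        uint64_t offset = addr % LINE_SIZE;               /* byte within the line  */
        uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;  /* which set (the index) */
        uint64_t tag    = addr / LINE_SIZE / NUM_SETS;    /* identifies the block  */

        printf("address 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
               (unsigned long long)addr, (unsigned long long)tag,
               (unsigned long long)set, (unsigned long long)offset);
        return 0;
    }

The cache compares the stored tag of the selected line (or lines) against this computed tag to decide whether the access is a hit.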

Fully Associative vs Direct Mapped vs Set-Associative Cache

Caches manage where data is placed and how it is looked up using three main organizations (a small lookup sketch follows this list):

  1. Fully Associative Cache

How it works: any block of data from memory can be stored in any cache line. There are no restrictions on placement within the cache.

Tag comparison: CPU compares the tag (Block ID in the memory address) stored in every cache line to find a match. This involves comparing many bits of data, making the process complex.

Circuit complexity: Since the CPU has to check every line, the circuit is more complex, leading to higher costs in both design and implementation.

Use case: Fully associative caches are usually found in small, specialized caches where flexibility and minimizing cache misses are critical.

  2. Direct Mapped Cache

How it works: each block of memory has only one possible location in the cache, determined by the lower bits of the memory address (Line ID).

Tag comparison: CPU only needs to compare the tag of the requested data with the tag of the cache line it is mapped to, making the comparison simple and fast.

Circuit complexity: The design is more straightforward, which makes direct mapped caches fast and cheap to build.

Thrashing: When two or more frequently used memory blocks map to the same cache line, they keep evicting each other. This continuous eviction is called thrashing, and it significantly degrades performance because the CPU spends more time fetching data from slower memory.

Use case: Direct mapped caches are used in cost-sensitive designs where simplicity and speed are prioritized, but they can suffer from thrashing.

  3. Set-Associative Cache

How it works: set-associative cache is a middle ground between fully associative and direct mapped caches. The cache is divided into sets, and each block of memory can be stored in any cache line within its designated set. The set is determined by the lower bits of the memory address (Set ID), while the data can go into any line within that set.

Tag comparison: CPU compares the tag of the requested data with the tags of all cache lines within the set. This reduces the chance of thrashing compared to direct mapped caches, as data has more placement options.

Circuit complexity: The circuit is more complex than in a direct mapped cache but simpler than in a fully associative cache.

Use case: Set-associative caches are common in modern CPUs because they balance the need to avoid thrashing with reasonable hardware complexity.
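
The three organizations differ only in how many lines (ways) each set contains: a direct mapped cache is effectively 1-way set-associative, and a fully associative cache is a single set holding every line. A minimal lookup sketch under assumed parameters (64 lines total, 64-byte lines; 'ways' must divide the total line count):

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE   64
    #define TOTAL_LINES 64

    typedef struct { bool valid; uint64_t tag; } Line;

    /* Lookup in a cache with 'ways' lines per set:
     *   ways == 1           -> direct mapped (one possible location)
     *   ways == TOTAL_LINES -> fully associative (one big set, any location)
     *   anything in between -> set-associative
     * Returns true on a hit; on a miss the caller would choose a victim
     * within the set using a replacement policy such as LRU. */
    bool lookup(const Line cache[TOTAL_LINES], int ways, uint64_t addr) {
        int      sets  = TOTAL_LINES / ways;
        uint64_t block = addr / LINE_SIZE;   /* strip the byte offset        */
        uint64_t set   = block % sets;       /* the index picks the set      */
        uint64_t tag   = block / sets;       /* the tag identifies the block */

        const Line *s = &cache[set * ways];  /* the 'ways' lines of this set */
        for (int w = 0; w < ways; w++)
            if (s[w].valid && s[w].tag == tag)
                return true;                 /* tag match: cache hit */
        return false;                        /* cache miss */
    }

Calling lookup with ways = 1 models a direct mapped cache, ways = 4 or 8 a typical set-associative cache, and ways = TOTAL_LINES a fully associative one.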

At the hardware level, cache misses can be reduced by changing capacity, block size, and/or associativity.

Cache Write Policies, Flag Bits, and Split Caches

TODO

References

  • Harris, S. L., & Harris, D. M. (2021). Digital design and computer architecture: RISC-V edition. Morgan Kaufmann.
  • Patterson, D. A., & Hennessy, J. L. (2020). Computer organization and design RISC-V edition: The hardware software interface (2nd ed.). Morgan Kaufmann.
  • Hennessy, J. L., & Patterson, D. A. (2017). Computer architecture: A quantitative approach (6th ed.). Morgan Kaufmann.