CPU Cache

Posted on Oct 28, 2024

What is CPU cache

CPU cache is a small, high-speed memory located on or near the processor core. It is designed to store frequently accessed data and instructions, allowing the CPU to retrieve them far more quickly than from main memory. Modern CPUs have multiple levels of cache (L1, L2, and L3), each trading off size against speed to optimize data access and overall system performance.

Why CPU cache

The main purpose of CPU cache is to reduce the time it takes for the CPU to retrieve data from main memory by storing recently or frequently accessed data closer to the CPU. Since cache memory is significantly faster than main memory, using cache reduces the latency associated with memory access. This reduction in latency boosts the CPU’s ability to process instructions efficiently, minimizing bottlenecks that arise due to slower main memory access speeds.

The memory hierarchy

Memory hierarchy refers to the structured arrangement of storage within a computer system, organized by speed, size, and cost.

The hierarchy goes as follows (from fastest to slowest):

  1. Registers

    • Fastest and smallest storage located within the CPU
  2. L1i/d, L2, and L3 Cache

    • Multi-level caches that provide fast access to frequently used data, with L1 being the fastest and L3 the slowest.
  3. Main memory (RAM)

    • Larger but slower memory used to store data and instructions currently in use by the system.
  4. Long-term storage (Flash, spinning disk)

    • Non-volatile storage that holds data long-term; access is slower than RAM.
  5. Remote storage (Internet)

    • Usually slower than local long-term storage due to network latency and the overhead of communication between computers.

So, cache memory sits between the CPU and the main memory, serving as a fast buffer that stores copies of frequently accessed instructions and data, thus reducing the time it takes for the CPU to retrieve data.

Before CPU cache

Before the introduction of CPU caches, processors had to fetch data directly from main memory for every operation. This process was slow and inefficient due to the significant speed difference between the CPU and main memory. As CPU speeds increased, this disparity became more pronounced, leading to performance bottlenecks.

Comparison of speed

Here’s a comparison of typical access times across different levels of the memory hierarchy. Note that these values are hardware-dependent, so they are only meant to give a general idea:

Memory Type             Access Time     Approx. Cycles
Registers               0.2 - 1 ns      1
L1 cache                0.5 - 3 ns      1 - 3
L2 cache                5 - 20 ns       15 - 60
L3 cache                10 - 40 ns      30 - 120
Main memory (DRAM)      50 - 100 ns     150 - 300
Solid-state drive       10 - 100 μs     30,000 - 300,000
Hard-disk drive         5 - 20 ms       15,000,000 - 60,000,000

Cache memory provides much faster access times compared to main memory, which is why it is crucial for reducing latency and improving the performance of the CPU.

How CPU cache works

Cache lines, cache hits and misses

CPU cache operates by storing copies of data from frequently accessed main memory locations. When the CPU needs to access data, it first checks whether the data is available in the cache. If the data is found (cache hit), it is returned quickly. If the data is not found (cache miss), the CPU must fetch the data from the slower main memory.

Modern CPUs don’t just fetch the exact data needed for a particular instruction when accessing memory. Instead, they fetch an entire cache line.

A cache line is a contiguous block of memory, typically ranging from 32 to 128 bytes in size, depending on the processor’s architecture. The idea is that when one piece of data is accessed, the data adjacent to it is also likely to be accessed soon (this is known as spatial locality).

By fetching the entire cache line, the CPU increases the chances of future accesses resulting in cache hits, thereby reducing the need for further memory accesses.

For example, when accessing an element of an array, it is highly probable that nearby elements will also be accessed shortly. By loading the entire cache line, the CPU can serve future requests directly from the cache, which significantly speeds up data retrieval.
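
As a rough sketch of this effect (not from the original article; it assumes 4-byte ints and 64-byte cache lines), the following C program contrasts sequential access, which uses every element of each fetched line, with strided access, which uses only one element per fetched line and therefore pulls in far more lines for the same number of additions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M ints (~64 MiB), far larger than any cache  */
    #define STRIDE 16     /* 16 * 4-byte ints = 64 bytes = one assumed line */

    int main(void)
    {
        int *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;
        for (long i = 0; i < N; i++) a[i] = 1;

        /* Sequential: every element of each fetched cache line is used. */
        clock_t t0 = clock();
        long sum_seq = 0;
        for (long i = 0; i < N; i++) sum_seq += a[i];
        clock_t t1 = clock();

        /* Strided: each pass uses only one element per fetched line, so the
           same N additions pull in roughly 16x as much data from memory.   */
        long sum_str = 0;
        for (long s = 0; s < STRIDE; s++)
            for (long i = s; i < N; i += STRIDE) sum_str += a[i];
        clock_t t2 = clock();

        printf("sequential: %.3fs  strided: %.3fs  (sums %ld %ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum_seq, sum_str);
        free(a);
        return 0;
    }

On most hardware the strided loop is noticeably slower, because it transfers roughly sixteen times as much data from memory even though both loops perform the same number of additions.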

This strategy enhances the overall efficiency of the cache system by reducing the number of cache misses, particularly the compulsory misses (those that occur the first time a data item is accessed).

Cache temporal locality and spatial locality

Cache performance is heavily influenced by the principles of temporal locality and spatial locality:

Temporal Locality: recently accessed data is likely to be accessed again soon. Caches take advantage of temporal locality by storing recently accessed data, anticipating that the CPU will need it again.

Spatial Locality: data located near recently accessed data is likely to be accessed soon. Caches utilize spatial locality by storing blocks of data that include the requested information and adjacent memory addresses.
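
As an illustrative sketch (the array size, tile size, and function names are assumptions, not from the original article), loop tiling, or blocking, is a classic way to exploit both forms of locality: each small tile is reused while it is still resident in the cache (temporal locality), and the rows inside a tile are walked sequentially (spatial locality).

    #include <stddef.h>

    #define N    1024
    #define TILE 32    /* a TILE x TILE block of doubles (8 KiB) fits in L1 */

    static double A[N][N], B[N][N];

    /* Naive transpose: the column-wise writes to dst have poor spatial locality. */
    void transpose_naive(double dst[N][N], double src[N][N])
    {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                dst[j][i] = src[i][j];
    }

    /* Tiled transpose: each TILE x TILE block is read and written while it is
       still cached (temporal locality), and rows within a tile are traversed
       sequentially (spatial locality). N is assumed divisible by TILE.       */
    void transpose_tiled(double dst[N][N], double src[N][N])
    {
        for (size_t ii = 0; ii < N; ii += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t j = jj; j < jj + TILE; j++)
                        dst[j][i] = src[i][j];
    }

    int main(void)
    {
        transpose_naive(B, A);
        transpose_tiled(B, A);
        return 0;
    }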

The 3 C’s of cache misses

Cache misses can be categorized into three types, known as the 3 C’s:

  1. Compulsory Misses: when data is accessed for the first time and must be loaded into the cache.

  2. Capacity Misses: when the cache is too small to hold all the data required by the CPU, leading to some data being evicted and causing misses.

  3. Conflict Misses: these occur in direct-mapped or set-associative caches when multiple memory blocks map to the same cache line or set, leading to evictions and misses.

Replacement Algorithms

When the cache is full, an existing cache line has to be evicted to make room for new data. Which line is evicted depends on the replacement algorithm in use:

  1. FIFO: The oldest data in the cache gets replaced, regardless of how recently it was used. Simple, but often performs poorly.

  2. LRU: The data that hasn’t been used for the longest time gets replaced. Most commonly used (or approximated) in multitasking environments; see the sketch after this list.

  3. LFU: The least frequently accessed data gets replaced. Can be useful in embedded scenarios with stable, repetitive access patterns.

  4. Random: A random block is replaced, which is the simplest and cheapest to implement.
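
A minimal software sketch of the LRU idea (purely illustrative; real hardware tracks recency with a few bits per line rather than full timestamps), assuming a small fully associative cache:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES 8

    struct line {
        bool     valid;
        uint32_t tag;
        uint64_t last_used;   /* "timestamp" of the most recent access */
    };

    static struct line cache[NUM_LINES];
    static uint64_t now;      /* monotonically increasing access counter */

    /* Returns true on a hit; on a miss, fills the least recently used line. */
    bool access_lru(uint32_t tag)
    {
        now++;
        int victim = 0;
        for (int i = 0; i < NUM_LINES; i++) {
            if (cache[i].valid && cache[i].tag == tag) {  /* hit */
                cache[i].last_used = now;
                return true;
            }
            if (!cache[i].valid)                          /* prefer a free line */
                victim = i;
            else if (cache[victim].valid &&
                     cache[i].last_used < cache[victim].last_used)
                victim = i;                               /* older than current victim */
        }
        cache[victim].valid = true;                       /* miss: evict and refill */
        cache[victim].tag = tag;
        cache[victim].last_used = now;
        return false;
    }

    int main(void)
    {
        /* Touch 9 distinct tags in an 8-line cache, then re-touch tag 1:
           tag 1 was the least recently used line, so it has been evicted. */
        for (uint32_t t = 1; t <= 9; t++) access_lru(t);
        printf("tag 1 again: %s\n", access_lru(1) ? "hit" : "miss");  /* miss */
        return 0;
    }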

CPU Cache Address and Tag

The cache address and tag system is used to quickly determine whether the data the CPU needs is present in the cache. This process is fundamental to how caches operate, and it works differently depending on the cache organization (fully associative, direct-mapped, or set-associative).
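
To make this concrete: a direct-mapped or set-associative cache splits each address into tag, index, and offset fields (a fully associative cache has no index field). The sketch below uses assumed parameters, not taken from any particular CPU: 32-bit addresses, 64-byte lines, and a 32 KiB direct-mapped cache, i.e. 512 lines.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 32 KiB direct-mapped cache with 64-byte lines.
       Lines  = 32768 / 64 = 512  -> 9 index bits
       Offset = log2(64)          -> 6 offset bits
       Tag    = 32 - 9 - 6        -> 17 tag bits                        */
    #define OFFSET_BITS 6
    #define INDEX_BITS  9

    int main(void)
    {
        uint32_t addr = 0x80001A44u;   /* arbitrary example address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        /* A direct-mapped lookup reads line `index` and compares its stored
           tag with `tag`; a match with the valid bit set is a cache hit.  */
        printf("tag=0x%X index=%u offset=%u\n", tag, index, offset);
        return 0;
    }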

Fully Associative vs Direct Mapped vs Set-Associative Cache

There are three main organizations that determine how data is placed and looked up:

  1. Fully Associative Cache
  • How it works: Any memory block can be stored in any cache line, giving maximum flexibility in placement.
  • Tag Comparison: The CPU compares the tag in every cache line to find a match, making this process complex and costly.
  • Use Case: Used in small caches where flexibility and avoiding misses are more important than cost.
  2. Direct Mapped Cache
  • How it works: Each memory block has only one possible location in the cache, determined by a part of its address.
  • Tag Comparison: The CPU only checks the tag for that specific line, making comparison simple and fast.
  • Drawback (Thrashing): Frequent evictions can occur if multiple memory blocks map to the same line, reducing performance.
  • Use Case: Cost-effective design, useful for simpler systems.
  3. Set-Associative Cache
  • How it works: The cache is divided into “sets.” Each block of memory can be stored in any line within its designated set.
  • Tag Comparison: The CPU checks the tags of all lines in a set, providing flexibility with moderate complexity.
  • Use Case: Most common in modern CPUs as it balances cost and performance.

NOTE: At the hardware level, cache misses can be reduced by changing the capacity, the block size, and/or the degree of set associativity; the sketch below illustrates how associativity affects conflict misses.
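
As a toy sketch of that trade-off (the geometry is invented and does not model any real CPU), alternating between two addresses that share an index misses every time in a direct-mapped cache, but hits almost every time in a 2-way set-associative cache of the same total capacity:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy geometry: 8 lines of 64 bytes.  Direct-mapped = 8 sets of 1 way;
       2-way set-associative = 4 sets of 2 ways.  LRU within each set.      */
    #define LINES      8
    #define LINE_BYTES 64

    struct way { bool valid; uint32_t tag; uint64_t last_used; };

    static uint64_t tick;

    /* Generic set-associative lookup; `ways` is the associativity. */
    static bool lookup(struct way *sets, int num_sets, int ways, uint32_t addr)
    {
        uint32_t block = addr / LINE_BYTES;
        uint32_t set   = block % num_sets;
        uint32_t tag   = block / num_sets;
        struct way *s  = &sets[set * ways];

        tick++;
        int victim = 0;
        for (int w = 0; w < ways; w++) {
            if (s[w].valid && s[w].tag == tag) { s[w].last_used = tick; return true; }
            if (!s[w].valid)
                victim = w;
            else if (s[victim].valid && s[w].last_used < s[victim].last_used)
                victim = w;
        }
        s[victim].valid = true;  s[victim].tag = tag;  s[victim].last_used = tick;
        return false;
    }

    int main(void)
    {
        static struct way dm[LINES], sa[LINES];   /* zero-initialized */
        int dm_hits = 0, sa_hits = 0;

        /* Two addresses exactly 8 lines apart: same index in both layouts. */
        uint32_t a = 0, b = LINES * LINE_BYTES;

        for (int i = 0; i < 100; i++) {
            uint32_t addr = (i % 2) ? b : a;
            dm_hits += lookup(dm, LINES, 1, addr);       /* direct-mapped     */
            sa_hits += lookup(sa, LINES / 2, 2, addr);   /* 2-way associative */
        }
        printf("direct-mapped: %d/100 hits, 2-way: %d/100 hits\n", dm_hits, sa_hits);
        return 0;
    }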

CPU Cache Flag Bits

Flag bits are used in each cache line to track the status of the data:

  • Valid Bit: Indicates if the data in the cache line is valid (usable).
  • Dirty Bit: Shows if the data has been modified. If dirty, it needs to be written back to memory before replacement.
  • LRU Bit(s): Used in caches with replacement policies to track the “least recently used” line, helping decide which data to evict when new data comes in.
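
A small illustration of how these flags might be modelled in software (field widths are assumptions; real hardware stores them as dedicated bits alongside the tag array):

    /* Metadata for one cache line in a toy 4-way set-associative model.
       Field widths are illustrative, not taken from any specific CPU.    */
    struct cache_line_meta {
        unsigned tag   : 20;  /* identifies which memory block the line holds  */
        unsigned valid : 1;   /* 1 if the line contains usable data            */
        unsigned dirty : 1;   /* 1 if the data was modified (write-back cache) */
        unsigned lru   : 2;   /* age within the set, used to pick a victim     */
    };

    /* On eviction: if dirty is set, the line's data must be written back to
       main memory before it is replaced; otherwise it can simply be dropped. */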

Cache Write Policies

A cache write policy defines when data written to the cache is propagated to main memory.

  • Write-Through: Data is written to both the cache and main memory at the same time. This is simple and ensures consistency but can be slower.
  • Write-Back: Data is written to the cache only and written to memory later, when it’s replaced. This is faster but needs the dirty bit to ensure modified data is written back.
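
A sketch of both policies in a toy software cache model (the structure and function names are invented for illustration):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    /* Toy backing store standing in for main memory: 256 lines. */
    static uint8_t main_memory[256][LINE_BYTES];

    struct cache_line {
        bool     valid;
        bool     dirty;
        uint32_t block;              /* which memory line is cached here */
        uint8_t  data[LINE_BYTES];
    };

    /* Write-through: every store updates both the cache and main memory. */
    void store_write_through(struct cache_line *l, uint32_t offset, uint8_t value)
    {
        l->data[offset] = value;
        main_memory[l->block][offset] = value;   /* memory stays consistent */
    }

    /* Write-back: the store only touches the cache and sets the dirty bit. */
    void store_write_back(struct cache_line *l, uint32_t offset, uint8_t value)
    {
        l->data[offset] = value;
        l->dirty = true;                         /* memory is now stale */
    }

    /* On eviction, a dirty line must be flushed before it can be reused. */
    void evict(struct cache_line *l)
    {
        if (l->valid && l->dirty)
            memcpy(main_memory[l->block], l->data, LINE_BYTES);
        l->valid = false;
        l->dirty = false;
    }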

Unified vs Split Caches

  • Unified Cache: Stores both instructions and data. This is more space-efficient but can lead to contention if instructions and data are frequently needed at the same time.
  • Split Cache: Divides L1 cache into separate caches for instructions and data (L1i and L1d). This speeds up access by reducing contention but requires more cache space.

Overview: Data Structure Alignment

Aligned data is stored at memory addresses that are a multiple of its size (e.g., a 4-byte integer at an address divisible by 4). Proper alignment speeds up access and reduces cache misses, because misaligned data might span multiple cache lines, increasing access time.
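
A short C illustration of how member ordering and alignment affect a structure’s size and cache-line footprint (sizes assume a typical 64-bit ABI and a 64-byte cache line):

    #include <stdalign.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Poor member order forces padding: 1 + 7 (pad) + 8 + 1 + 7 (pad) = 24 bytes
       on a typical 64-bit ABI.                                                  */
    struct padded {
        uint8_t  a;
        uint64_t b;
        uint8_t  c;
    };

    /* Same members, largest first: 8 + 1 + 1 + 6 (pad) = 16 bytes. */
    struct reordered {
        uint64_t b;
        uint8_t  a;
        uint8_t  c;
    };

    /* Aligning a hot structure to the (assumed) 64-byte cache line keeps it
       from straddling two lines; use the target CPU's actual line size.     */
    struct hot_counters {
        alignas(64) uint64_t hits;
        uint64_t misses;
    };

    int main(void)
    {
        printf("padded: %zu, reordered: %zu, cache-line aligned: %zu bytes\n",
               sizeof(struct padded), sizeof(struct reordered),
               sizeof(struct hot_counters));   /* typically 24, 16, 64 */
        return 0;
    }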

Overview: Virtual Memory Effect on CPU Cache

Virtual memory uses addresses mapped to physical memory, which can affect cache efficiency due to aliasing (where different virtual addresses refer to the same physical address). Techniques like TLBs (Translation Lookaside Buffers) and page coloring help manage this by keeping virtual memory organized and reducing unnecessary cache invalidations.

Overview: CPU Pipelining Effect on CPU Cache

In a pipelined CPU, different instructions occupy different stages (fetch, decode, execute) at the same time. If there’s a cache miss, it can stall the pipeline, slowing down all stages.

Non-blocking caches help reduce stalls by allowing the pipeline to continue executing while fetching from memory.

Overview: CPU Branch Prediction Effect on CPU Cache

Branch prediction can lead to cache pollution if predictions are incorrect. When the CPU loads instructions based on a predicted path that turns out to be wrong, it wastes cache space, potentially evicting useful data.

Prefetching only high-confidence branches can help reduce this effect.

Example: E31 Core with FreeRTOS Data Structures

In the SiFive E31 core running FreeRTOS, the operating system frequently accesses task control blocks (TCBs). When multiple tasks are running, frequently accessed TCBs stay in the cache, improving task-switching speed.

Using small, aligned TCB structures ensures fewer cache misses and better cache utilization.
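
As an illustrative sketch (the structure, its fields, and the 32-byte line size are assumptions; the layout of FreeRTOS’s own TCB is internal to the kernel), an application-level per-task structure can be kept compact and aligned so that updating it during a context switch touches a single cache line:

    #include <stdalign.h>
    #include <stdint.h>

    /* Assumed L1 line size for the target core; check the E31 datasheet. */
    #define CACHE_LINE_BYTES 32

    /* Hypothetical per-task bookkeeping kept alongside FreeRTOS tasks.  The
       hot fields live in one aligned, line-sized block, so updating them on
       a context switch touches a single cache line.                        */
    typedef struct {
        alignas(CACHE_LINE_BYTES) uint32_t run_count;   /* times scheduled   */
        uint32_t last_wake_tick;                        /* tick of last wake */
        uint32_t deadline_misses;                       /* missed deadlines  */
        uint32_t reserved[5];                           /* pad to 32 bytes   */
    } TaskStats_t;

    _Static_assert(sizeof(TaskStats_t) == CACHE_LINE_BYTES,
                   "TaskStats_t should occupy exactly one assumed cache line");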

References

  • Harris, S. L., & Harris, D. M. (2021). Digital design and computer architecture: RISC-V edition. Morgan Kaufmann.
  • Patterson, D. A., & Waterman, A. (2020). Computer organization and design RISC-V edition: The hardware software interface (2nd ed.). Morgan Kaufmann.
  • Hennessy, J. L., & Patterson, D. A. (2017). Computer architecture: A quantitative approach (6th ed.). Morgan Kaufmann.