
CPU Cache

#systems

What is CPU cache

CPU cache is a small but high-speed memory located on or near the processor core. It is designed to store frequently accessed data and instructions, allowing the CPU to retrieve them far more quickly than from main memory. Modern CPUs have multiple levels of cache (L1, L2, and L3), each trading off size against speed to optimize data access and overall system performance.

Why CPU cache

The main purpose of CPU cache is to reduce the time it takes for the CPU to retrieve data from main memory by storing recently or frequently accessed data closer to the CPU. Since cache memory is significantly faster than main memory, using cache reduces the latency associated with memory access. This reduction in latency boosts the CPU's ability to process instructions efficiently, minimizing bottlenecks that arise due to slower main memory access speeds.

The memory hierarchy

Memory hierarchy refers to the structured arrangement of storage within a computer system, organized by speed, size, and cost.

The hierarchy goes as follows (from fastest to slowest):

  1. Registers

    • Fastest and smallest storage located within the CPU
  2. L1i/d, L2 and L3 Cache

    • Multi-level caches that provide fast access to frequently used data, with L1 the fastest and smallest and L3 the slowest and largest
  3. Main memory (RAM)

    • Larger but slower memory used to store data and instructions currently in use by the system.
  4. Long-term storage (flash, spinning disk)

    • Non-volatile storage that holds data long-term; access is slower than RAM
  5. Remote storage (Internet)

    • Usually slower than local long-term storage due to network latency and the overhead of communication between computers

So, cache memory sits between the CPU and the main memory, serving as a fast buffer that stores copies of frequently accessed instructions and data, thus reducing the time it takes for the CPU to retrieve data.

Before CPU cache

Before the introduction of CPU caches, processors had to fetch data directly from main memory for every operation. This process was slow and inefficient due to the significant speed difference between the CPU and main memory. As CPU speeds increased, this disparity became more pronounced, leading to performance bottlenecks.

Comparison of speed

Here’s a comparison of typical access times across different levels of the memory hierarchy. Note that these values are hardware-dependent, so this is just to give a general idea:

Memory Type         | Access Time  | Cycles
--------------------|--------------|-------------------------------
Registers           | 0.2 - 1 ns   | 1 cycle
L1 cache            | 0.5 - 3 ns   | 1 - 3 cycles
L2 cache            | 5 - 20 ns    | 15 - 60 cycles
L3 cache            | 10 - 40 ns   | 30 - 120 cycles
Main memory (DRAM)  | 50 - 100 ns  | 150 - 300 cycles
Solid-state drive   | 10 - 100 μs  | 30,000 - 300,000 cycles
Hard-disk drive     | 5 - 20 ms    | 15,000,000 - 60,000,000 cycles

Cache memory provides much faster access times compared to main memory, which is why it is crucial for reducing latency and improving the performance of the CPU.

How CPU cache works

Cache lines, cache hits and misses

CPU cache operates by storing copies of data from frequently accessed main memory locations. When the CPU needs to access data, it first checks whether the data is available in the cache. If the data is found (cache hit), it is returned quickly. If the data is not found (cache miss), the CPU must fetch the data from the slower main memory.

Modern CPUs don't just fetch the exact data needed for a particular instruction when accessing memory. Instead, they fetch an entire cache line.

A cache line is a contiguous block of memory, typically ranging from 32 to 128 bytes in size, depending on the processor's architecture. The idea is that when one piece of data is accessed, the data adjacent to it is also likely to be accessed soon (this is known as spatial locality).

By fetching the entire cache line, the CPU increases the chances of future accesses resulting in cache hits, thereby reducing the need for further memory accesses.

For example, when accessing an element of an array, it is highly probable that nearby elements will also be accessed shortly. By loading the entire cache line, the CPU can serve future requests directly from the cache, which significantly speeds up data retrieval.

This strategy enhances the overall efficiency of the cache system by reducing the number of cache misses, particularly the compulsory misses (those that occur the first time a data item is accessed).

Cache temporal locality and spatial locality

Cache performance is heavily influenced by the principles of temporal locality and spatial locality:

Temporal Locality: recently accessed data is likely to be accessed again soon. Caches take advantage of temporal locality by storing recently accessed data, anticipating that the CPU will need it again.

Spatial Locality: data located near recently accessed data is likely to be accessed soon. Caches utilize spatial locality by storing blocks of data that include the requested information and adjacent memory addresses.

The 3 C's of cache misses

Cache misses can be categorized into three types, known as the 3 C's:

  1. Compulsory Misses: when data is accessed for the first time and must be loaded into the cache.

  2. Capacity Misses: when the cache is too small to hold all the data required by the CPU, leading to some data being evicted and causing misses.

  3. Conflict Misses: these occur in set-associative or direct-mapped caches when multiple data items compete for the same cache line, leading to evictions and misses.

Replacement Algorithms

When the cache is full, an existing entry has to be evicted to make room for new data. Which entry is chosen depends on the replacement algorithm in use:

  1. FIFO: The oldest data gets replaced, regardless of how recently it was used. Simple, but often evicts data that is still in active use.

  2. LRU: The data that hasn't been used for the longest time gets replaced. Most commonly used in multitasking environments.

  3. LFU: The least frequently accessed data gets replaced. Can be useful in embedded scenarios with stable, repetitive access patterns.

  4. Random: A random block is replaced, which is the simplest and cheapest to implement in hardware.

CPU Cache Address and Tag

The cache address and tag system is used to quickly determine whether the data the CPU needs is available (or not) in the cache. This process is fundamental to how caches operate, and it works differently depending on the type of cache organization algorithm (fully associative, direct mapped, or set-associative).

Fully Associative vs Direct Mapped vs Set-Associative Cache

There are three main algorithms that determine how data is stored and retrieved:

  1. Fully Associative Cache

    • A memory block can be placed in any cache line. Flexible, but checking every line in parallel requires expensive comparison hardware.
  2. Direct Mapped Cache

    • Each memory block maps to exactly one cache line. Cheap and fast to look up, but prone to conflict misses.
  3. Set-Associative Cache

    • A compromise: each block maps to one set of N lines and can occupy any line within that set.

NOTE: At the hardware level, cache misses can be reduced by changing capacity, block size, and/or N-set associativity.

CPU Cache Flag Bits

Flag bits are used in each cache line to track the status of the data:

  • Valid bit: indicates whether the line holds meaningful data or is empty/stale.
  • Dirty bit: indicates that the line has been modified in the cache but not yet written back to main memory (used by write-back caches).

Cache Write Policies

A cache write policy defines when data written to the cache is propagated to main memory:

  • Write-through: every write updates both the cache and main memory immediately. Simpler, but generates more memory traffic.
  • Write-back: writes update only the cache; a modified (dirty) line is written to main memory when it is evicted. Faster, but requires a dirty bit to track modified lines.

Unified vs Split Caches

A unified cache stores both instructions and data in the same structure, while a split cache keeps them in separate structures (the L1i and L1d caches mentioned above). Splitting the L1 lets instruction fetch and data access proceed in parallel in a pipelined CPU; L2 and L3 are typically unified.

Overview: Data Structure Alignment

Aligned data means data is stored at memory addresses that match its size (e.g., 4-byte integer at an address divisible by 4). Proper alignment speeds up access and reduces cache misses because misaligned data might span multiple cache lines, increasing access time.

Overview: Virtual Memory Effect on CPU Cache

Virtual memory uses addresses mapped to physical memory, which can affect cache efficiency due to aliasing (where different virtual addresses refer to the same physical address). Techniques like TLBs (Translation Lookaside Buffers) and page coloring help manage this by keeping virtual memory organized and reducing unnecessary cache invalidations.

Overview: CPU Pipelining Effect on CPU Cache

In pipelined CPUs, each instruction phase (fetch, decode, execute) happens in parallel. If there’s a cache miss, it can stall the pipeline, slowing down all stages.

Non-blocking caches help reduce stalls by allowing the pipeline to continue executing while fetching from memory.

Overview: CPU Branch Prediction correlation with CPU Cache

Branch prediction can lead to cache pollution if predictions are incorrect. When the CPU loads instructions based on a predicted path that turns out to be wrong, it wastes cache space, potentially evicting useful data.

Prefetching only high-confidence branches can help reduce this effect.

Example: E31 Core with FreeRTOS Data Structures

In the SiFive E31 core running FreeRTOS, the operating system frequently accesses task control blocks (TCBs). When multiple tasks are running, frequently accessed TCBs stay in the cache, improving task-switching speed.

Using small, aligned TCB structures ensures fewer cache misses and better cache utilization.
