
Cache Modeling: Prefetching

2025-09-20 · Arsh Sharma
  1. Introduction
  2. Prefetching Strategies
  3. Prefetcher Knobs
  4. Metrics of Prefetcher Effectiveness
  5. Implementation
  6. Results

Introduction

Prefetching is a technique for improving cache performance: it predicts future memory accesses and loads the data into the cache before the CPU actually requests it. Done well, this reduces cache misses and improves overall system performance.

Hardware data prefetching works by predicting the memory access pattern of the program and speculatively issuing prefetch requests to the predicted memory addresses before the program accesses those addresses.[1] Prefetching can effectively hide the latency of memory accesses and improve performance if the prefetches are accurate (the program actually goes on to access the prefetched addresses) and timely (the data arrives in the cache before the demand access).

I like to explain this with the example of ordering food on the way home from work.

If you order just as you reach home, you have to wait for the food to arrive. But if you order while leaving work, the food will be waiting for you by the time you reach home.

But prefetching can also hurt performance: inaccurate prefetches waste memory bandwidth, and prefetched blocks can evict useful data from the cache (cache pollution), turning future hits into misses.

Prefetching Strategies

There are several prefetching strategies, including:

- Next-line prefetching: on an access to block X, prefetch block X+1.
- Stride prefetching: detect a constant stride between successive accesses (e.g., walking an array with a fixed element size) and prefetch that stride ahead.
- Stream prefetching: identify streams of nearby accesses and prefetch along them.

The sketch after this list shows the address computation for the first two.
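
As a minimal sketch of the address computations involved (blockSize and the detected stride are assumed inputs here, not the post's actual code):

#include <cstdint>

// Next-line: on an access to addr, target the immediately following block.
uint32_t nextLineTarget(uint32_t addr, uint32_t blockSize) {
    return addr + blockSize;
}

// Stride: once a constant stride has been detected for an access stream,
// target the address one stride ahead of the current access.
uint32_t strideTarget(uint32_t addr, int32_t stride) {
    return addr + stride;
}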

Prefetcher Knobs

The aggressiveness of a prefetcher can be controlled by:

- Prefetch degree: how many blocks are prefetched in response to a single triggering access.
- Prefetch distance: how far ahead of the demand access stream the prefetched addresses are.

Based on the above knobs, a prefetcher can be tuned from conservative (low degree, short distance) to aggressive (high degree, long distance), as the sketch below illustrates.
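
As a minimal sketch, a prefetcher with degree N and distance D issues N block addresses starting D blocks ahead of the triggering access (the function below is illustrative, not part of our Cache class):

#include <cstdint>
#include <vector>

// Compute the prefetch targets for one triggering access.
// degree   = number of blocks to prefetch per trigger
// distance = how many blocks ahead of the access to start
std::vector<uint32_t> prefetchTargets(uint32_t addr, uint32_t blockSize,
                                      uint32_t degree, uint32_t distance) {
    std::vector<uint32_t> targets;
    for (uint32_t i = 0; i < degree; ++i)
        targets.push_back(addr + (distance + i) * blockSize);
    return targets;
}

// With degree = 1 and distance = 1 this degenerates to next-line prefetching.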

Metrics of Prefetcher Effectiveness

$$\text{Prefetch Accuracy} = \frac{\text{Number of Useful Prefetches}}{\text{Number of Prefetches Sent to Memory}}$$

where Number of Useful Prefetches is the number of prefetched cache blocks that are later used by demand requests.

$$\text{Prefetch Lateness} = \frac{\text{Number of Late Prefetches}}{\text{Number of Useful Prefetches}}$$

$$\text{Prefetcher-Generated Cache Pollution} = \frac{\text{Number of Demand Misses Caused by the Prefetcher}}{\text{Number of Demand Misses}}$$
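
As a sketch of how these metrics might be tracked in the model (the counter names below are hypothetical, not fields of our existing Cache class):

#include <cstdint>

// Hypothetical counters for evaluating a prefetcher, incremented as the
// simulation observes prefetch and demand traffic.
struct PrefetchStats {
    uint64_t prefetchesSent = 0;   // prefetches issued to memory
    uint64_t usefulPrefetches = 0; // prefetched blocks later hit by demand
    uint64_t latePrefetches = 0;   // useful, but the demand access arrived first
    uint64_t demandMisses = 0;     // all demand misses
    uint64_t pollutionMisses = 0;  // demand misses on blocks evicted by prefetches

    double accuracy() const {
        return prefetchesSent ? double(usefulPrefetches) / prefetchesSent : 0.0;
    }
    double lateness() const {
        return usefulPrefetches ? double(latePrefetches) / usefulPrefetches : 0.0;
    }
    double pollution() const {
        return demandMisses ? double(pollutionMisses) / demandMisses : 0.0;
    }
};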

Implementation

If you recall, we implemented a generic Cache class in the previous blog post. We can extend that class to include a prefetch function.

void prefetch(uint32_t address_ptr, char* i_or_d, char* r_or_w);

The prefetch function implements simple next-line prefetching: it computes the address of the block immediately following the current access,

address_ptr + blockSize

and brings that block into the cache if it is not already present.
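
As a minimal sketch of that logic, assuming the Cache class exposes some way to test for a block's presence and to fill it from the next level (contains and fillFromLowerLevel are hypothetical helper names, not the actual interface from the previous post):

void Cache::prefetch(uint32_t address_ptr, char* i_or_d, char* r_or_w) {
    // Next-line prefetch: target the block immediately after this access.
    uint32_t target = address_ptr + blockSize;

    // Issue the prefetch only if the block is not already resident;
    // the resulting fill is not counted as a demand miss.
    if (!contains(target))                          // hypothetical helper
        fillFromLowerLevel(target, i_or_d, r_or_w); // hypothetical helper
}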

Then, for each access to the cache, we can call the prefetch function:

if (prefetch) iCache->prefetch(pc, &i_or_d, &r_or_w);      // prefetch the next instruction block
if (prefetch) dCache->prefetch(address, &i_or_d, &r_or_w); // prefetch the next data block

Results

Using the prefetch function for both the I-cache and the D-cache, we can benchmark our cache model with and without prefetching. Using the same traces as before, we can compare the following metrics:

| Metric | No Prefetching | Next Line PF |
| --- | --- | --- |
| total I-cache accesses | 15,009,377 | 15,009,377 |
| total I-cache misses | 30,990 | 16,009 |
| total I-cache penalties | 1,952,100 | 922,450 |
| I-cache miss rate | 0.21% | 0.11% |
| avg I-cache access time | 2.13 cycles | 2.06 cycles |
| total D-cache accesses | 4,990,623 | 4,990,623 |
| total D-cache misses | 37,896 | 9,964 |
| total D-cache penalties | 5,084,000 | 718,400 |
| D-cache miss rate | 0.76% | 0.20% |
| avg D-cache access time | 3.02 cycles | 2.14 cycles |
| total L2-cache accesses | 68,886 | 113,925 |
| total L2-cache misses | 35,918 | 38,085 |
| total L2-cache penalties | 3,591,800 | 3,808,500 |
| L2-cache miss rate | 52.14% | 33.43% |
| avg L2-cache access time | 102.14 cycles | 83.43 cycles |
| AMAT | 2.35 cycles | 2.08 cycles |

Even with the simplest strategy, next-line prefetching, we see a significant improvement in the overall AMAT of the system, from 2.35 cycles to 2.08 cycles.
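
The AMAT rows are consistent with taking the access-weighted average of the I-cache and D-cache average access times (an assumption, but one the table's numbers confirm):

$$\text{AMAT}_{\text{no PF}} = \frac{15{,}009{,}377 \times 2.13 + 4{,}990{,}623 \times 3.02}{20{,}000{,}000} \approx 2.35 \text{ cycles}$$

$$\text{AMAT}_{\text{next-line PF}} = \frac{15{,}009{,}377 \times 2.06 + 4{,}990{,}623 \times 2.14}{20{,}000{,}000} \approx 2.08 \text{ cycles}$$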

[1] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007. https://hps.ece.utexas.edu/pub/srinath_hpca07.pdf