Speaker: Chi-Keung Luk, Carnegie Mellon University and University of Toronto
Date: March 11, 1999
Cooperative Prefetching: Compiler and Hardware Support for Effective Instruction Prefetching in Modern Processors
Instruction cache miss latency is becoming an increasingly important performance bottleneck, especially for commercial applications. Although instruction prefetching is an attractive technique for tolerating this latency, we find that existing prefetching schemes are insufficient for modern superscalar processors since they fail to issue prefetches early enough (particularly for non-sequential accesses). To overcome these limitations, we propose a new instruction prefetching technique whereby the hardware and software cooperate to hide the latency as follows. The hardware performs aggressive sequential prefetching combined with a novel prefetch filtering mechanism to allow it to get far ahead without polluting the cache. To hide the latency of non-sequential accesses, we propose and implement a novel compiler algorithm which automatically inserts instruction-prefetch instructions into the executable to prefetch the targets of control transfers far enough in advance. Our experimental results demonstrate that this new approach results in speedups ranging from 9.4% to 18.5% (13.3% on average) over the original execution time on an out-of-order superscalar processor, which is more than double the average speedup of the best existing schemes (6.5%). This is accomplished by hiding an average of 71% of the original instruction stall time, compared with only 36% for the best existing schemes. We find that both the prefetch filtering and compiler-inserted prefetching components of our design are essential and complementary, that the compiler can limit the code expansion to less than 10% on average, and that our scheme is robust with respect to variations in miss latency and bandwidth.
This is joint work with Todd C. Mowry.
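As a rough illustration of the division of labor described in the abstract, the toy model below counts instruction-cache misses for a short trace of cache-line addresses under three settings: no prefetching, sequential next-line prefetching (the hardware side), and sequential prefetching plus an explicit prefetch of a control-transfer target (the compiler side). It is only a sketch under strong assumptions: prefetches complete instantly, so latency and timeliness are not modeled, and the prefetch filtering mechanism is omitted because the abstract does not describe its design. The names `ICache`, `count_misses`, `seq_depth`, and `sw_prefetch`, as well as the trace itself, are hypothetical and not taken from the talk.

```python
from collections import OrderedDict


class ICache:
    """Tiny fully associative LRU cache of instruction-line addresses (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, line):
        """Bring `line` in (or refresh it); return True if it was already present."""
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)
        else:
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
        return hit


def count_misses(trace, capacity=8, seq_depth=0, sw_prefetch=None):
    """Count demand misses for a trace of line addresses.

    seq_depth   -- hypothetical hardware knob: how many next sequential lines
                   are prefetched on every fetch.
    sw_prefetch -- hypothetical stand-in for compiler-inserted prefetches:
                   maps a line to the target line prefetched when it is fetched.
    Assumption: prefetches arrive instantly; no filter is modeled.
    """
    cache = ICache(capacity)
    sw_prefetch = sw_prefetch or {}
    misses = 0
    for line in trace:
        if not cache.touch(line):
            misses += 1
        for k in range(1, seq_depth + 1):   # sequential (hardware) prefetches
            cache.touch(line + k)
        if line in sw_prefetch:             # explicit (software) prefetch
            cache.touch(sw_prefetch[line])
    return misses


# Straight-line code, a call to a distant routine at line 100, then a return.
trace = [0, 1, 2, 3, 100, 101, 102, 4, 5]

baseline = count_misses(trace)
sequential = count_misses(trace, seq_depth=2)
cooperative = count_misses(trace, seq_depth=2, sw_prefetch={2: 100})
print(baseline, sequential, cooperative)
```

In this toy trace, sequential prefetching removes the misses on consecutive lines but still misses on the jump to line 100, while the explicit target prefetch issued two lines earlier covers that non-sequential access as well. The counts say nothing about the speedups reported in the abstract, which depend on timing that this model leaves out.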
Chi-Keung Luk is a Ph.D. candidate in the Department of Computer Science at the University of Toronto, and is currently a visiting scholar at Carnegie Mellon University. He received his B.Sc. (First Class Honors) and M.Phil. degrees in computer science, both from The Chinese University of Hong Kong. His research interests span computer architecture, compilers, and programming languages, with a focus on the memory performance of non-numeric applications. He has been awarded a Canadian Commonwealth Fellowship, an IBM CAS Fellowship, and a Croucher Foundation Fellowship.