
Streaming Prefetch *

O. Temam

Versailles University

45 Av. des Etats-Unis

78000 Versailles

France

Phone: 33-1-39-25-43-41

Fax: 33-1-39-25-40-57

Email: temam@prism.uvsq.fr

Abstract

In most commercial processors, data prefetching has been disregarded as a potentially effective solution for hiding cache misses, multi-level caches being widely preferred. However, multi-level caches are mostly effective at removing capacity and conflict misses, while prefetching is particularly efficient at removing compulsory misses, especially for the regular accesses found in numerical codes. One of the main flaws of prefetching, which strongly limits its popularity in current processors, is that it can degrade global cache performance. Wrong address predictions are the primary cause of cache pollution as well as of additional memory requests. All existing prefetch schemes are impaired by wrong predictions because they speculate on the next address to be referenced. In this paper, we show that all the information required to remove the speculative aspect of prefetching can be easily obtained from the compiler, resulting in nearly no wrong predictions. Even when address prediction is flawless, prefetching can be hazardous to the cache because cache checks (required before sending a prefetch request, to limit memory traffic) and cache reloads of incoming prefetch requests can result in cache stalls and thus processor stalls, particularly in superscalar processors where the cache may be accessed every cycle. In this paper, we show that addressing these implementation issues can make a prefetching scheme nearly transparent to normal cache operations. We have combined software-assisted address prediction with dedicated hardware support and obtained a prefetching scheme, called streaming prefetch, in which data can flow through the cache nearly without disruption.
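To make the kind of access pattern meant here concrete, the sketch below shows a stride-1 numerical loop in which the addresses that will be referenced next are fully determined at compile time, so a prefetch can be issued without speculation. The prefetch distance PF_DIST and the use of the GCC/Clang __builtin_prefetch intrinsic are illustrative assumptions only; they are not the streaming prefetch mechanism described in this paper.

    /* Sketch (not the paper's mechanism): for the regular accesses of
       numerical codes, the stride of each reference is known statically,
       so the prefetch address can be computed exactly rather than guessed. */
    #include <stddef.h>

    #define PF_DIST 16  /* hypothetical prefetch distance, in elements */

    /* daxpy-like kernel: y[i] += a * x[i]. Both streams have a constant
       stride of 1, which a compiler can determine from the loop bounds
       and subscripts alone -- no run-time address speculation is needed. */
    void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            /* The next addresses to be touched are exactly &x[i + PF_DIST]
               and &y[i + PF_DIST]; a compiler-assisted scheme can issue
               these as non-speculative prefetches. __builtin_prefetch is
               used here only to make the idea concrete. */
            if (i + PF_DIST < n) {
                __builtin_prefetch(&x[i + PF_DIST], 0 /* read */);
                __builtin_prefetch(&y[i + PF_DIST], 1 /* write */);
            }
            y[i] += a * x[i];
        }
    }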

Keywords: prefetching, memory architecture co-design, cache, software-assisted caches, data locality, numerical codes.

Workshop: W20, Instruction-Level Parallelism.

Part of this work was done while the author was at IRISA, Rennes, France, and at the University of Leiden, Leiden, The Netherlands.

* This work was supported by the Esprit Agency DG XIII under Grant No. APPARC 6634 BRA III.