Technical Report TRCS97-19: Accurate Indirect Branch Prediction
In C++ and Java programs, indirect branches occur with even higher frequency (see Table 1). These languages promote a polymorphic programming style in which late binding of subroutine invocations is the main instrument for modular code design. Virtual function tables, the implementation of choice for most C++ and Java compilers, execute an indirect branch for every polymorphic call. The C++ programs studied here execute an indirect branch as frequently as once every 50 instructions; other studies [CGZ94] have shown similar results. Some of the C++ programs in Table 1 execute only 6 conditional branches for every indirect branch.
Predicated instructions [M+94] further increase the importance of indirect branch prediction since they remove conditional branches and thus conditional branch misses. For example, Intel expects predication to reduce the number of conditional branches by half for the IA-64 architecture ([Intel97]). With indirect branches becoming more frequent relative to conditional branches, and being mispredicted much more often, indirect branch misses can come to dominate the overall branch misprediction cost. For example, if indirect branches are mispredicted 12 times more frequently (a 36% vs. 3% miss ratio), indirect branch misses dominate conditional branch misses as long as indirect branches occur more often than once per 12 conditional branches.
As the relevance of indirect branches grows, so does the opportunity for more sophisticated prediction mechanisms. In the next decade, uniprocessors may reach one billion transistors, with 48 million transistors dedicated to branch prediction ([P+97]).
In this study, we explore the design space of prediction mechanisms dedicated exclusively to indirect branches. Since the link between misprediction rate and execution overhead has been demonstrated in [CHP97], we focus on minimizing the branch misprediction rate. Initially, we assume unlimited hardware resources so that results are not obscured by implementation artifacts such as interference in tagless tables. We then progressively introduce hardware constraints, each of which causes a new type of interference and a corresponding performance loss, until we obtain implementable predictors. Finally, the practical predictors are combined pairwise into hybrid predictors, further improving prediction accuracy.
Our main benchmark suite consists of large object-oriented C++ applications, each ranging from 8,000 to over 75,000 non-blank lines of C++ code (see Table 1), and beta, a compiler for the Beta programming language ([MMN93]), written in Beta. We also measured the SPECint95 benchmark suite (see Table 2), with the exception of compress, which executes only 590 branches during a complete run. Together, the benchmarks represent over 500,000 non-comment source lines.
All C and C++ programs except self1 were compiled with GNU gcc 2.7.2 (options -O2 -multrasparc plus static linking) and run under the shade instruction-level simulator [CK93] to obtain traces of all indirect branches. Procedure returns were excluded because they can be predicted accurately with a return address stack [KE91]. All programs were run to completion or until six million indirect branches had been executed.2 For jhm and self, we excluded the initialization phases by skipping the first 5 and 6 million indirect branches, respectively.
1 self does not execute correctly when compiled with -O2 and was thus compiled with -O optimization. Also, self was not fully statically linked; our experiments exclude instructions executed in dynamically-linked libraries.