
method. Three serious limitations of this approach are: 1) the complex overhead patterns inherent in the program and the architecture may not be explicitly and precisely captured by the term t_s in (1); 2) the model assumes the computation load is balanced; and 3) the model cannot be used to evaluate performance when the problem is scaled.
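
Equation (1) appears earlier in the paper and is not reproduced here; purely as an illustrative assumption, a typical fixed-size (Amdahl-style) model of this kind lumps all non-parallelizable work and overhead into a single serial term t_s:

\[
  S(p) \;=\; \frac{t_s + t_p}{t_s + t_p/p},
\]

where t_p denotes the time of the perfectly parallelizable part on one processor. A single constant term of this kind cannot represent overheads that vary with the program structure, the architecture, and the number of processors, which is the essence of limitation 1).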

Memory-bound scaling model. Scaling the size of a problem is an important operation in large-scale scientific computations. The memory-bound scaling measurement scales the problem so that the computation load and memory allocation on each processor are large enough to achieve good parallel performance. The limitation of this approach is that the resulting execution time and memory allocation may be unacceptably large in practice. An example of this scaling method is found in [3]. We will examine other limitations of this model later in the paper.
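
As an illustrative sketch (not the formulation used in [3]): let M be the memory available on each processor and let g map memory capacity to the amount of work that fits in it. Memory-bound scaling chooses the scaled work W' so that the aggregate memory of p processors stays full:

\[
  W' = g(p\,M), \qquad W = g(M) \ \text{on one processor.}
\]

For example, in a dense matrix computation whose memory grows as n^2 while its work grows as n^3, we obtain W' proportional to (pM)^{3/2}, so the work per processor grows as p^{1/2} and the scaled execution time grows without bound, which is exactly the limitation noted above.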

Time-bound scaling model. This method scales the problem so that the execution time is kept constant as the number of processors increases [5]. This approach bounds the execution time but may not apply to all parallel programs in practice. In later sections, we will also show another limitation of this model in measuring scalability.
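
A minimal sketch of this constraint, in our own notation rather than that of [5]: write T(W, p) for the execution time of a problem of size W on p processors, and choose the scaled size W' so that

\[
  T(W', p) \;=\; T(W, 1),
\]

or equal to any other fixed time budget. This equation implicitly defines W' as a function of p; when the parallel overhead grows quickly with p, the feasible W' grows only slowly, which is one reason the approach may not apply to every parallel program.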

Efficiency-conserved model. This method scales the problem so that the parallel computing efficiency is kept constant as the number of processors increases. This model is the foundation of the isoefficiency function [2]. We will show that this model is fair and reasonable for measuring scalability.
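
In our notation, with T_1(W) the sequential execution time and T_p(W) the parallel execution time on p processors, this model scales the problem so that

\[
  E(W, p) \;=\; \frac{T_1(W)}{p\,T_p(W)} \;=\; E_0
\]

for some fixed efficiency E_0 between 0 and 1; the scaled problem size W(p) is the solution of this equation for each p.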

So far, most scalability studies have considered the input data size of a problem as the only problem-size parameter. In practice, the growth and scaling of an application problem is more complicated. In many large scientific simulations of physical phenomena, more than one parameter is used to change the size of the problem. The execution behavior of a program with a single input parameter may differ significantly from that of a program with multiple input parameters, because of different program execution structures. In addition, the way the multiple input parameters are changed can significantly affect the scaling of execution characteristics. In this paper we present an application program with multiple input parameters for the scalability study.
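
As a hypothetical illustration (the parameters here are not those of the application in Section 4): suppose a simulation has two size parameters, a grid dimension n and a particle count m, with per-step work of the form

\[
  W(n, m) \;=\; a\,n^3 + b\,m .
\]

Doubling the total work by scaling only n, only m, or both together changes the relative weight of the two phases, and with it the memory requirement, the communication pattern, and the overhead that dominates at large numbers of processors.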

The organization of this paper is as follows. Section 2 gives an overview of the two currently available metrics and evaluates their merits and limitations. We present our latency metric in Section 3. Section 4 presents an application program, a large physics simulation, and its scaling methods for testing the latency metric and the different measurement models. Section 5 presents a comparative evaluation of the metrics using the different measurement models. The physics simulation experiments were run on the KSR-1 multiprocessor system and were supported by S-Graph, a software visualization environment. Finally, we summarize the work in Section 6.

2 Scalability metrics from efficiency and speed

The definition of scalability derives from Amdahl's law, which is tied to efficiency and speedup. There are two important scalability metrics: the isoefficiency function, based on parallel computing efficiency [2], and the isospeed metric, based on parallel computing speed [6]. The isoefficiency function of a parallel system is determined by expressing the size of a computing problem as a function of the number of processors, subject to maintaining a desired parallel efficiency (between 0 and 1).
The isoefficiency function first captures, in a single expression, the effects of the characteristics of the parallel algorithm as well as of the parallel architecture on which it is implemented. In addition, the isoefficiency function shows that it is necessary to vary the size of a problem on a parallel architecture whose size can be changed, so that the processing efficiency of each processor can remain constant. However, this metric has two limitations for the experimental evaluation of scalability. First, analytical forms of the program and architecture overhead patterns on a shared-memory architecture may not be as easy to model as on a distributed-memory architecture. This is mainly because the computing processes involved in a shared-memory system include process scheduling, cache coherence, and other low-level program- and architecture-dependent operations, which are more complicated than message passing on a distributed-memory system. It would therefore be difficult to use the isoefficiency metric to precisely evaluate the scalability of a program running on a shared-memory system in practice. Second, the metric may not be used to measure the scalability of an algorithm-architecture combination through machine measurements. Of course, experiments can be used to verify the analytical isoefficiency function for an algorithm on a specific architecture. We believe this metric is more appropriate for evaluating the scalability of parallel algorithms.
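
For reference, in the commonly used formulation (our notation may differ slightly from that of [2]): if W is the problem size measured as the sequential work T_1 and T_o(W, p) = p T_p - T_1 is the total overhead, then E = 1/(1 + T_o(W, p)/W), and holding E fixed gives the isoefficiency relation

\[
  W \;=\; \frac{E}{1 - E}\, T_o(W, p) \;=\; K\, T_o(W, p).
\]

Solving this relation for W as a function of p yields the isoefficiency function; the first limitation above is precisely that an analytical form of T_o(W, p) is hard to obtain for a shared-memory machine.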

Sun and Rover [6] take another approach to algorithm-machine combinations. Their metric starts by defining an average unit speed. Scalability is then defined as the average increase in the amount of work on each processor needed to keep its speed constant when the size of the parallel architecture increases from N to N' processors. This metric provides
more information about architectures and programs. However, we believe the isospeed metric has two limitations for precisely measuring and evaluating the scalability of the application program and the architecture. First, some non-floating-point operations can cause major performance changes. For example, a single assignment to a shared variable on a cache-coherent shared-memory system may generate a sequence of remote memory/cache accesses and data invalidations, yet this type of operation is excluded from the measurement of scalability. Second, latency is included in the total execution time in the metric, but it is not reflected in the amount of work, W, used in the scalability metric. In practice, the execution overhead caused by the interconnection network