A Program-driven Simulation Model of an
Department of Computer Engineering, Lund University
P.O. Box 118, S-221 00 Lund, Sweden
A simulation model that supports very accurate modeling of multiprocessors with a hierarchical, packet-switched interconnection network and private caches is explored. The simulation system contains workload simulators and a memory system simulator. The workload simulators are program-driven, i.e. they actually execute programs. The time unit of the simulator is the time between two consecutive memory references from the processors. The performance of the simulation model is acceptable, although it could be improved using a trace-driven approach. We show that results obtained from trace-driven simulation methods in the course of multiprocessor performance evaluation are generally not valid. Furthermore, we show that in the evaluation of certain processor architectural features, such as non-blocking architectures, the program-driven approach is necessary.
1 Introduction

Shared-memory multiprocessors constitute an important class of computers addressing the ever-growing need for increased computing power. As new architectures are proposed, a need appears for evaluating the performance of the interconnection network, the memory system, and the cache coherence protocol. This can be done by building prototypes, by simulation, or by analytic methods. Building prototypes of such systems is not feasible, and it is practically impossible to derive accurate analytic models. Therefore, it is commonly agreed that simulation is the only feasible approach.
Traditionally, simulators are trace-driven. This means that the simulation is based on memory references recorded when running the program [Fer78]. This method is extensively used to evaluate uniprocessor caches, see e.g. [Smi87], and there are trace-capturing methods that include references from system calls and the operating system [ASH86]. Trace-driven simulation has also been used to analyze multiprocessors [EK89].
Several problems of the trace-driven approach when exploring multiprocessors are pointed out by Bitar [Bit90]. The traces suffer from being dependent on the host architecture, which may lead to incorrect performance results. When simulating a new architecture, the order of events, the latency of the network, and the number of processors might be completely different from those of the trace host, which makes the traces useless. Synchronization skewing, for example, may give rise to unreal sharing.
Another problem with trace-driven simulation arises when exploring a non-blocking processor with a lockup-free cache [SD88]. Data dependency constitutes an important limit on the increased performance of non-blocking processors [DS90]. The fact that the traces must contain information on data dependency, e.g. between registers, complicates the trace-capturing process and restricts the flexibility of this kind of simulation.
This report presents the design and implementation of a program-driven simulator of an MIMD multiprocessor [Fly66]. In a program-driven simulator, the processing elements execute real code and can thereby run applications. Since every action in the processors is simulated, the multiprocessor can be accurately modeled, and the problems with architecture dependency no longer exist. Synchronizations and the actions due to processor communication can be accurately analyzed, and complete knowledge about the data dependency is obtained. A disadvantage of this type of simulator is that it is execution-intensive and therefore very time-consuming. The objective of the simulator approach reported in this paper has therefore been both accuracy and efficiency. The approach has shown to be advantageous when analyzing synchronization traffic and cache coherence schemes on a fine-grain basis, and when analyzing lockup-free caches.
As a case study, a parallel simulator model of the Data Diffusion Machine (DDM) [WH88] has been implemented, written in Ada. The DDM is highly parallel (a few hundred to one thousand processors) and maintains cache coherence through a cache consistency protocol, facts that made it interesting as the case study for this project. The simulator executes code, and programs have been successfully executed and the machine/program behavior analyzed. The size of the simulated machine, i.e. the number of processing elements, and their configuration are adjustable through parameters, as are the sizes of the memories. Therefore, a large design space of the base architecture can be evaluated in a reasonable amount of time.
Another application of this simulation approach has been the study of non-blocking processors and lockup-free caches for uniprocessors [Kro81]. A simulator of a non-blocking processor was implemented and used as the workload.
The implementation of the simulator is highly parallel, and efforts have been made to reduce time-consuming interprocess communication.
The outline of the paper is the following. Section 2 describes different approaches to simulating the workload and managing simulated time. Section 3 describes the simulator structure of the DDM, section 4 the performance of this simulator, and section 5 the validation of this type of simulation method. Applications of the simulator are found in section 6, and conclusions and discussion in section 7.
2 Simulator design approaches
This section describes different simulation techniques. Section 2.1 describes different approaches to simulating the workload, and section 2.2 different methods to manage simulated time.
2.1 Simulation of the workload
There are different types of models which can be used to simulate a multiprocessor's workload. A first distinction is between deterministic and stochastic models. In a stochastic model at least one variable is random; e.g., the processing elements generate transactions randomly based on statistics on reference locality etc. There are basically two types of deterministic models: those consisting of a trace of input events, typically obtained from measurements on a system similar to the one being simulated, and those which describe the workload in program form.
Thus, there are three types of simulator workloads:

- distribution-driven
- trace-driven
- program-driven

where distribution-driven includes all workloads based on stochastic models.
A distribution-driven workload is typically based on statistics on reference locality, program behavior, read/write ratio, fraction of shared data etc. The correlations among job parameters are very hard to implement in a distribution-driven simulator; they are usually partially neglected, which results in less accurate simulation results. The limited knowledge about the behavior of different types of programs, written in different languages and run on a multiprocessor similar to the one simulated, also has to be taken into account. The advantage of this approach is that the simulator is very fast.
The problem with correlations among job parameters is not present in a trace-driven simulator, since the correlations are implicitly contained in a trace. This approach has other disadvantages, however, due to the fact that the traces depend on the source architecture. The simulated architecture ought to be similar to that on which the trace was obtained, which can be a severe problem when simulating a multiprocessor with an uncommon architecture and an adjustable number of processing elements. Traces obtained from a sequential system are usually inappropriate when simulating a parallel system, since the order of execution usually depends on the order in which different parts of a parallel program complete execution [MB90]. Another problem is the complexity of the trace capturing if all information about data dependency is to be contained in the traces.
A program-driven workload [Fer78] does not suffer from the problems stated above, but involves the problems one encounters in designing representative workloads. To achieve appropriate accuracy, the programs of the workload have to be designed very carefully, which usually means that they are complex and therefore time-consuming. A program-driven simulator is very flexible. A program-driven workload is typically based on a program behaving exactly the same as the processor that runs the code. In this case, the workload can be made to execute programs, e.g. benchmarks, whose code has been compiled to the input format of the workload. This gives rise to great possibilities, but one has to take into account that the more accurate the model is, the more time-consuming it is. Since the hardware is visible, the problem of data dependency is eliminated.
2.2 Management of simulated time
There are different approaches for how to build a simulator system. The relationship between simulated time and events in the system gives rise to simulation techniques based on the following approaches:
- event-scheduling approach
- fixed-step activity-scanning approach
- variable-step activity-scanning approach
- optimistic methods
In the event-scheduling approach, the clock is advanced by variable steps, skipping the intervals containing no events. This technique is called the next-event technique. The emphasis is clearly on the events. Each event may give rise to one or more events, which are ordered in an event list. The ordering in an event list is based on a time stamp for each element in the list, showing when the event is to be executed. The first event in the event list is the first event to be executed, provided that the simulated system is closed, i.e. that no event outside the system can give rise to events within it.
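The next-event technique above can be sketched as a small loop over a time-ordered event list. This is an illustrative sketch, not the paper's implementation; the handler interface and names are made up:

```python
import heapq

def run_next_event(initial_events, handlers, t_end):
    """Minimal next-event loop: the clock jumps straight to the next
    time-stamped event, skipping the intervals containing no events."""
    event_list = list(initial_events)        # entries: (timestamp, kind, data)
    heapq.heapify(event_list)                # ordered by time stamp
    log = []
    while event_list:
        t, kind, data = heapq.heappop(event_list)
        if t > t_end:
            break
        log.append((t, kind))                # variable step: clock is now t
        for new_event in handlers[kind](t, data):   # an event may schedule more
            heapq.heappush(event_list, new_event)
    return log
```

A handler that reschedules itself every 5 time units, for example, produces events at times 0, 5, 10 within a horizon of 12.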
If the simulator system consists of subsystems, each of which is based on the event technique, the next event to be executed in a certain subsystem is not necessarily the first element in its event list. An event in another subsystem could generate an event in the subsystem, and this event could have a time stamp lower than that of the first element in the event list. Such a system can be quite complicated to simulate on a multiprocessor, because of the dependence between the subsystems. Different solutions to this problem have been proposed (see [Aya89]). On a sequential machine, the next event is the event with the smallest time stamp in the whole system.
In the fixed-step activity-scanning approach [Fer78], the emphasis is on activities, and time is advanced by constant steps. At each step the activities are analyzed in order to determine whether to start or terminate any of them. An activity routine handles the control of the activity. The start or termination of an activity is in some respects treated as a conditional event.
If the durations of the activities in the system are predictable, i.e. can be precisely calculated, and if the starts and stops of activities are unevenly spread in time while the time resolution must be high, the fixed-step technique for advancing the simulated time is not very appropriate. If nothing happens during most of the steps, the time spent analyzing the activities in the system is largely wasted. It would be preferable to advance the time to the next calculated (unconditional) event. This is, of course, only possible if there is no conditional event, since such an event may have to be activated before the next calculated one. After an action, the activities are analyzed in order to calculate their times of start or stop, and if no activity remains conditional, i.e. no start or stop was impossible to calculate, the time can be advanced to the next calculated event. The described method is called the variable-step activity-scanning approach [Fer78], and is in some respects a combination of the fixed-step technique (used when not all actions are predictable) and the next-event technique.
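The variable-step rule can be sketched as follows: fall back to unit steps while any activity's start remains unpredictable, and jump to the next calculated finish time otherwise. The activity representation (dicts with a `ready` predicate and a fixed `duration`) is an assumption for illustration:

```python
def variable_step_scan(activities, t_end):
    """Variable-step activity scanning: unit steps while some activity is
    conditional, next-event jumps when all remaining times are calculated."""
    clock = 0
    trace = []
    while clock <= t_end:
        # Scan: start every activity whose condition now holds; its finish
        # time then becomes calculable (the activity turns unconditional).
        for act in activities:
            if act["state"] == "conditional" and act["ready"](clock):
                act["state"] = "running"
                act["finish"] = clock + act["duration"]
        for act in activities:
            if act["state"] == "running" and act["finish"] <= clock:
                act["state"] = "done"
                trace.append((clock, act["name"]))
        if any(a["state"] == "conditional" for a in activities):
            clock += 1                     # fixed step: something is unpredictable
        else:
            pending = [a["finish"] for a in activities if a["state"] == "running"]
            if not pending:
                break
            clock = min(pending)           # next-event step: jump ahead
    return trace
```

With activity A startable at once (duration 3) and B startable only from time 3 (duration 4), the loop steps by single cycles until time 3, then jumps directly to time 7.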
The first three approaches can all be regarded as examples of the conservative method. In this method [Mis86], a system is modeled as a directed graph. The nodes are logical processes representing the objects of the system, and the arcs represent communication channels between them. Within each logical process, events are simulated strictly in the order of their simulated time. Interprocess communication is required to simulate the messages sent in the system, and the ways the objects affect each other. It is assumed that this interprocess communication preserves the order of messages. If the system does not contain a global clock, i.e. the objects each contain their own clock, all messages are time stamped. The time stamps of the messages sent along any particular arc must be nondecreasing. The method is conservative, since a logical process is not allowed to process a message with a time stamp t until it is certain that no message will ever arrive with a time stamp less than t.
There exist a number of algorithms for speeding up a parallel simulator based on optimistic methods, such as the Virtual Time approach [Jef85]. In optimistic methods, the objects are not synchronized by a global clock; instead each object contains its own local clock. In contrast to conservative methods, the objects are always allowed to continue working. They update their local clocks according to the work they do. All messages in the system are time stamped. If a message arrives at an object and shall give rise to an event at a time t lower than the local clock of the object, and actions have already been performed in that object at times higher than t, all actions in the object since time t must be rolled back, i.e. the system must undo these actions and all their effects. Such actions could include messages sent, which means that these messages, and all their effects in other objects, must be rolled back as well.
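The rollback mechanism can be illustrated with a toy object that saves its state before each action and restores it when a straggler message arrives in its simulated past. The class, the `+=` stand-in action, and the names are illustrative only; a full Time Warp implementation must also cancel messages it has already sent, using anti-messages, which is omitted here:

```python
class TimeWarpObject:
    """Toy object for the optimistic (Time Warp) scheme: it processes
    messages eagerly and rolls back on a message from its simulated past."""
    def __init__(self):
        self.state = 0
        self.clock = 0
        self.history = []                      # (timestamp, value, state before)

    def _apply(self, timestamp, value):
        self.history.append((timestamp, value, self.state))
        self.state += value                    # stand-in for the real event action
        self.clock = timestamp

    def receive(self, timestamp, value):
        if timestamp >= self.clock:            # message in the local future: run it
            self._apply(timestamp, value)
            return
        undone = []                            # straggler: undo actions after t ...
        while self.history and self.history[-1][0] > timestamp:
            t, v, state_before = self.history.pop()
            self.state = state_before          # restore the saved state
            undone.append((t, v))
        self._apply(timestamp, value)          # ... process the straggler ...
        for t, v in reversed(undone):
            self._apply(t, v)                  # ... and redo the undone actions
```

Delivering messages at times 1 and 5 and then a straggler at time 3 leaves the object in the same state as if the messages had arrived in time-stamp order.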
This section has described the different techniques possible when implementing the structure of a simulator model of a multiprocessor. The next section explains the approach used in the case study, and uses the descriptions in this section to motivate the choices made.
3 Simulator structure
Section 3.1 briefly describes the DDM architecture, section 3.2 the techniques used for the case study simulator, section 3.3 the structure of this simulator, and section 3.4 the memory management.
3.1 The Data Diffusion Machine
The multiprocessor simulated as a case study is the Data Diffusion Machine (DDM) proposed by David Warren and Seif Haridi [WH88]. The DDM is a hierarchical bus/tree architecture with shared virtual memory. Such an architecture has been proposed by Wilson [Wil87], but in contrast to Wilson's architecture, where the physically shared memory provides a `home' location for the data, data in the DDM migrate automatically to where they are needed. This reduces access times and network traffic. The location of a datum in the machine is completely decoupled from its virtual address. A cache/directory map of the Data Diffusion Machine is shown in figure 1.
The hardware organization consists of a hierarchy of buses and data controllers linking an arbitrary number of processors each having a large set-associative memory. Each data controller has a set-associative directory containing status information for data under its control. The controller supports remote data access by snooping on the bus above and below it. The cache consistency protocol it uses provides for the automatic migration, duplication, invalidation, and replacement of data while maintaining data coherency [HH89].
The bus that connects the processing elements in a cluster, the cluster bus, is a high-speed bus with a bus delay much shorter than the interval between the memory references a processor can issue. This is necessary in order to reduce the load on the bus. The cluster bus connects a cluster of similar configurations of processor, cache, memory, and controller. The cluster bus may itself be connected via a controller to a higher bus, and so on up the hierarchy. A directory controller, i.e. a higher-level controller, has a directory containing status information for all blocks present in the subsystem below. The directory controller decides whether a transaction on the bus above concerns its subsystem or not, and whether a transaction on the bus below is local or has to be sent further up the hierarchy. On top of the hierarchy there is a top-directory controller, i.e. a directory controller without the functions concerning transactions on a bus above it. As a consequence, the memory blocks present in the top-directory are the ones present in the whole system.
The DDM is expected to have a relatively high hit ratio within each sub-tree due to the large cache memories.
3.2 Simulation techniques used
As stated above, the case-study simulator is a parallel simulator model of the DDM, implemented in Ada.
Ada was chosen because all primitives for concurrent programming are available directly in the high-level language, and since Ada is standardized it is possible to run the simulator on different parallel machines without changing the code.
The workload of the DDM simulator is program-driven. The most important properties of the simulator should be accuracy and the ability to study the behavior of different architectures executed on the DDM. The problems of accurately analyzing synchronizations, and the traces' dependence on the source architecture, excluded the trace-driven approach as a solution.
In the simulator model of the DDM, a time unit is the time between two consecutive memory references generated by the processors. Since a processor may be waiting for an outstanding request, and thus cannot always generate a new
Figure 1: The Data Diffusion Machine (P = processor, M = memory, D = directory, T = top-directory)
memory reference, a memory reference equivalent, i.e. a memory reference if possible and nothing in the case of waiting for an outstanding request, is used as the time unit. A memory reference equivalent will be referred to as a cycle.
The time advancing of the processors uses the fixed-step activity-scanning approach described above. The motivation is that since a relatively high hit ratio is assumed, a processor will have work to do during most cycles. As the DDM is a hierarchical system, with transactions propagating through the communication network, the choice of simulation technique for the rest of the objects is based on the fact that the times of all actions can be calculated. The natural choice for the network model is therefore the event-scheduling approach. The event scheduling could be based either on one event list for the whole network, or on one event list for each item; the choice depends on the expected number of events. The former is normally chosen when the expected number of events in the system is very low. The latter is chosen for the DDM simulator. The system contains one global clock.
An optimistic approach is not applicable for the DDM simulator, because of the large difference in computing requirements between the different kinds of objects in the system (e.g. the processors vs. the directory controllers or buses). The processors are much more time-consuming, and this discrepancy could lead to extensive rollback.
There are basically two characteristics of the program-driven approach to simulating a multiprocessor that led to this structure: a large number of objects in the system, and a huge difference in computing requirements between the different kinds of objects in the system.
3.3 The simulator structure of the DDM
In the DDM simulator each workload, i.e. each processing element (PE), is an MC68000 simulator extended with the test&set synchronization primitive. The MC68000 simulator runs binary code in a program-driven fashion. In the simulator model of the DDM, the time unit is one cycle.
As motivated in section 3.2, the time advancing of the processors uses the fixed-step activity-scanning approach, and the timing of the network model is based on the event-scheduling approach, with one event list for each module.
The types of modules of the DDM simulator are the following:
- Processing element cluster (PE-cluster)
- Synchronization module
where a PE-cluster consists of the lowest-level bus and the processing elements connected to it. The synchronization module takes care of updating the global clock and synchronizing the PE-clusters. Each module is implemented as an individual process. This leads to the process model shown in figure 2.
When all PEs have completed one cycle, i.e. one memory reference equivalent, the synchronization module updates the global clock and a new cycle begins. The buses and directories can all read the global clock, and the simulation of the latencies in these modules is based on it. A PE is allowed to access the caches of the other PEs in the same cluster in order to reduce the complexity in, and the communication within, the PE-clusters. When a transaction from a PE-cluster reaches a directory, the directory performs the necessary operations (if any), and if a transaction is to be sent, it is placed in an event queue in the directory (in order to simulate the latency of the directory). When the time has come to send the transaction on a bus, it is placed in either an output-above buffer or an output-below buffer. The transactions in the buffers are sent to the bus tasks as these become free. A bus task also has an event queue, but in contrast to the directories, only one transaction at a time is allowed in a bus task. In order to minimize the amount of interprocess communication in the system, a bus is allowed to read the status of the block in question in the directory caches, in order to see whether the transaction to be sent will result in any action in a given directory. If not, the transaction is not sent to this directory.
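The per-cycle synchronization described above can be sketched as a simple loop: every PE performs one memory-reference equivalent, then the synchronization step advances the global clock. The `step`/`issue`/`deliver_due` interfaces are hypothetical, not the simulator's actual Ada task interfaces:

```python
def run_cycles(pes, network, n_cycles):
    """Sketch of the simulator's global cycle: each PE does one memory
    reference equivalent per cycle, then the clock is advanced."""
    global_clock = 0
    for _ in range(n_cycles):
        for pe in pes:
            ref = pe.step(global_clock)        # a memory reference, or None
            if ref is not None:                # (waiting for an outstanding request)
                network.issue(global_clock, ref)
        network.deliver_due(global_clock)      # drain event queues up to the clock
        global_clock += 1                      # synchronization module: new cycle
    return global_clock
```

With stub PEs that issue a reference only on even cycles, two PEs running for four cycles issue four references in total.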
3.4 Memory management
Both the directories and the big cache memories in the DDM are organized as block-set-associative caches. The cache memories are general purpose. The block size, the number of sets, and the number of blocks per set of the cache memories are scalable through parameters.
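A minimal skeleton of such a cache, parameterized on the three quantities named above, could look as follows. The LRU replacement policy is an assumption for illustration; the paper does not state which policy the simulator uses:

```python
class SetAssociativeCache:
    """Block-set-associative cache skeleton: block size, number of sets,
    and blocks per set (associativity) are the scalable parameters."""
    def __init__(self, block_size, num_sets, blocks_per_set):
        self.block_size = block_size
        self.num_sets = num_sets
        self.blocks_per_set = blocks_per_set
        self.sets = [[] for _ in range(num_sets)]   # per set: tags in LRU order

    def access(self, address):
        block = address // self.block_size          # which memory block
        index = block % self.num_sets               # which set it maps to
        tag = block // self.num_sets
        s = self.sets[index]
        if tag in s:
            s.remove(tag)
            s.append(tag)                           # most recently used last
            return "hit"
        if len(s) == self.blocks_per_set:
            s.pop(0)                                # evict least recently used
        s.append(tag)
        return "miss"
```

For example, with 4-byte blocks, 2 sets, and 2 blocks per set, two addresses in the same block hit, while a third distinct block mapping to a full set causes an eviction.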
In a DDM system the big cache memory is expected to be in the region of 2-8 Mbyte, but a simulation model, e.g. one corresponding to a 256-processor system, cannot possibly contain all this memory. Since all processors run the same program (the execution of a certain processor depends on its processor identity), it is not necessary to fully implement the instruction part of the caches. Only status information about the blocks is necessary in order to correctly simulate the cache behavior, e.g. misses, hits, and replacement. The whole simulator thus only has to contain a single memory holding the instructions.
The same method of saving memory is not applicable for data, since it assumes a correct behavior of the data, e.g. that false sharing cannot occur. In order to accurately evaluate cache coherence protocols, such an assumption must not be built into the system. This means that the cache memories contain both the data blocks and their status information.
4 Performance of the DDM simulator
A key parameter for simulators is the slow-down factor, i.e. the ratio between the time the simulation takes and the real execution time. In this case, the real execution time is the time a Sun 3/80 needs to execute the same program. The more accurate and complex the simulator is, the higher the slow-down factor.
In order to determine the slow-down factor of the DDM simulator, a 64-processor version without printouts was compiled and run. The number of cycles to be executed could be chosen. The results were compensated for the time required for initialization, which was obtained through a simulation with no cycles to be executed. The result when running 100,000 cycles per processor (6,400,000 cycles for the system) was 1050 cycles per second.
The reference speed must be transformed from instructions per second to memory references per second. As we do not have any accurate figures on the average number of memory references per MC68000 assembler instruction, the assumption 1.5-2 was made. If we assume the number of memory references per second to be 2 x 10^6 in reality, the slow-down factor is approximately 2000.
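The arithmetic behind the estimate, using the figures quoted in the text (the 2 x 10^6 reference rate is the assumption made there), is simply:

```python
# Slow-down estimate: real reference rate divided by the simulator's rate.
sim_speed = 1050                    # simulator: memory reference equivalents/second
real_speed = 2 * 10**6              # assumed real rate: memory references/second
slowdown = real_speed / sim_speed   # about 1905, i.e. roughly 2000
```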
5 Validation

It is important to find methods to verify a simulator. Not only must the correctness of the network and protocol be verified; it must also be verified that the correct messages are sent at the right times. A typical problem when verifying multiprocessor simulation based on the trace-driven approach is the messages sent at synchronization points, since the circumstances might be completely different for the simulated
Figure 2: The processes of the DDM simulator (S = synchronizer process, C = cluster process, B = bus process, D = directory process)
architecture than for the source architecture of the traces.
The program-driven approach, as described above, has certain properties that are advantageous for the verification process. Basically, the verification of this type of simulator can be separated into three parts:
1. By discussing the timing model and the basic simulation strategies with other scientists and research groups in order to confirm the principles of the structure.
2. Since the simulator actually executes code, the execution result can be analyzed in order to detect any errors. This verifies the functionality of the workloads, and verifies that the communication model works properly, although it does not verify the timing model or traffic load. Programs can be especially written in order to accurately test the protocol, and stress the sharing of data.
3. By the theory of queueing networks. These models are based on stochastic behavior, and the distributions and characteristics of the objects can be based on information captured from the program-driven simulation. However, the statistics from the program-driven simulation must not depend on the network latency and amount of traffic; otherwise the effects of the network are already implicitly contained in the statistics, and it is not possible to obtain any good results from the queueing network model. Some parameters will always depend on network properties, e.g. contention will always affect the sharing of data and the synchronizations. Such effects must either be neglected or compensated for.
The mathematical models of queueing networks tend to be very complex for large multiprocessors. The model ought to be broken down into interacting subsystems in order to obtain a more manageable model. It can be solved either by mathematical methods or by a distribution-driven simulator.
The results from the queueing network model, compared with those from the program-driven approach, verify the network figures, e.g. processor utilization, network traffic, etc.
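As a concrete example of the kind of queueing formula usable in such a cross-check, the standard M/M/1 results give utilization and delay from arrival and service rates. These are textbook formulas, not figures from this paper, and a real network model would be far more elaborate:

```python
def mm1_metrics(arrival_rate, service_rate):
    """Standard M/M/1 queue metrics: utilization, mean number in system,
    and mean response time. Usable only as a coarse sanity check against
    figures measured in the program-driven simulation."""
    rho = arrival_rate / service_rate          # server utilization
    assert rho < 1, "queue is unstable"
    mean_in_system = rho / (1 - rho)           # mean number of customers
    mean_response = 1 / (service_rate - arrival_rate)
    return rho, mean_in_system, mean_response
```

For instance, a bus serving twice as fast as requests arrive is 50% utilized.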
Since the simulated system does not exist, i.e. there is no reality to compare with, making sure that the conceptual model has been correctly translated into a simulator is not a trivial task. The problem is in principle very similar to that of program correctness verification.
6 Application of the simulator
6.1 Synchronization traffic analysis
One of the advantages of this simulation approach appears when analyzing synchronization traffic on a fine-grain basis. Measurements of execution time for different configurations of the simulated architecture under hot-spot contention [PN85] have been made. By making the program-driven simulator generate traces, and then running simulations based on these traces, comparisons between the program-driven and the trace-driven approach have been made.
The results show that traces collected when the same program was run on a different configuration can be very far from representative, basically due to differences in the locking of synchronization variables for the same program on different configurations. Table 1 shows the execution time and bus traffic load. The number of processors was chosen to be 16 in two levels, and the two DDM configurations run were 2 clusters with 8 processing elements each (when collecting the traces), and 8 clusters with 2 processors each (when comparing the program-driven and trace-driven behavior). The traffic load is measured on the highest-level bus in the hierarchy. The execution time is the largest execution time of the processors. The trace-driven simulations were verified by running them on the same topology as the one they were collected on, and the behavior was exactly the same.
The program was a loop, including some calculations and a critical section protected by a semaphore implemented with the test&set synchronization primitive. Within the critical section, a shared variable was read and updated. There were two shared variables, one for the first 8 processors and one for the rest of them. Each shared variable was protected by its own semaphore. As can be seen in Table 1, there were large differences in both execution times and bus load.
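A semaphore built on test&set, of the kind used in the benchmark program, can be sketched as a spinlock. In hardware, test&set is a single atomic instruction; the sketch below fakes that atomicity with a guard lock, so the class and its names are illustrative only:

```python
import threading

class TestAndSetLock:
    """Spinlock on top of a (simulated) atomic test&set primitive."""
    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()     # stands in for hardware atomicity

    def test_and_set(self):
        with self._guard:                  # atomically read old value, set True
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        while self.test_and_set():         # spin while the lock is held
            pass

    def release(self):
        self._flag = False
```

A critical section is then `lock.acquire(); ...; lock.release()`, exactly the pattern around the shared-variable update in the benchmark loop.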
6.2 Lockup-free cache measurements
Cache misses can cause extensive processor stalls, both in uniprocessor and multiprocessor environments. One method to reduce processor stall is to make the processor capable of continuing execution while waiting for the data. There are early examples of machines supporting this, e.g. the IBM System/360 Model 91 [Tom67] and the CDC 6600 [Tho64], and newer ones, such as the IBM RP3 [BMW85]. The processor can support either fully dynamic instruction
Simulation   No. of     Cluster   Exec   Bus traffic
technique    clusters   size      time   load
Program      2          8         645    40
Trace        2          8         645    40
Program      8          2         2234   914
Trace        8          2         1079   669

Table 1: The execution times for the processors for different simulation approaches and topologies. The traces were collected from the program-driven simulation with 2 clusters with 8 processors each.
scheduling, and only block when a conditional branch does not have the appropriate flag set, or an easier type of non-blocking, where the processor may continue executing after a read miss until a data dependency on the missed data occurs. We focus on the latter.
In order to support a non-blocking processor, the cache must be able to handle multiple outstanding requests. This enables the processor to overlap computation with communication. Kroft [Kro81] suggested this type of uniprocessor cache, which is referred to as a lockup-free cache. Scheurich and Dubois [SD88] have proposed a lockup-free cache design for a multiprocessor.
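A first-order model of the non-blocking scheme studied here: after a read miss the processor keeps executing until the instruction that uses the data, so a miss stalls it for at most the miss latency minus the data-dependency distance. This is a simplified illustration of the mechanism, not the measured model behind Table 2:

```python
def blocked_fraction(miss_latency, dependency_distance, cycles_between_misses):
    """Fraction of cycles stalled for a non-blocking processor: each miss
    overlaps `dependency_distance` cycles of useful work with the memory
    access, and stalls for whatever latency remains (simplified model)."""
    stall = max(0, miss_latency - dependency_distance)
    return stall / (cycles_between_misses + stall)
```

With a latency of 10 cycles, a dependency distance of 10 hides the miss completely (best case), while a distance of 0 behaves like a blocking processor.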
Simulations of a non-blocking processor with a lockup-free cache have been made. The study focuses on loops with extensive accesses to arrays. Typical target applications are numerical algorithms where a large part of the time is spent in numerical loops, consisting of several array and matrix traversals. One of the most well-known and well-established benchmarks for numerical loops is the Livermore Loops suite [McM84].
The results from the simulations evaluating how the data-dependency distance affects processor stall are presented below. Table 2 shows how large a fraction of the total number of cycles the processor was blocked, for different loops and different dependency distances. In the best case, the data-dependency distance was the largest possible for that loop, while in the worst case the use instruction was placed directly after the load instruction, i.e. almost as bad as for a blocking processor. Only in one of these loops was it possible to completely hide the cache misses, while in others the data-dependency distance was shown to be so small that the processor was not even able to halve the fraction of blocked cycles. The ratio between the processor time between two consecutive read requests and the memory latency was
Loop Version Fraction blocked
1 Best case 0.108
1 Worst case 0.175
6 Best case 0.118
6 Worst case 0.175
7 Best case 0.017
7 Worst case 0.109
10 Best case 10 Worst case 0.118
Table 2: The fraction of blocking cycles, a comparison between the best and the worst cases possible for the loops.
chosen to be 10. In this simulation example, the interesting data captured from the simulator was the processor activity, i.e. number of cycles executing and stalled.
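To make the best/worst-case comparison concrete, the stall behaviour can be approximated with a toy model (entirely our own simplification, not the simulator itself): with a miss latency of L cycles and a dependency distance of d independent instructions, the processor stalls for roughly max(0, L - d) cycles at the use instruction.

```python
# Toy model of per-iteration stall for a non-blocking processor.
# All parameter names and the cost model are assumptions for illustration.

def blocked_fraction(latency, distance, work_per_iter):
    """Approximate fraction of cycles the processor stalls.

    latency       -- miss latency in cycles
    distance      -- data dependency distance: independent instructions
                     executed between the load and its first use
    work_per_iter -- instructions executed per iteration (1 cycle each)
    """
    stall = max(0, latency - distance)   # cycles still waiting at the use
    return stall / (work_per_iter + stall)

# Worst case: the use follows the load directly (distance 0);
# best case: the dependency distance covers the whole latency.
worst = blocked_fraction(latency=10, distance=0, work_per_iter=50)
best = blocked_fraction(latency=10, distance=10, work_per_iter=50)
```

Under this model, the best case hides the miss completely, while the worst case stalls for the full latency, which is the qualitative pattern Table 2 shows.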
A simulation model that supports very accurate modeling of multiprocessors with a packet-switched interconnection network and private caches is explored. It is based on program-driven simulation of the processors, and every action of importance is simulated accurately.
The simulation model described does not suffer from dependence on the host architecture, as the trace-driven approach does. An application shows that simulating synchronization under hot-spot contention with trace-driven simulation, using traces from an architecture other than the one simulated, gives erroneous results. The program-driven approach, on the other hand, simulates every action connected to the synchronization, so the synchronization can be modeled both accurately and correctly for any multiprocessor configuration.
When modeling non-blocking architectures and lockup-free caches, data dependency is of importance. Producing traces with full data-dependency information is impossible unless the instruction code is interpreted to determine which registers each instruction affects. Since this is exactly what a program-driven workload does, the approach is a necessary tool when exploring such systems. The simulator could also be used to produce traces carrying this information.
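The point can be illustrated with a hypothetical sketch: by interpreting each instruction, a program-driven workload knows which register a load writes, and can therefore annotate every memory reference in a trace with its dependency distance. The three-address instruction format below is invented for the illustration.

```python
# Hypothetical dependency-annotating tracer: interpreting instructions
# reveals the distance from each load to the first use of its result.

def annotate_trace(instrs):
    """instrs: list of (op, dest_reg, src_regs, addr_or_None).
    Returns one (addr, dependency_distance) entry per load."""
    trace = []
    open_loads = {}                       # reg -> (trace index, issue position)
    for pos, (op, dest, srcs, addr) in enumerate(instrs):
        for r in srcs:                    # first use of a loaded register?
            if r in open_loads:
                i, load_pos = open_loads.pop(r)
                trace[i] = (trace[i][0], pos - load_pos)
        if op == "load":
            trace.append((addr, None))    # distance filled in at first use
            open_loads[dest] = (len(trace) - 1, pos)
    return trace

prog = [("load", "r1", [], 0x40),
        ("add", "r2", ["r3", "r4"], None),
        ("add", "r5", ["r1", "r2"], None)]   # first use of r1: distance 2
```

An address-only trace, by contrast, records the load at 0x40 but not the two-instruction gap before its use, which is precisely the information a non-blocking architecture study needs.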
Generally, trace-driven simulation methods are not valid for multiprocessor performance evaluation. Only in exceptional cases is trace-driven simulation appropriate.
As a case study, a simulator model of the DDM has been implemented. The simulator is implemented in Ada, which makes it possible to run it on different parallel machines without changing the code. We are currently exploring the parallelism of the simulator. The simulator handles 1050 memory references/second on a Sun 3/80.
A disadvantage of the method is that it is time-consuming. If the architecture of the simulated processors is not of particular interest, one way to improve the speed is to let the instructions be executed directly on the host. For the DDM simulator, this would improve the performance by at least a factor of 2.
The timing of the system is based on the fixed-step activity-scanning approach for the processor clusters, because the time between actions is equal and actions occur in most time units. For the other objects of the system, e.g. buses and directories, event scheduling is chosen, because these are not expected to be as heavily loaded as the clusters. No other methods were quantitatively evaluated.
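The two timing mechanisms can be contrasted with a minimal sketch (all class and function names are invented for illustration; the actual simulator is written in Ada):

```python
import heapq
import itertools

def activity_scanning(clusters, end_time):
    """Fixed-step: every time unit, every cluster is scanned, which pays
    off when a cluster acts in almost every time unit."""
    for t in range(end_time):
        for cluster in clusters:
            cluster.step(t)              # act if an action is due at t

def event_scheduling(initial_events, end_time):
    """Event-driven: lightly loaded objects (buses, directories) are only
    touched when an event is actually scheduled for them."""
    counter = itertools.count()          # tie-breaker for equal event times
    queue = [(t, next(counter), obj) for t, obj in initial_events]
    heapq.heapify(queue)
    while queue and queue[0][0] < end_time:
        t, _, obj = heapq.heappop(queue)
        for next_t, next_obj in obj.handle(t):
            heapq.heappush(queue, (next_t, next(counter), next_obj))
```

The trade-off sketched here is the one the text describes: scanning costs one visit per object per time unit regardless of load, while event scheduling costs one heap operation per event, which wins only for sparsely active objects.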
The reason for implementing a whole processor cluster as one process is to reduce the amount of inter-process communication, which is very time-consuming in Ada. The number of processes in the system is still high enough for parallel execution of the simulator.
The analysis of lockup-free caches and non-blocking processors will continue, both for uniprocessors and multiprocessors. A lockup-free multiprocessor cache controller supporting various access order models has been implemented [SDL91], in which the processors are assumed to support more than one outstanding memory reference at a time. Our simulation model seems very promising for the analysis of cache coherence schemes and synchronization traffic under various access order models. For that purpose, a new workload has been implemented, which simulates a processor with dynamic instruction scheduling.
I am deeply indebted to Dr Per Stenström for his constructive and helpful comments and discussions, to Prof. Lars Philipson and Dr Rassul Ayani, who have contributed important points of view, and to Björn Breidegård for his excellent work on the processor simulator.
This work was supported by the Swedish National Board for Technical Development (STU) under contract number 85-3899 and 87-2427.
[ASH86] A. Agarwal, R.L. Sites, and M. Horowitz. ATUM: A New Technique for Capturing Address Traces Using Microcode. In Proc. of 13th International Symposium on Computer Architecture, pages 119-127, 1986.
[Aya89] R. Ayani. Parallel Simulation of Queuing Networks using Moving Time Windows. Technical report, Department of Telecommunications and Computer Systems, KTH, Sweden, February 1989. Report No. TRITA-TCS-8901.
[Bit90] P. Bitar. A Critique of Trace-driven Simulation of Shared-Memory Multiprocessors. In Michel Dubois and Shreekant Thakkar, editors, Cache and Interconnect Architectures in Multiprocessors. Kluwer Academic Publishers, 1990.
[BMW85] W.C. Brantley, K.P. McAuliffe, and J. Weiss. RP3 Processor-Memory Element. In Proc. of 1985 International Conference on Parallel Processing, pages 782-789, 1985.
[DS90] F. Dahlgren and P. Stenström. Methods to Improve the Performance of Data Caches. Technical report, Department of Computer Engineering, Lund University, Sweden, 1990.
[EK89] S.J. Eggers and R.H. Katz. Evaluating the Performance of Four Snooping Cache Coherence Protocols. In Proc. of 16th International Symposium on Computer Architecture, pages 2-15, 1989.
[Fer78] D. Ferrari. Computer Systems Performance Evaluation. Prentice-Hall, 1978.
[Fly66] M.J. Flynn. Very High-Speed Computing Systems. Proceedings of the IEEE, pages 1901-1909, December 1966.
[HH89] S. Haridi and E. Hagersten. The Cache Coherence Protocol of the Data Diffusion Machine. In Proceedings of PARLE 89 (Springer-Verlag), volume 1, pages 1-18, 1989.
[Jef85] D.R. Jefferson. Virtual Time. ACM Transactions on Programming Languages and Systems, 7(3):404-425, 1985.
[Kro81] D. Kroft. Lockup-Free Instruction Fetch/Prefetch Cache Organization. In Proc. of 8th International Symposium on Computer Architecture, pages 81-87, 1981.
[MB90] R. Mukherjee and J. Bennett. Simulation of Parallel Computer Systems on a Shared Memory Multiprocessor. In Proc. of 23rd Hawaii International Conference on System Science, volume 1, pages 242-251, 1990.
[McM84] F.H. McMahon. LLNL Fortran Kernels: MFlops. Technical report, Lawrence Livermore Laboratories, Livermore, CA, March 1984.
[Mis86] J. Misra. Distributed Discrete-Event Simulation. ACM Computing Surveys, 18(1):39-65, 1986.
[PN85] G.F. Pfister and V.A. Norton. "Hot Spot" Contention and Combining in Multistage Interconnection Networks. IEEE Transactions on Computers, C-34(10):943-948, October 1985.
[SD88] C. Scheurich and M. Dubois. Concurrent Miss Resolution in Multiprocessor Caches. In Proc. of 1988 International Conference on Parallel Processing, volume 1, pages 118-125, 1988.
[SDL91] P. Stenström, F. Dahlgren, and L. Lundberg. A Lockup-free Multiprocessor Cache Design. Technical report, Department of Computer Engineering, Lund University, Sweden, 1991.
[Smi87] A.J. Smith. Line (Block) Size Choice for CPU Cache Memories. IEEE Transactions on Computers, C-36(9):1063-1075, 1987.
[Tho64] J.E. Thornton. Parallel Operation in the Control Data 6600. AFIPS Conference Proceedings FJCC, 16(2):33-40, 1964.
[Tom67] R.M. Tomasulo. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, 11:25-33, January 1967.
[WH88] David H.D. Warren and Seif Haridi. The Data Diffusion Machine: a Scalable Shared Virtual Memory Architecture for Parallel Execution of Logic Programs. In Proceedings of the International Conference on Fifth Generation Computer Systems, November 1988.
[Wil87] A. Wilson. Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors. In Proc. of 14th International Symposium on Computer Architecture, pages 244-252, 1987.