page 2  (36 pages)
to previous section1
3to next section

1 Introduction

Large scale shared-memory architectures provide shared address space supported by physically distributed memory. The strength of such systems comes from combining the scalability of network-based architectures with the convenience of the shared-memory programming model. Computing performance on such systems is affected by two important architecture and system design factors: the interconnection network and the memory system structure. The choice of interconnection networks to link processor nodes to cache/memory modules can make non-uniform memory access (NUMA) times vary drastically, depending upon the particular access patterns involved. A hierarchical ring structure is an interesting base on which to build large scale shared-memory multiprocessors. In large NUMA architectures, the memory system organization also affects communication latency. With respect to the kinds of memory organizations utilized, NUMA memory systems can be classified into the following three types in terms of data migration and coherence:

ffl Non-CC-NUMA stands for non-cache-coherent NUMA. This type of architecture either supports no local caches at all (e.g. the BBN GP1000 [1]), or provides a local cache that disallows caching of shared data in order to avoid the cache coherence problem (e.g. the BBN TC2000 [1]).

ffl CC-NUMA stands for cache-coherent NUMA, where each processor node consists of a processor with an associated cache, and a designated portion of the global shared memory. Cache coherence for a large scale shared-memory system is usually maintained by a directory-based protocol. Examples of such a system are the Stanford DASH [8], and the University of Toronto's Hector [13].

ffl COMA stands for cache-only memory architecture. Like CC-NUMA, each processor node has a processor, a cache, and a designated portion of the global shared memory. The difference, however, is that the memory associated with each node is augmented to act as a large cache. Consistency among cache blocks in the system is maintained using a cache coherence protocol. A COMA system allows transparent migration and replication of data items to the nodes where they are referenced. Example systems are the Kendall Square Research's KSR-1 [7], and the Swedish Institute of Computer Science's Data Diffusion Machine (DDM) [5].

A comparative performance evaluation between CC-NUMA and COMA models has been conducted by using simulations in [12], where dynamic network contention is not a major consideration. In addition, only 16 processors were simulated on relatively small problem sizes. However, both CC-NUMA and COMA systems are targeted at large scale architectures on large problem sizes. Another experimental measurement has been recently conducted to compare performance of the DASH (CC-NUMA on cluster networks) and the KSR-1 (COMA on hierarchical rings) [11]. We believe that further work needs to be done in order to more precisely and completely provide insight into the overhead effects inherent in the two memory systems. First, a comparative evaluation needs to carefully take into consideration the network contention which varies between the two memory system designs and among different interconnection network architectures. Second, a comparative evaluation between the two memory systems