Efficient Parallel Global Garbage Collection
on Massively Parallel Computers
Tomio Kamada, Satoshi Matsuoka, Akinori Yonezawa
The University of Tokyo
On distributed-memory high-performance MPPs where processors are interconnected by an asynchronous network, efficient Garbage Collection (GC) becomes difficult due to inter-node references and references within pending, unprocessed messages. Our parallel global GC algorithm (1) takes advantage of reference locality, (2) efficiently traverses references over nodes, (3) admits minimum pause time of ongoing computations, and (4) has been shown to scale up to 1024 node MPPs. The algorithm employs a global weight counting scheme to substantially reduce message traffic. The two methods for confirming the arrival of pending messages are used: one counts numbers of messages and the other uses network `bulldozing.'
Performance evaluation in actual implementations on a multicomputer with 32?1024 nodes, Fujitsu AP1000, reveals various favorable properties of the algorithm.
Much of previous programming systems for MPPs (Massively Parallel Processors) have either precluded dynamic storage allocation, such as FORTRAN, or allowed dynamic allocation but were not `pointer-safe', such as C, and required the programmer to manage storage with explicit malloc() and free() calls. In the sequential programming domain, however, it is well known that most recent breed of programming languages based on well-founded programming paradigms (functional, object-oriented, logic programming, etc.) and of actual practical use are pointersafe, and manage storage automatically with garbage collection (GC). Examples are Common Lisp, Smalltalk, Eiffel, Standard ML, Prolog, etc. Even for C++, several proposals have been made for GC (e.g., ), and within the C++ Standardization Committee, GC is one of the major agenda.
As the application of parallel programming broadens from specialized fields such as numerical computing to more general computing areas such as knowledge processing or business computing, there is a growing need for a better, more advanced languages (which could be extensions
to previously mentioned sequential counterparts.) Indeed, there have been some experimental programming languages for small- to medium-scale distributed-memory parallel architectures that facilitated automated storage management. Some notable examples are MultiLisp, KL1, and POOL/T. Also, in distributed computing, several distributed GC algorithms have been proposed in the past. But for both of such cases, realistic applications to commercially-available, large-scale MPPs have not been entirely successful. This is because the algorithms (1) incurred excessive message traffic, (2) involved prohibitive runtime overhead, (3) did not properly scale to larger number of processors, and/or (4) have not been implemented to test their validity/efficiency.
In general, GC algorithms for distributed-memory architectures can be largely classified into two types:
ffl Using reference counts: This type of GC algorithm mainly performs local GC on each node and uses reference counting to decide which objects are referenced from other nodes [4, 24, 13, 18]. However, no algorithms we know to date can collect cyclic garbage ranging over nodes with low runtime overhead required for parallel computation.
ffl Distributed marking over nodes: This type of GC algorithm is an extension of the standard mark-and-sweep algorithm, and traverses/marks both local and internode references[2, 10, 11, 22]. They can collect cyclic garbages, but many involve long pause time and/or excessive message traffic. Messages by the garbage collector not only marks objects in other nodes, but also used to detect termination of marking traversal. Some algorithms allow independent local GC on each node to exploit reference locality[10, 11, 22].
We present an efficient real-life distributed GC algorithm based on an asynchronous message passing model and distributed marking. It assumes distributed-memory highperformance MPPs where processors are interconnected by a high-speed asynchronous network. The first prototype has been implemented and running on a 32?1024 node MPP, Fujitsu AP1000. As far as we know, it is the first GC algorithm to have been shown to scale to 1024 nodes.