Efficient and Reusable Implementation of
Fine-Grain Multithreading and Garbage
Collection on Distributed-Memory Parallel Computers
Department of Information Science
Graduate School of Science
University of Tokyo
This thesis studies efficient runtime systems for parallelism management (multithreading) and memory management (garbage collection) on largescale distributed-memory parallel computers. Both are fundamental primitives for implementing high-level parallel programming languages that support dynamic parallelism and dynamic data structures.
A distinguishing feature of the developed multithreading system is that it tolerates a large number of threads in a single CPU while allowing direct reuse of existing sequential C compilers. In fact, it is able to turn any standard C procedure call into an asynchronous one. Having such a runtime system, the compiler of a high-level parallel programming language can fork a new thread simply by a C procedure call to a corresponding C function. A thread can block its execution by calling a library procedure that saves the stack frame of the thread and unwinds stack frames. To resume a thread, StackThreads provides another runtime routine that rebuilds the saved stack frame on top of the current stack and restarts the computation from the blocking point. All these operations are implemented by using information already present on standard C stack frames, without requiring a frame format customized for a particular programming language. Experiments demonstrate that potential performance problems are not significant in practice, even on distributed memory computers in which each remote access causes a thread switch.
The developed garbage collection system is a simple mark & sweep collector that stops the user program while collecting. We show viability of such collectors on a large scale (up to 256 processors) distributed memory computer (Fujitsu AP1000+). Under a reasonable heap expansion policy, garbage collection occupies at most 15% of the application time (excluding idle time). More importantly, the overhead of garbage collection on parallel machines was, except for one application, in the ballpark of that on a single processor, indicating that garbage collection is at least as scalable as the applications. Another observation from the experiment is that independent local collection is a dangerous strategy which degrades performance