A Multithreaded Implementation of Id using P-RISC Graphs
Rishiyur S. Nikhil
Digital Equipment Corporation, Cambridge Research Laboratory
One Kendall Square, Building 700, Cambridge MA 02139, USA
Abstract. P-RISC graphs are parallel, structured, executable control-flow graphs that model the locality properties of distributed memory machines. P-RISC graphs have coarse-grain parallelism for distributing work to processors, and fine-grain parallelism for overlapping the latencies of remote accesses and synchronization. These properties support dynamic and irregular parallelism. In this paper, we describe P-RISC graphs, how they are used to implement the implicitly parallel programming language Id, and how they are implemented on distributed memory machines. We also contrast our approach with Berkeley's TAM, a similar system.
HPF is an example of a trend back towards shared memory programming models, even though massively parallel machines are implemented with distributed memories. This approach is regaining popularity because it is now widely recognized that writing explicit message-passing programs is extremely hard even for a fixed configuration, and even harder if the program is also to be portable and tunable. However, for good performance on distributed memory systems, locality is very important, and today it is infeasible to achieve completely automatic distribution of data and processes for good locality. The HPF solution is to rely on the programmer to specify the data distribution of arrays. The compiler analyzes this information and uses it to distribute loop iterations close to their data, to batch communications (reducing overhead), and to overlap communication with computation (reducing idling).
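The owner-computes strategy just described can be made concrete with a small sketch. The following Python fragment is illustrative only (real HPF uses Fortran distribution directives such as `!HPF$ DISTRIBUTE`); the helper names `block_owner` and `local_iterations` are hypothetical:

```python
# Sketch of a BLOCK distribution and the owner-computes rule:
# each processor executes exactly the loop iterations whose data it owns.

def block_owner(i, n, p):
    """Processor owning element i of an n-element array
    block-distributed over p processors."""
    block = (n + p - 1) // p          # ceiling(n / p) elements per block
    return i // block

def local_iterations(me, n, p):
    """Iterations that processor `me` executes under owner-computes."""
    return [i for i in range(n) if block_owner(i, n, p) == me]

# 10 elements over 4 processors: blocks of size 3.
print(local_iterations(0, 10, 4))   # [0, 1, 2]
print(local_iterations(3, 10, 4))   # [9]
```

Because iterations run where their data lives, remote traffic is limited to the loop's non-local operands, which the compiler can then batch and overlap with computation.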
We are interested in more general shared memory programs where data structures may be irregular (trees, graphs, sparse arrays, ...), may be dynamically allocated, and may have data-dependent shapes and sizes. Further, such programs obtain much parallelism from producer-consumer situations, where a data structure is consumed by one process even as it is being built by another. In this environment, the compiler cannot orchestrate process and data placement in the manner of HPF compilers. Almost any individual data access may be remote, and may need synchronization.
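The producer-consumer overlap described above can be sketched with a single-assignment cell on which reads block until the corresponding write arrives, loosely in the style of an Id I-structure slot. This Python sketch is a hypothetical illustration (the `IVar` class is not the paper's implementation); it shows a consumer summing a structure even as a producer is still filling it:

```python
import threading

class IVar:
    """Write-once cell: reads block until the value is written.
    (Hypothetical helper, loosely modeling an I-structure slot.)"""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def write(self, v):
        assert not self._ready.is_set(), "slot is write-once"
        self._value = v
        self._ready.set()

    def read(self):
        self._ready.wait()            # consumer blocks if ahead of producer
        return self._value

# Producer fills a 5-slot structure while the consumer sums it concurrently.
slots = [IVar() for _ in range(5)]
result = []

def producer():
    for i, s in enumerate(slots):
        s.write(i * i)                # 0, 1, 4, 9, 16

def consumer():
    result.append(sum(s.read() for s in slots))

c = threading.Thread(target=consumer)
p = threading.Thread(target=producer)
c.start(); p.start()
p.join(); c.join()
print(result[0])                      # 30
```

Here the OS thread simply blocks in `read`; the point of the multithreaded execution model discussed next is to make such blocking cheap enough, at the granularity of individual accesses, that the processor switches to other ready work instead of idling.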
Our general approach is based on multithreading. Since it is generally not known until runtime whether a particular data access is local or remote, or whether it will block (consumer ahead of producer), the processor must be prepared to multiplex efficiently to another thread instead of idling. But this raises