lems such as Jacobi and TSP, moderate speedups for Water, and very little speedup for Cholesky. Regardless of the considerable bandwidth available on these networks, Cholesky's performance is constrained by the very high number of synchronization operations.
Among the protocols for implementing software release consistency, we distinguish between eager and lazy protocols. Eager protocols push modifications to all cachers at synchronization variable releases . In contrast, lazy protocols  pull the modifications at synchronization variable acquires, and communicate only with the acquirer. Both eager and lazy release consistency can be implemented using either invalidate or update protocols. We present a new lazy hybrid protocol that combines the benefits of update and invalidate: few access misses, low data and message counts, and low lock acquisition latency.
Our simulations indicate that the lazy algorithm and the hybrid protocol significantly improve the performance of medium-grained programs, those on the boundary of what can be supported efficiently by a software DSM. Communication in coarse-grained programs is sufficiently rare that the choice of protocols becomes less important. The eager algorithms perform slightly better for TSP because the branch-and-bound algorithm benefits from the early updates in the eager protocols (see Section 6.2). For the fine-grained programs, lazy release consistency and the hybrid protocol reduce the number of messages and the amount of data drastically, but the communication requirements are still beyond what can be supported efficiently on a software DSM. For these kinds of applications, techniques such as multithreading and code restructuring may prove useful.
The outline of the rest of this paper is as follows. Section 2 briefly reviews release consistency, and the eager and lazy implementation algorithms. Section 3 describes the hybrid protocol. Section 4 details the implementation of the protocols we simulated. Section 5 discusses our simulation methodology, and Section 6 presents the simulation results. We briefly survey related work in Section 7 and conclude in Section 8.
2 Release Consistency
For completeness, we reiterate in this section the main concepts behind release consistency (RC) , eager release consistency (ERC) , and lazy release consistency (LRC) .
RC  is a form of relaxed memory consistency that allows the effects of shared memory accesses to be delayed until selected synchronization accesses occur. Simplifying matters somewhat, shared memory accesses
are labeled either as ordinary or as synchronization accesses, with the latter category further divided into acquire and release accesses. Acquires and releases may be thought of as conventional synchronization operations on a lock, but other synchronization mechanisms can be mapped on to this model as well. Essentially, RC requires ordinary shared memory accesses to be performed only when a subsequent release by the same processor is performed. RC implementations can delay the effects of shared memory accesses as long as they meet this constraint.
For instance, the DASH  implementation of RC buffers and pipelines writes without blocking the processor. A subsequent release is not allowed to perform (i.e., the corresponding lock cannot be granted to another processor) until acknowledgments have been received for all outstanding invalidations. While this strategy masks latency, in a software implementation it is also important to reduce the number of messages sent because of the high per message cost.
In an eager software implementation of RC such as Munin's multiple-writer protocol , a processor delays propagating its modifications of shared data until it executes a release (see Figures 1 and 2). Lazy implementations of RC further delay the propagation of modifications until the acquire. At that time, the last releaser piggybacks a set of write notices on the lock grant message sent to the acquirer. These write notices describe the shared data modifications that precede the acquire according to the happened-before-1 partial order . The happened-before-1 partial order is essentially the union of the total processor order of the memory accesses on each individual processor and the partial order of release-acquire pairs. The happened-before-1 partial order can be represented efficiently by tagging write notices with vector timestamps . At acquire time, the acquiring processor determines the pages for which the incoming write notices contain vector timestamps larger than the timestamp of its copy of that page in memory. For those pages, the shared data modifications described in the write notices must be reflected in the acquirer's copy either by invalidating or by updating that copy. The tradeoffs between invalidate and update and a new hybrid protocol are discussed in the next section.
3 A Hybrid Protocol for LRC
A lazy invalidate protocol invalidates the local copy of a page for which a write notice with a larger timestamp is received (see Figure 3). The lazy update protocol never invalidates pages to maintain consistency. Instead, acquiring processes retrieve all modifications named by