cache block frame, are used to identify the nodes with a copy of a given data item. Consistency is maintained by traversing the list anytime this data item is modified. Traversing a list, however, can result in messages flowing over the same network links several times, especially in networks without point-to-point interconnections between all modules. For example, in a ring-connected system a message may have to traverse the entire ring n times in the worst case, if there are n active copies of the data to be invalidated. Barroso and Dubois  have proposed a scheme for a system of processors interconnected by a unidirectional ring that relies on snooping and thus avoids the multiple-traversal problem of the SCI protocol.
In this paper, we propose a selective-broadcast based cache consistency protocol that addresses the three complications listed above for a class of multiprocessors based on hierarchical rings. Ring-based networks have been investigated [1, 6, 10, 11, 16] as a means for implementing high performance interconnection backplanes because they offer a number of advantages. Having point-to-point interconnections, large rings can be driven at very high clock rates. Rings also exhibit natural broadcast and ordering properties that facilitate the implementation of cache consistency protocols. The proposed protocol can easily be used to achieve various consistency models, including sequential consistency  and processor consistency .
In the next section, we define the class of machines for which the protocol is targeted. Section 3 presents the new protocol. Section 4 discusses several performance issues and enhancements to the basic protocol. Finally, in Section 5, we analyze the performance of the protocol through address trace driven simulations.
2 The Architecture of the Target Mul-
The protocol presented in this paper has been developed for general multiprocessors that are based on hierarchical rings. In order to clarify the presentation, we will describe without loss of generality the protocol as it would apply to the Hector multiprocessor architecture .
The target architecture consists of clusters of processor and memory modules, interconnected by rings. In Hector, a cluster is called a station, within which a split-cycle bus is used as the interconnection medium. The hierarchical multiprocessor is formed by interconnecting sets of stations by local rings, which are then
interconnected by higher level rings. While the proposed scheme is scalable to an arbitrary number of levels, for simplicity we will assume the two-level ring hierarchy shown in Figure 1, comprising local rings interconnected by a single central ring.
Each memory module occupies a unique contiguous portion of a flat, global (physical) address space. The processor modules can transparently access all memory locations. The modules consist of a processor, cache memory, a cache controller and a communication submodule.
Information is transferred between processor and memory modules using a packetized synchronous transfer protocol. The interconnection network traffic consists of request packets and response packets. Communication submodules associated with each processor and memory module handle the sending and receiving of the packets. The transfer of packets between modules is managed by station controllers and interring interfaces. Each station controller is responsible for controlling on-station transfers as well as the traffic on the local ring in the vicinity of its station.
Each ring can be thought of as consisting of multiple segments in which, in any given cycle, a single packet may reside. Packets travel around the ring by being synchronously transferred from one ring segment to the next; packets are assumed to travel in the counter-clockwise direction. As long as a packet on a particular segment is not destined for the associated station, on-station and local ring transfers may occur concurrently. If a ring packet is to be delivered to a module on a station, its delivery takes precedence over on-station transfers. Packets destined for another station are switched onto the station bus and local ring if the local ring segment does not contain a valid packet.
The inter-ring interfaces require FIFO buffers to store packets due to the possibility of packet collisions. A collision will occur if in a given cycle, input packets from both rings are to be routed to the same output. A simple interface is shown in Figure 2. As with the delivery of packets to a station from a local ring, packets on the central ring have priority over those on the local ring.
3 The Cache-Consistency Protocol
In this section, we present a consistency protocol in the context of the target multiprocessor. The protocol exploits the broadcast capability of rings and uses snooping within stations to maintain the caches consistent. Packet filters within each network node