A Protocol Description Language for
Customizing Failure Semantics*
Daniel C. Sturman and Gul A. Agha
Department of Computer Science,
University of Illinois at Urbana-Champaign,
Urbana, IL 61801
fsturman | email@example.com
To optimize performance in a fault-tolerant distributed system, it is often necessary to enforce different failure semantics for different components. By choosing a custom set of failure semantics for each component and then by enforcing the semantics with a minimal set of protocols for a particular architecture, performance may be maximized while ensuring the desired system behavior. We have developed DIL, a language for specifying, on a per-component basis, protocols that transparently enforce failure semantics. These protocols may be reused with arbitrary components, allowing the development of a library of protocols.
Although descriptions of dependability protocols in the literature are relatively simple and concise, incorporating the protocols into an application often requires custom routines which intermix code for the application with that of the protocol. Such intermixing significantly increases the complexity of the code. One way to avoid the resulting complexity is to use a system or language that has been developed to support fault-tolerant computing [5, 13, 11]. However, such support relies on a fixed set of protocols and thereby commits the programmer to defining components with a single failure semantics (by failure semantics we mean the expected behavior of an application when a component fails).
*The research described has been made possible by support from the Office of Naval Research (ONR contract numbers N00014-90-J-1899 and N00014-93-1-0273), by an Incentives for Excellence Award from the Digital Equipment Corporation Faculty Program, and by joint support from the Department of Defense Advanced Research Projects Agency and the National Science Foundation (NSF CCR 90-07195).
A single failure semantics for all components may be satisfactory in some cases. In many systems, however, performance may be improved by having different failure semantics for different components. For example, providing a few low-level but highly dependable servers may save resources. In such a system, the failure semantics of these servers permit fewer unmasked failures than the failure semantics of the clients. As Cristian has argued, the use of different failure semantics for different components of a system may allow optimization of both performance and dependability.
Unfortunately, constructing each component with an optimal set of protocols further complicates the code of distributed applications. Moreover, the failure semantics may change if the system requirements change, or if the program is ported to a new platform. To address these problems, we have developed the Dependability Installation Language, DIL, which allows dependability protocols to be developed independently of the applications with which they may be used. That is, DIL allows the specification of dependability protocols that transparently enforce a failure semantics.
Protocol specifications in DIL are implemented by system architecture facilities that allow communication between application components to be intercepted; protocols are specified in terms of the communication interface of components. Thus the specification of fault tolerance is orthogonal to that of a component's functionality.
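The idea of enforcing a protocol by intercepting communication can be illustrated with a minimal Python sketch. It is not DIL syntax, and every name in it (`Component`, `ReplicationProtocol`, `run_client`) is a hypothetical introduced only for illustration. The sketch assumes a crash failure model and a simple replication protocol that masks the crash of a replica; the application code contains no fault-tolerance logic.

```python
from typing import Any, Callable, List


class Component:
    """Application component: pure functionality, no fault-tolerance code."""

    def __init__(self, name: str, handler: Callable[[Any], Any]) -> None:
        self.name = name
        self.handler = handler
        self.alive = True  # set to False to simulate a crash failure

    def receive(self, message: Any) -> Any:
        if not self.alive:
            raise ConnectionError(f"{self.name} has crashed")
        return self.handler(message)


class ReplicationProtocol:
    """Hypothetical protocol placed on a component's communication interface.

    It exposes the same receive() interface as a Component, so senders are
    unaware that their messages are being intercepted.
    """

    def __init__(self, replicas: List[Component]) -> None:
        self.replicas = replicas

    def receive(self, message: Any) -> Any:
        replies = []
        for replica in self.replicas:
            try:
                # Deliver the intercepted message to every replica.
                replies.append(replica.receive(message))
            except ConnectionError:
                pass  # mask the crash of an individual replica
        if not replies:
            raise ConnectionError("all replicas have crashed")
        return replies[0]


def run_client(server: Any, requests: List[int]) -> List[Any]:
    """Client code: identical whether or not a protocol is installed."""
    return [server.receive(request) for request in requests]


def double(x: int) -> int:
    return 2 * x


primary = Component("primary", double)
backup = Component("backup", double)
server = ReplicationProtocol([primary, backup])

primary.alive = False  # the primary crashes; the client is unaffected
print(run_client(server, [1, 2, 3]))  # prints [2, 4, 6]
```

Because the protocol is written only against the `receive` interface, the same class could be placed in front of any component offering that interface, which is the kind of reuse the text describes.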
Protocols may be installed on individual components, supporting the customization of failure semantics. The result is a software development environment where programmers may develop applications without the additional complexity of intermixing code for