A Replicated Monitoring Tool
Darrell D. E. Long†
Computer & Information Sciences
University of California, Santa Cruz
Modeling the reliability of distributed systems requires a good understanding of the reliability of the components. Careful modeling allows highly fault-tolerant distributed applications to be constructed at the least cost. Realistic estimates can be found by measuring the performance of actual systems. An enormous amount of information about system performance can be acquired with no special privileges via the Internet.
A distributed monitoring tool called a tattler is described. The system is composed of a group of tattler processes that monitor a set of selected hosts. The tattlers cooperate to provide a fault-tolerant distributed data base of information about the hosts they monitor. They use weak-consistency replication techniques to ensure their own fault-tolerance and the eventual consistency of the data base that they maintain.
Distributed systems are now pervasive. Few system architects would consider designing a system that could not interact with other systems. Soon it will be rare to find computers that are not connected by a network. With distribution comes an increased incidence of partial failure. Replication of both control and data can be employed to provide systems capable of tolerating partial failures.
To use replication techniques most effectively, it is important to understand the nature of the failures to be masked. Recent studies include analyses of Tandem systems [1, 2] and the IBM/XA system. Research covering heterogeneous systems is less common. Very few such studies have appeared in the open literature, although it is certain that most companies perform reliability studies of their products internally.
† Supported in part by a grant from Sun Microsystems, Incorporated, and the University of California MICRO program.
An earlier study used the Internet to estimate several parameters, including mean-time-to-failure (MTTF) and availability. These estimates were then used to derive an estimate of mean-time-to-repair (MTTR). While this study provided many important results, it suffered from several weaknesses. First, the assumptions made about the distributions that describe host failures may not reflect those found in actual systems. Second, the network that connects the polling host (pollster) to the polled hosts (respondents) can distort the statistics by causing false failures to be reported. As a result, the estimates of the parameters may differ significantly from the actual values. In particular, the derived estimate of mean-time-to-repair can be much larger than expected, since small errors in the availability estimate that are introduced by the intervening network can have a significant effect.
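The sensitivity of the derived repair time can be illustrated with a small calculation. The sketch below assumes the standard steady-state relation A = MTTF / (MTTF + MTTR), which may not be the exact derivation used in the earlier study. The function name `derive_mttr` and all numbers are hypothetical and chosen only for illustration.

```python
def derive_mttr(mttf_hours: float, availability: float) -> float:
    """Solve A = MTTF / (MTTF + MTTR) for MTTR (assumed steady-state model)."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    return mttf_hours * (1.0 - availability) / availability


# Hypothetical values, not measurements from the study.
mttf = 1000.0                 # hours between failures
true_availability = 0.99      # what the hosts actually achieve
measured_availability = 0.97  # lowered by false failures from the network

true_mttr = derive_mttr(mttf, true_availability)
measured_mttr = derive_mttr(mttf, measured_availability)

# Ratio of the derived MTTR to the MTTR implied by the true availability.
inflation = measured_mttr / true_mttr

print(f"MTTR from true availability:     {true_mttr:.1f} h")
print(f"MTTR from measured availability: {measured_mttr:.1f} h")
print(f"inflation factor:                {inflation:.2f}")
```

Under these assumed numbers, an error of two percentage points in availability roughly triples the derived MTTR, because the calculation depends on the small quantity 1 - A.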
Estimates of mean-time-to-failure were based on reported up-times and not the actual time of failure. This was the best information available, since a host is not in a position to give its failure time as it goes down. As a result, there was a bias towards more reliable hosts, which means that the estimate of MTTF is larger than the true value.
Similarly, availability was estimated by the fraction of hosts that were reachable by the pollster. To ensure that only hosts capable of answering were queried, an initial poll was made and only hosts that answered this poll were counted in the availability census. Unfortunately, a significant number of network segments separate the pollster from most respondents, so a network failure may be misinterpreted as a host failure.
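The census procedure and its weakness can be sketched as follows. This is a minimal illustration, not the code of the earlier study: the function `census`, the host names, and the `(host_up, path_up)` ground-truth pairs are all hypothetical. A real pollster sees only whether a reply arrived, so it cannot tell the two causes of silence apart.

```python
from typing import Dict, List, Tuple

# (host_up, path_up): ground truth that the pollster cannot observe directly.
Observation = Tuple[bool, bool]


def answered(obs: Observation) -> bool:
    """A reply arrives only if both the host and the network path are up."""
    return obs[0] and obs[1]


def census(initial: Dict[str, Observation],
           polls: List[Dict[str, Observation]]) -> Tuple[float, float]:
    """Return (observed availability, actual host availability).

    Only hosts that answered the initial poll are counted.
    """
    eligible = [host for host, obs in initial.items() if answered(obs)]
    total = len(eligible) * len(polls)
    if total == 0:
        raise ValueError("no eligible hosts or no polls")
    observed = sum(answered(poll[host]) for poll in polls for host in eligible)
    actual = sum(poll[host][0] for poll in polls for host in eligible)
    return observed / total, actual / total


# Hypothetical data: host "c" misses the initial poll and is excluded.
initial = {"a": (True, True), "b": (True, True), "c": (False, True)}
polls = [
    {"a": (True, True), "b": (True, False), "c": (True, True)},  # path to b down
    {"a": (False, True), "b": (True, True), "c": (True, True)},  # a really down
]

observed, actual = census(initial, polls)
print(observed, actual)
```

In this toy data the single network outage on the path to host "b" is counted as a host failure, so the observed availability falls below the availability the hosts actually delivered.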
The most direct way of determining statistics such as MTTF and availability is through direct measurement. A fault-tolerant monitor is being developed that can be placed at strategic locations around the Internet. Instances of the monitor will be placed to minimize the amount of shared network so that a failure of a router or a link will be unlikely to disable more than one monitor. They replicate their statistics so that even the permanent failure of one monitor will not cause a significant loss of information.
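The effect of replicating statistics between monitors can be pictured with a simple pairwise exchange. The sketch below is an assumption made for illustration, not the monitor's actual design: the class `Monitor`, its methods, the record layout, and the rule that the newest observation wins are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Record = Tuple[int, str]  # (observation time, reported status)


@dataclass
class Monitor:
    name: str
    records: Dict[str, Record] = field(default_factory=dict)

    def observe(self, host: str, time: int, status: str) -> None:
        """Store a locally made observation about a host."""
        self._merge_one(host, (time, status))

    def _merge_one(self, host: str, record: Record) -> None:
        # Keep the newer observation. On equal timestamps the local copy is
        # kept; a real protocol would need a deterministic tie-breaker.
        current = self.records.get(host)
        if current is None or record[0] > current[0]:
            self.records[host] = record

    def exchange(self, other: "Monitor") -> None:
        """Merge records in both directions so each side holds the union."""
        for host, record in list(other.records.items()):
            self._merge_one(host, record)
        for host, record in list(self.records.items()):
            other._merge_one(host, record)


# Hypothetical monitors and hosts.
m1 = Monitor("monitor-1")
m2 = Monitor("monitor-2")
m1.observe("alpha.example", time=10, status="up")
m2.observe("alpha.example", time=20, status="down")
m2.observe("beta.example", time=15, status="up")

m1.exchange(m2)
print(m1.records)
```

After the exchange both monitors hold the same records, so the permanent loss of either one would not lose any observation made before the exchange.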
These monitors are called tattlers since they periodically inquire about other hosts and then tattle to each other about what they learn. The distributed data base is managed using an epidemic replication protocol [5, 6, 7]. Such protocols provide weak consistency guarantees, which are sufficient