page 1  (5 pages)
2to next section

Applying Scalable Read Consistency

to Data Replication

THD-BS-1995-01

January 1995

C. Liebig, H.-H. Pagnia, F.Schwappacher, O. Theel

Department of Computer Science

University of Darmstadt

Technischer Bericht

?

Technische Hochschule Darmstadt

Fachbereich Informatik

To appear in:
13th IASTED International Conference
Applied Informatics
February 1995

Applying Scalable Read Consistency to Data Replication

C. Liebig, H.-H. Pagnia-Koch, F. Schwappacher O. Theel

Dept. of Computer Science Dept. of Computer Science

Technical University of Darmstadt University of California, Riverside

Germany U.S.A.

Abstract

By allowing read operations to succeed even when only an insufficient number of replicas is accessible, the read availability can be significantly increased. This paper presents such a weak consistency approach for replicated data in a networking environment. A method for estimating the degree of correctness" of those modified read operations is proposed. For demonstration purposes, we illustrate the advantages of our approach by an example scenario and apply the method to the Majority Consensus Protocol. However, the approach is applicable to a large number of even more sophisticated protocols.

Keywords
Distributed Systems, Software Fault Tolerance, Data Replication, Highly Available Operations, Weak Consistency

Introduction

People who have suffered severe burns must immediately undergo special medical treatment. Since the required medical equipment is very expensive, only a few hospitals are equipped with proper burn units. When an accident has happened, the paramedics on site must instantly be able to derive which hospital can accommodate the burn victims. A straightforward method for the parademics to obtain such information is to sequentially contact the appropriately equipped hospitals. Unfortunately by doing this, they waste precious time with devastating impacts on the victims' state of health. By storing the required information about available capacities in a computer at each hospital, it is possible to let the paramedics retrieve the information for all hospitals directly and instantaneously.

Consider a hospital computer network with five locations in Aachen, Berlin, Chemnitz, Darmstadt, and Erlangen as shown in Figure 1. Each hospital computer manages one replica of an object containing information about all available beds for burn victims. Read operations provided by the hospital computer system therefore retrieve the number of patients that can be taken care of at each hospital. Write operations are used to pre-allocate medical

Berlin

NETWORK

Erlangen
Darmstadt

Aachen
Chemnitz

Figure 1: Five hospitals connected by a computer network

treatment for a certain number of victims at given hospitals. Those operations can be remotely triggered and executed at all sites of the network. The applicability of this approach requires a replica control algorithm in order to keep the replicated object (consisting of the five replicas) in a consistent state. Furthermore, the hospital computer system and the read operation must be highly available: as already pointed out, the paramedics do not have much time to decide where to transport the patients to.

Voting strategies, like Majority Consensus [1] or Hierarchical Quorum Consensus [2], fulfill both requirements identified above: consistency and availability. However, there is still the possibility that read operations cannot be executed since computers or network links may fail. In such a critical situation, the rescue crew must wait until the system has recovered from failure, or make decisions without having further information. The basic idea of our approach is to provide the paramedics in this critical situation with as much information as currently available. Although not all replicas which are required to execute the read operation with 100% consistency might be available, there is still a chance that at least one replica with the most recent data is among those currently accessible. Since there is othing to loose", tolerating such an incomplete read operation makes sense in the scenario given above. Furthermore, at-

taching some sort of consistency estimation to the result of the read operation additionally supports the decision process. A typical answer for a read operation is e.g.: The Elisabethenstift Hospital in Darmstadt can accommodate five patients. This value is current with a probability of 95%". The reader may agree that, although the answer might be outdated with a probability of 5%, making a decision based on this result is definitely better than not receiving any information at all. We call the consistency estimation level of confidence (LOC).

In the next section, we show how this level of confidence is generally derived. Additionally, we demonstrate the impact of our technique on the example scenario, assuming that Majority Consensus is used for replica control. Subsequently, we relate our approach to previous work. A short summary and a description of our ongoing work concludes the paper.

The Incomplete Read Operation Ap-

proach

Let R be a replicated object with the replicas r1; : : : ; rn managed by a voting strategy [3]. R can be denoted as R := fr1; : : : ; rng.

In order to maintain the consistency of R, read (write) requests must lock a predefined subset of the replicas, called read (write) quorum. A version number which is attached to each replica, is increased whenever the replica participates in a write operation. Replicas with the highest version number are called up-to-date (or current) replicas. They contain the current value of the replicated object.

By relaxing the consistency criterion as used in the original protocol for read operations, it is possible to perform read operations which access only a subset of a complete" read quorum. We call such a subset of replicas an incomplete read quorum.

In the following we illustrate the conventions given above. Assume, R:=fA,B,C,D,Eg is the replicated object for managing the available capacities at all five hospitals with one replica located at each hospital computer, e.g. A in Aachen, B in Berlin etc. We assume furthermore that R is managed by the Majority Consensus replica control algorithm. This means that either a read or a write quorum must consist of exactly three replicas. Figure 2 shows a snapshot of the replicated object, i.e. the five replicas with attached version numbers. The highest version number is 7 and the latest write quorum was fB,D,Eg. A possible read quorum is fA,C,Eg containing the up-to-date replica E. fA,Cg or fA,Eg are examples for incomplete read quorums. fA,Cg does not contain an up-to-date replica. This shows how reading the replicas of an incomplete read quorum can lead to an incorrect result. If however, incom-

NETWORK
version 5

version 7

version 7

version 7
version 6C
D E

B

A

Figure 2: The five replicas of R and their version numbers

plete read operations are prohibited, then a read operation accessing the replicas fA,Eg must be rejected, even though it contains the up-to-date replica E.

To determine the probability of the event he read quorum contains an up-to-date replica", we consider all write quorums that have possibly been updated by the most recent write operation. Thus, the level of confidence estimates the probability of the event he latest write quorum intersects with the currently used read quorum".

Let erq be an incomplete read quorum and let v(ri) be the version number of replica ri 2 R. We define brq to be the subset of erq consisting of replicas with the version number max no, where max no := max (v(ri) j ri 2 erq):

brq := frj j rj 2 erq ^ v(rj) = max nog (1)

For example, if erq=fA,Eg, v(A)=5 and v(E)=7, then max no=7 and brq=fEg.

A read quorum is called up-to-date, if it contains an up-to-date replica. If the latest write operation changed the version number to max no then

ffl erq is up-to-date,

ffl all replicas in brq participated in the latest write operation,

ffl the remaining replicas in erq n brq did not participate in the latest write operation.

Let WQ be the set of all minimal write quorums. A minimal write quorum is a write quorum which is correct for a given voting strategy but does not contain any other correct write quorum. We now denote the set of write quorums which { if chosen to execute the write operation { lead to the situation that erq is up-to-date:

d
WQ( erq) := fwq j wq2 WQ ^ brq ? wq ^

( erq n brq) wq = ffg (2)

Applied to our sample erq, this means that we must select quorums containing replica E but not A. Thus, d
WQ(fA,Eg)=ffB,C,Eg, fB,D,Eg, fC,D,Egg.

Next, we consider the situation that max no is not the maximal version number of all replicas, i.e. no replica in erq reflects the most recent update. We can observe that

ffl erq is not up-to-date,

ffl brq is not up-to-date,

ffl no replica in erq participated in the latest write operation.

The set of write quorums which { if chosen to execute the write operation { leads to the situation that erq is not up-to-date, is defined by

WQ( erq) := fwq j wq2 WQ ^ erq wq = ffg: (3)

In our example, we can derive from Figure 2 that fB,D,Eg has been the latest write quorum and that the replicas with version number 7 are up-to-date. However, for a read operation accessing erq=fA,Eg another potential latest write quorum is fB,C,Dg; it may contain replicas with a version number greater than 7.

To obtain the level of confidence, we must calculate the probability of the event he latest write operation was based on a write quorum in d WQ".
We assume pri 2 [ ; 1] to be the probability that
replica ri is available. The set of available replicas is referred to as system state. The probability for the system to be in state Z is

p(Z) :=
Y

ri2Z
pri ?
Y

rj =2Z
(1 ? prj ): (4)

For the example, we assume pri = :9 for all replicas
of R. Then, the probability that the system is in state Z=fA,B,C,D,Eg calculates to p(Z) = :95 = :59049.
In system state Z , write operations can only be performed with quorums solely containing available replicas. We define the function reduce to select these quorums from a given set of quorums Q :

reduce(Q; Z) := fwq j wq 2 Q ^ ri 2 wq ^ ri 2 Zg (5)
Suppose that the system was in state Z when the latest write operation took place. From the version numbers of the replicas in erq, we can derive that the set of possible write quorums was reduce( WQ( erq) [ d
WQ( erq); Z). We assume that an
arbitrary write quorum was chosen from the set of all possible write quorums for the update operation. The probability that this quorum is in d WQ( erq) can
be calculated as follows:

A( erq; Z) = jreduce( d
WQ( erq); Z)j

jreduce( WQ( erq) [ d
WQ( erq); Z)j (6)

Applied to our example and the given system state Z=fA,B,C,D,Eg, A(fA;Eg; Z) computes to

A(fA;Eg; Z) := jffB;C;Eg;fB;D;Eg;fC;D;Eggj jffB;C;Dg;fB;C;Eg;fB;D;Eg;fC;D;Eggj
= 3=4

Finally, we accumulate the conditional probabilities A( erq; Z) given Z for all system states Z that permit a write operation. Because the latest write operation must have accessed a quorum in WQ( erq) [ d WQ( erq),
we obtain the relevant set of system states

S( erq) := f Z j Z2 P(R) ^ (7)

reduce( WQ( erq) [ d
WQ( erq); Z) <> g:

(P(R) denotes the power set of R.) We derive the level of confidence for a particular read quorum erq as follows:

LOC( erq) :=
X

Z2S(erq)

(A ( erq; Z) ? p(Z)) ? 1
X

Z2S(erq)

p(Z)

(8)
As shown above, the set of possible write quorums in the example is ffB,C,Dg, fB,C,Eg, fB,D,Eg, fC,D,Egg. Therefore, S(fA,Eg)=ffB,C,Dg, fB,C,Eg, fB,D,Eg, fC,D,Eg, fB,C,D,Eg, fA,B,C,Dg, fA,B,C,Eg, fA,B,D,Eg, fA,C,D,Eg, fA,B,C,D,Egg. For calculating the level of confidence using equation (8), we obtain LOC (fA,Eg)=75%.

Note that the level of confidence of a complete read quorum, e.g. fA,B,Eg, is always 100%. For our example, it can be shown that the LOC is 60% if a single replica is read. For reading two replicas the LOC is 69.8% if both have equal version numbers and 75% otherwise. If we tolerate read quorums with a level of confidence of at least 75% we can increase the read availability from 99:1% to 99:95%.

Related Work

Replication is a well-known technique for increasing data availability and decreasing access costs in distributed systems. In order to hide the complexity of managing the replicas from the user, even in the presence of computer and network failures, replica control algorithms are needed. Many replica control algorithms have been proposed, favoring either low access costs or high data availability. In [4] these algorithms are classified as optimistic or pessimistic replication strategies.

Assuming that computer or link failures are unlikely to occur, optimistic strategies [5, 6, 7] allow concurrent read and write access operations. As a consequence, replicas can become mutually inconsistent and must be repaired" later; either by a special

merge procedure or by rolling back the conflicting access operations. In the worst case, the latter possibility can result in cascading rollbacks { a highly undesirable situation. Some replication schemes [8, 9] provide weak consistency, allowing write operations not to be propagated to all replicas at once. However, a severe drawback of these schemes is that consistency cannot be guaranteed.

Contrary to that, pessimistic replication strategies [1, 10, 11, 12, 13, 14, 15] always keep the replicated objects consistent by prohibiting concurrent write operations as well as simultaneous executions of read and write operations. Therefore, operation requests which might lead to future inconsistencies will be eventually rejected, resulting in decreased operation availabilities.

Conclusions and Future Work

We presented a method which is suited to increase the read availability of pessimistic replication strategies by weakening the read consistency. If at least one replica can be accessed by a read operation, some { although not necessarily the correct { information can be returned to the client. As a measure to estimate the reliability of this information we have introduced a so-called level of confidence".

We have applied our approach to a scenario where medical resources are managed by a distributed computer system. In this application scenario, high read availability as well as strong write consistency are essential. These requirements hold true for a large class of applications: e.g. electronic warehouse catalogues or stock market management.

As a general rule, the proposed method performs very well when the replicated object changes only infrequently and if two subsequent versions are expected to differ only slightly. In this case, reading an old version of the replicated object can still provide valuable information.

Based on a general specification model for voting protocols described in [3], we are currently analyzing different classes of voting strategies. Examples are Hierarchical Quorum Consensus [2], the Logarithmic Protocol [12], the Tree Quorum Protocol [13], the Grid Protocol [10], and the Hierarchical Grid Protocol [14]. Furthermore, we vary different parameters, like the number of replicas and the site availabilities. As a first result, we could identify voting strategies with a level of confidence higher than 90% and significantly increased read availability.

References

[1] R. H. Thomas. A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases.

ACM Transactions on Database Systems, 4(2):180{ 207, 1979.

[2] A. Kumar. Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data. IEEE Transactions on Computers, 40(9):996{1004, 1991.

[3] O. Theel. General Structured Voting: A Flexible Framework for Modelling Cooperations. In Proc. of the 13th International Conference on Distributed Computing Systems, Pittsburgh, PA, pages 227{236. IEEE, May 1993.

[4] S. B. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in Partitioned Networks. Computing Surveys, 17(3), 1985.

[5] S. B. Davidson. Optimism and Consistency in Partitioned Distributed Database Systems. ACM Transactions on Database Systems, 9(3):456{481, 1984.

[6] H. Garcia-Molina, T. Allen, B. Blaustein, et al. Data{Patch: Integrating Inconsistent Copies of a Database after a Partition. In Proc. of the 3rd Symposium on Reliability in Distributed Software and Database Systems, pages 38{44. IEEE, 1983.

[7] D. S. Parker, G. J. Popek, and G. Rudisin. Detection of Mutual Inconsistency in Distributed Systems. IEEE Transactions on Software Engineering, 9(3), 1983.

[8] J. E. Allchin. A Suite of Robust Algorithms For Maintaining Replicated Data Using Weak Consistency Conditions. In Proc. of the 3rd Symposium on Reliability in Distributed Software and Database Systems, pages 47{56. IEEE, 1983.

[9] R. A. Golding. A Weak-Consistency Architecture for Distributed Information Services. Technical Report UCSC{CRL{92{31, Dept. of Computer and Information Sciences, University of California, Santa Cruz, Santa Cruz, CA 95064, July 1992.

[10] S. Y. Cheung, M. Ahamad, and M. H. Ammar. The Grid Protocol: A High Performance Scheme for Maintaining Replicated Data. In Proc. of the 6th International Conference on Data Engineering, pages 438{445, February 1990.

[11] S. Jajodia and D. Mutchler. Dynamic Voting. In Proc. of the ACM SIGMOD, pages 227{238, 1987.

[12] H.-H. Koch. Exploiting Logical Tree Structures for Data Replication Schemes. In Proc. of the 23rd International Conference of Fault Tolerant Computing Systems, pages 382{391. IEEE, June 1993.

[13] D. Agrawal and A. El Abbadi. The Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data. In Proc. of the 16th VLDB Conference, pages 243{254, 1990.

[14] A. Kumar and S. Y. Cheung. A High Available pN Hierarchical Grid Algorithm for Replicated Data. Information Processing Letters, 40(0):311{ 316, 1991.

[15] O. Theel and H. Pagnia-Koch. General Design of Grid-Based Data Replication Schemes Using Graphs and a Few Rules. In Proc. of the 15th International Conference on Distributed Computing Systems. IEEE, May 1995. (to be published).