
1 Introduction

1.1 Background

Computer networks and distributed systems are growing very fast. The Internet [C91] is estimated to double in size every 6 to 12 months, and it currently connects about 4,000,000 hosts. Most of these hosts belong to end users, whose involvement in control functions is limited (although they do participate in many distributed computing activities). However, even at a ratio of 100 to 1000 hosts per router, the number of control nodes is estimated to be in the tens of thousands. Moreover, the Internet is not the only huge and fast-growing network; all the major telecommunication companies in the world are working hard to realize the celebrated "information super-highway".

Another characteristic of the emerging networks is their diversity. Numerous manufacturers are involved in producing the hosts, routers and cables for these networks. Even when equipment from the same source is used, there are significant differences in the age, size, speed, reliability, and many other attributes of the equipment and software used. Moreover, the Internet is not managed by one single authority, but rather by thousands of organizations, governed by very different policies and considerations. Indeed, the parties concerned strive for compatibility, but many differences exist nevertheless.

In such a diverse environment, faults of many different types, leading to information inconsistency, are unavoidable. Indeed, coping with faults, and devising fault-tolerant solutions for various problems, is one of the most active research areas in networking and distributed systems. However, one striking characteristic of this area is that in many of the solutions proposed in the literature, faults are fixed globally, i.e., by algorithms involving the entire system. Clearly, using global measures to detect and correct errors becomes less and less realistic as networks grow, and it is essential to develop scalable fault-handling tools, that is, solutions that can be applied even in very large networks. In particular, the cost of such tools is required to grow more slowly than the system size. It is also required that the non-faulty parts of the network be able to continue operating even while the faulty parts are recovering; otherwise, no meaningful work can be done.

The approach we propose for tackling this problem is based on the observation that faults are very often of an extremely local nature and involve only a small number of hosts. For example, the famous ARPANET crash [MRR80] was caused by a single node giving all other nodes wrong routing information. (This node "told" the other nodes that it had distance zero from every node.) Moreover, system reliability has been increasing, so the number of