DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT
Section I. Administrative

Abstract Proposal to the Defense Advanced Research Projects Agency (DARPA) BAA 98-02: Next Generation Internet (NGI): Network Engineering
Technical Topic Area: Network Monitoring Architectures

Self-Configuring Active Network Monitoring (SCAN)

by

INFORMATION SCIENCES INSTITUTE
UNIVERSITY OF SOUTHERN CALIFORNIA
LOS ANGELES, CALIFORNIA 90089-1147

Year 1: $xx Year 2: $yy Year 3: $zz Total: $ww

Ramesh Govindan
Cengiz Alaettinoğlu
Deborah Estrin
Mark Handley

PROPOSAL ABSTRACT
ISI Proposal 97-ISI-XX

ISI Technical Contact: Ramesh Govindan
GOVINDAN@ISI.EDU
Phone: (310) 822-1511, Fax: (310) 823-6714

ISI Administrative Contact: Margie Schroeder
MARGIEJ@ISI.EDU
Phone: (310) 822-1511, Fax: (310) 823-6714

Type of Business: Other Educational

Principal Investigators:
Ramesh Govindan, Research Assistant Professor, USC/ISI
Cengiz Alaettinoğlu, Research Assistant Professor, USC/ISI
Deborah Estrin, Associate Professor, USC

Lloyd Armstrong Jr.
Provost and Senior Vice President for Academic Affairs
University of Southern California

Section II: Proposal Abstract

A. Innovative Claims

Future internets will provide differentiated communications services for life-critical distributed applications. Continuous network monitoring is imperative for such internets; SCAN, a scalable and robust distributed monitoring infrastructure, can provide this functionality. The following features distinguish SCAN from existing network monitoring systems:

• Unlike today's localized network management systems, SCAN will enable users at any network location to see continuously updated performance views of the entire internet. Such abstracted network views will allow administrators to dynamically re-engineer internet segments for sustained near-optimal performance.

• Today's monitoring systems only provide device-level status information. SCAN will also display application-perceived performance, which is useful for visualizing component interactions in complex large-scale distributed applications and services.

• SCAN users will be able to correlate application-perceived performance with network-level views. Together with the ability to retrieve recent measurement history, such correlation will help pinpoint hotspot origins or track anomalous application behavior.

A collection of distributed surveyors will form the core of SCAN. These software systems will collect measurements from network nodes within their vicinity, and transmit them to users. Two novel techniques will enable SCAN to deal with the expected volatility and scale of future internets:

Organic Self-Configuration: Like cooperating organisms, SCAN surveyors will continuously self-configure their measurement collection vicinities in response to failed peers or network partitions. This elegant mechanism will ensure complete internet coverage even in the face of frequent large-scale network changes.

Active Instrumentation: Dynamic instrumentation will allow SCAN to scale well to large internets with many concurrent monitoring sessions. Users can instantiate active code fragments that modify the measurement collection behavior of SCAN surveyors in situation-specific ways. Since surveyors will re-configure themselves frequently, these code fragments must be frequently refreshed.

The first real-time diagnostic infrastructure for internets, SCAN will form an integral part of an internet management framework that also includes mechanisms for rapidly re-engineering large internet segments. SCAN is a driving application for important technical advances in self-organizing systems and active technology. But SCAN's lasting contribution will be the greatly increased diagnosability and robustness of future internets.

PROPOSAL ABSTRACT Page 1

B. Technical Rationale

Motivation

In the future, rapidly assembled internets will find increasing civilian and military use. For example, large quickly constituted internets will help efficiently co-ordinate disaster response. Such internets will be expected to function under extreme physical conditions. These ultra-resilient internets (or ultranets) will metamorphose frequently following large-scale network outages or phase changes in the response operation. They will run an ever-changing mix of distributed applications, ranging from distributed scenario simulations to concurrent audio and video conferences.

Because they perform life-critical functions, ultranet applications will expect sustained near-optimal performance from the underlying internet. It will be difficult to statically engineer ultranets for such performance. The mix of networks that constitutes an ultranet may change unpredictably. Besides, many different backup configurations will be exercised in an ultranet; pre-engineering the entire network for all likely configurations may be intractable. Ultranets will, therefore, be re-engineered on-the-fly. To dynamically re-engineer an ultranet, a continuously updated global view of network performance must be available at any location. Large-scale network monitoring is necessary to provide this view.

Ultranets represent a radical design target for SCAN, a self-configuring distributed internet monitoring infrastructure that is the subject of this proposal. SCAN's architecture sharply deviates from today's centralized network monitoring systems, which are neither robust nor scalable. In SCAN, a collection of distributed surveyors will each gather measurements from nodes within its vicinity. Software monitoring tools, which we call netscopes, will continuously receive measurements from one or more surveyors, then display views of application or network performance.

SCAN Netscopes

Some netscopes will be specific to a single application. For example, a netscope might monitor a hierarchy of servers that implements a distributed simulation. Other netscopes will be devoted to specific system components. Examples include netscopes that monitor virtual overlay networks, multicast distribution trees, or name services. Users will be able to run netscopes on any node. Several instances of a netscope may execute concurrently within the network. Several netscopes may execute on the same node.

Two netscopes, the mapper and the tracer, illustrate SCAN's potential:

The Mapper will display the entire physical network at various levels of abstraction. It will portray, on demand, performance characteristics (link utilizations, router queueing delays, etc.) of current and recently used end-to-end paths. The mapper can also depict aggregate performance measures of sections of the topology. Such a tool will provide more comprehensive views of network performance than today's path analysis tools ping, traceroute and pathchar.

The Tracer will display the performance history of large-scale distributed applications that use multicast for group communication. Users can visualize group membership dynamics, temporal shifts in group communication patterns, and the variation in pairwise communication performance. More than today's multicast monitoring and debugging tools rtpmon and mtrace, the tracer will facilitate rapid program or session fault isolation.

Users will compose two or more netscopes to spatially or temporally correlate two different performance views of the internet. For example, a user might superimpose application performance obtained from the tracer with link-specific loss and delay characteristics displayed by the mapper.

SCAN Architecture

To support netscopes, SCAN must provide efficient mechanisms for collecting performance and usage measurements, and transmitting these to netscopes. In SCAN, these mechanisms will be implemented by a set of distributed software components called surveyors (Figure 1). Surveyors will collect complete performance data from nodes within their vicinity; to do this, they will leverage current research [1] on statistical network probing and measurement filtering. Together, surveyors will conspire to ensure coverage of the entire internet. Each netscope will continuously receive measurement streams from one or more surveyors. From these streams, it can construct abstracted images of application performance or network status.

[Figure 1 diagram; labels: Surveyor, Netscope, Surveyor's Range]

Figure 1: SCAN Structure: SCAN consists of a collection of self-configuring surveyors that robustly "covers" the entire internet. Surveyors are also extensible; users may dynamically instrument them to change their data collection behavior in scenario-specific ways. Surveyors deliver performance measurements to netscopes.

This structure is motivated by the need for scalable collection and transmission of performance measurements. At what scales do we expect SCAN to operate? Some netscopes, such as the mapper, may need to display performance measures from each node of the internet. Million-node internets are not inconceivable, especially as smaller devices attach to the network. Netscopes will also need to provide integrated views depicting not just device status, but also the performance of tens or hundreds of application-level processes or systems software components per node.

How do surveyors enable SCAN to scale to large internets? In the absence of surveyors, network nodes would have to collect all possible measurements of interest from every software or hardware component, then transmit all these measurements to all currently executing netscopes. In SCAN, netscopes will dynamically instrument one or more surveyors to collect only application-specific measures of interest, or perform situation-specific computation. For example, a mapper instance could instrument surveyors to collect link utilization measures from a section of the internet. Similarly, a tracer instance could dynamically alter the amount of measurement history that surveyors maintain on its behalf.

Future internets will metamorphose frequently, and surveyors must be resilient to peer failures or network partitions. So that netscopes like the mapper can continuously display accurate and complete internet performance views, surveyors must automatically adjust their measurement collection behavior in response to failures.

Research Directions

In this section, we discuss these two architectural foundations of SCAN: surveyor self-configuration and extensibility. We also briefly discuss two other research directions that will arise in the design of SCAN, measurement transport and visualization.

Surveyor Self-Configuration

In SCAN, each surveyor will collect performance measurements from nodes within its vicinity. These nodes constitute the surveyor's range (Figure 1). How is a surveyor's range determined? With each node, a surveyor will associate a distance that reflects the cost of collecting measurements from that node. A surveyor's range, then, is the set of nodes that are closer to that surveyor than to any other.
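
The range definition above can be illustrated with a minimal sketch. It assumes that every surveyor-to-node cost is already known in one place, which the distributed protocol itself would not require; the function name `assign_ranges`, the cost table, and the tie-breaking rule (lowest surveyor name wins) are hypothetical choices made only for illustration.

```python
from collections import defaultdict

def assign_ranges(distances):
    """Map each surveyor to the set of nodes it is closest to.

    distances: {surveyor: {node: cost}} (hypothetical input; the proposal
    does not specify how collection costs are measured).
    Ties go to the surveyor whose name sorts first (an assumption).
    """
    best = {}  # node -> (cost, surveyor) of the closest surveyor seen so far
    for surveyor in sorted(distances):
        for node, cost in distances[surveyor].items():
            if node not in best or cost < best[node][0]:
                best[node] = (cost, surveyor)
    ranges = defaultdict(set)
    for node, (_, surveyor) in best.items():
        ranges[surveyor].add(node)
    return dict(ranges)

# Two surveyors with overlapping reach; costs are made-up illustration values.
distances = {
    "s1": {"a": 1, "b": 4, "c": 6},
    "s2": {"b": 2, "c": 3, "d": 1},
}
ranges = assign_ranges(distances)
```

Here node "a" falls in the range of "s1", while "b", "c", and "d" fall in the range of "s2", since "s2" reaches each of them at lower cost.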

[1] Work by V. Paxson (LBL) and M. Garrett (Bellcore)

Especially in highly dynamic internets, manually configuring surveyor ranges can often result in erroneous or incomplete coverage. Instead, SCAN surveyors will collectively self-configure their ranges. To do this, each surveyor will periodically expand its current range boundary by a small increment; all surveyors will start with empty ranges. Each surveyor will then announce, to all other surveyors, the set of nodes within its range, and its own distances to these nodes. Based on information received from its peers, each surveyor will independently decide which nodes lie within its range, contracting its own range boundary if necessary. Surveyors will thus iteratively converge on complete internet coverage.
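
The expand, announce, and contract cycle can be sketched as a small round-based simulation. This is an illustration under strong simplifying assumptions, not the proposed protocol: distance is taken to be hop count, all surveyors expand in lock step by one hop per round, announcements reach every peer instantly, and ties go to the surveyor whose name sorts first. The names `hop_distances` and `self_configure` are hypothetical.

```python
from collections import deque

def hop_distances(graph, source):
    """Breadth-first hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def self_configure(graph, surveyors, max_rounds=100):
    """Run expand/announce/contract rounds until every node is covered."""
    dist = {s: hop_distances(graph, s) for s in surveyors}
    boundary = {s: 0 for s in surveyors}   # every range starts empty
    ranges = {s: set() for s in surveyors}
    for round_number in range(1, max_rounds + 1):
        # Expand: each surveyor grows its boundary by one hop and
        # announces the nodes inside it, with its distance to each.
        announced = {}
        for s in surveyors:
            boundary[s] += 1
            announced[s] = {n: d for n, d in dist[s].items() if d < boundary[s]}
        # Contract: each surveyor independently keeps only the nodes that
        # no peer has announced at a smaller (distance, name) pair.
        for s in surveyors:
            ranges[s] = {
                n for n, d in announced[s].items()
                if all((d, s) <= (announced[p].get(n, float("inf")), p)
                       for p in surveyors)
            }
        if set().union(*ranges.values()) == set(graph):
            return ranges, round_number
    return ranges, max_rounds

# A five-node chain a-b-c-d-e with surveyors placed at both ends.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
ranges, rounds = self_configure(graph, ["a", "e"])
```

On this chain, coverage completes in three rounds: the surveyor at "a" ends up with nodes a, b, and c (it wins the tie for c by name), and the surveyor at "e" keeps d and e.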

This loosely-coupled protocol for eventual consensus among surveyors will also allow elegant recovery from failed surveyors. When a surveyor fails, other neighboring surveyors will gradually discover, and incorporate into their own ranges, nodes from the failed surveyor's range. Similarly, when a failed surveyor recovers, it will slowly expand its own range until cost-effective internet coverage is achieved. With this protocol, surveyors will also recover from internet partitions effectively and quickly.

Several research issues need to be addressed in the design of this protocol. First, a surveyor may describe its range in many ways, from enumerating all nodes within its range to listing only nodes on the range boundary. The latter alternative scales better to larger internets, but can delay range convergence. Second, the choice of surveyor-to-node distance functions can affect the dynamics of the protocol. Distance functions also determine the distribution of measurement load across surveyors, and the stability of the protocol. Third, the frequency of range description exchange and expansion also affects recovery from surveyor failure. In designing the expansion and contraction mechanisms, care must be taken to avoid persistent oscillations. Finally, scaling the protocol to very large internets can be accomplished by mechanisms that cluster surveyors or scope-limit range distribution. Such mechanisms can adversely impact range convergence after large-scale network outages.

Surveyor Extensibility

In SCAN, netscopes will dynamically instrument surveyors to collect and process measurements in application-specific ways. Thus, netscopes will only receive performance measurements of interest to them. Dynamic instrumentation will also allow SCAN to flexibly incorporate new applications or monitoring techniques.

Of the several techniques available for dynamically extending surveyor functionality, active code distribution is most appropriate for SCAN. With this, each netscope will instantiate code fragments at one or more surveyors. These code fragments will gather relevant measurements from nodes within the surveyor's range, process the measurements, and transmit them to the netscope. To cope with surveyor failure, netscopes will use soft-state refresh techniques for instantiated code fragments.
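
Soft-state refresh for instantiated code fragments can be sketched as a table of expiry times kept at each surveyor. The class name `FragmentTable`, the fragment identifiers, and the 30-second lifetime are hypothetical; time is passed in explicitly so the example is deterministic.

```python
class FragmentTable:
    """Soft-state table of code fragments instantiated at one surveyor."""

    def __init__(self, lifetime):
        self.lifetime = lifetime   # seconds a fragment survives without refresh
        self.expires_at = {}       # fragment id -> expiry time

    def refresh(self, fragment_id, now):
        """Install the fragment, or extend its lifetime if already present."""
        self.expires_at[fragment_id] = now + self.lifetime

    def expire(self, now):
        """Drop fragments whose netscope has stopped refreshing them."""
        stale = [f for f, t in self.expires_at.items() if t <= now]
        for fragment_id in stale:
            del self.expires_at[fragment_id]
        return stale

    def active(self):
        """Return the identifiers of fragments still installed."""
        return sorted(self.expires_at)

table = FragmentTable(lifetime=30)
table.refresh("mapper-link-util", now=0)
table.refresh("tracer-history", now=0)
table.refresh("mapper-link-util", now=20)   # refreshed; now expires at 50
expired = table.expire(now=35)              # tracer-history lapsed at 30
```

Because state disappears unless it is refreshed, a surveyor that restarts or takes over part of a failed peer's range needs no explicit teardown or recovery messages; the next refresh from the netscope simply reinstalls the fragment.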

While much attention has been focused on remote code execution, instrumenting internet surveyors introduces several new research directions. First, to be able to temporally correlate measurements, netscopes will need synchronized instantiation of code fragments in surveyors. SCAN will contain synchronized code instantiation mechanisms that do not require classical distributed consensus: most consensus algorithms rely on tightly-coupled interactions between participants, an approach that performs poorly in large internets. Second, the design of code refresh techniques is complicated by the need to synchronize code instantiation, and by the need to cope with changing surveyor ranges. Finally, in large internets the number of code fragments instantiated at a surveyor may be unacceptably large. To reduce code sizes, surveyors will dynamically outline code; that is, surveyors will attempt to recognize and extract modules common to two or more instantiated code fragments.
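
The outlining idea can be illustrated with a small sketch. It assumes the hard part, recognizing equivalent modules, is already solved, so that each fragment arrives as a list of named modules; the function name `outline` and the module and fragment names are hypothetical.

```python
from collections import Counter

def outline(fragments):
    """Separate modules shared by two or more fragments from private ones.

    fragments: {fragment_id: [module_name, ...]} (hypothetical representation;
    real code fragments would first need their common modules recognized).
    """
    counts = Counter(
        module for modules in fragments.values() for module in set(modules)
    )
    shared = {module for module, count in counts.items() if count >= 2}
    private = {
        fragment_id: [m for m in modules if m not in shared]
        for fragment_id, modules in fragments.items()
    }
    return shared, private

fragments = {
    "mapper-1": ["poll_links", "smooth", "send"],
    "mapper-2": ["poll_links", "threshold", "send"],
    "tracer-1": ["poll_groups", "smooth", "send"],
}
shared, private = outline(fragments)
```

In this example the surveyor would store `poll_links`, `smooth`, and `send` once, and keep only `threshold` and `poll_groups` as per-fragment code.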

Other Research

Netscopes receive continuous measurement streams from surveyors. This many-to-many communication is most efficiently implemented using network-layer multicast. Reliable multicast protocols for delivering measurements will encode related streams (e.g., the same metric sampled at different rates) as layers. Each layer may have different reliability requirements; frequently sampled streams may be moderately loss-tolerant, while time averages may not.
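
As a rough illustration of layering, the sketch below splits one metric into a coarse time-average layer marked as requiring reliable delivery and a raw layer marked as loss-tolerant. The `Layer` type, the `encode_layers` function, and the window size are assumptions made for this example; no actual multicast transport is modeled.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    period: int      # base sampling intervals between successive values
    reliable: bool   # whether lost values should be recovered
    values: list

def encode_layers(samples, window):
    """Encode one metric as a reliable average layer and a lossy raw layer."""
    averages = [
        sum(samples[i:i + window]) / len(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]
    return [
        Layer("time-average", window, True, averages),
        Layer("raw", 1, False, list(samples)),
    ]

# Six made-up utilization samples, averaged over windows of three.
layers = encode_layers([10, 20, 30, 40, 50, 60], window=3)
```

A netscope that needs only trends could subscribe to the average layer alone, while one diagnosing a transient fault could add the raw layer and tolerate occasional gaps.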

From the received measurement streams, netscopes display correlated, continuously changing performance views. Even though each netscope is dedicated to a different measurement task, all netscopes will share sophisticated visualization techniques that enable rapid and accurate fault diagnosis and isolation. SCAN presents several unique opportunities for exploring novel visualization designs. For example, SCAN performance views may display an individual router interface, a distribution tree spanning a subset of network nodes, or the entire internet. Netscopes will contain new visual metaphors and overlay techniques that allow users to seamlessly traverse abstracted views of these entities. Similarly, performance views of future internets will be highly dynamic. Netscopes will implement uncluttered fading techniques for displaying stale or historic measurement information. Finally, users will need to dynamically alter representations of performance views. Netscopes will have a vast array of user-selectable visualization techniques at their disposal. In exploring the visualization aspects of netscopes, we propose to collaborate with researchers at Lucent's Bell Laboratories. The collaboration will leverage Lucent's large portfolio of innovative visual applications that embody fundamental new research insights into presenting information.

Constructive Plan

Design: Several performance criteria will drive the design of SCAN components: the rate of coverage convergence after network outages, latencies imposed by the instrumentation and transport protocols, the degree to which SCAN itself perturbs the internet, and so on.

Evaluation: We will thoroughly quantify, by simulation, SCAN's performance over a wide range of internet topologies and stability characteristics. We intend to use the VINT simulation tools for this purpose.

Implementation: We will prototype the surveyor component of SCAN and the two netscopes (the mapper and the tracer) described above. We intend to deploy this SCAN prototype on an experimental research testbed such as CAIRN.

Related Work and Conclusions

SCAN complements, and will directly utilize, research that is defining network performance metrics and statistical measurement interpretation techniques. SCAN differs significantly from existing commercial network monitoring systems. To ensure reliable delivery of measurement information, some of these systems require dedicated out-of-band communication channels. In others, measurement data is collected and displayed at a centralized monitor. While these systems are acceptable for small campuses, they clearly do not meet the scalability and robustness requirements of large and diverse internetworks.

Although we have focused on its performance monitoring capabilities, SCAN will be flexible enough to accommodate security monitoring needs as well. For example, netscopes may easily instrument one or more surveyors to detect and track intrusion attempts. SCAN will be one component of a larger infrastructure for network management; a mechanism for safely effecting large-scale network re-engineering is a complementary component. Practical experience with large-scale self-configuration and dynamic instrumentation will be one notable contribution of SCAN. However, its lasting contribution will be the increased diagnosability and manageability of future internets.
