Event Displays for Debugging and Managing Distributed Systems

David Taylor

Department of Computer Science

University of Waterloo

Waterloo, Ontario, Canada N2L 3G1

e-mail: dtaylor@uwaterloo.ca

http://ccnga.uwaterloo.ca/~dtaylor/

Abstract

Distributed-systems researchers at the University of Waterloo have developed tools for capturing events, such as interprocess communication, from distributed applications and displaying them using process-time diagrams. The research has primarily concerned developing effective techniques for dealing with large collections of processes and large numbers of events, and then exploring these techniques through prototype implementations. The techniques were originally developed primarily for debugging applications, but are now being examined for monitoring and managing applications in production. This paper describes the techniques and tools that have been developed, as well as their intended use in monitoring and management.

1 Introduction

Designing, developing, debugging, and managing distributed applications present significant challenges in addition to those occurring with centralised applications. The interactions among application components must be understood and controlled, as those components execute concurrently with each other, without a global clock to provide a common time reference. Research work on this topic at the University of Waterloo has concentrated on providing assistance in understanding such applications, by capturing relevant events and displaying them. In most cases, the display is based purely on the fundamental partial-order relationship between events, ignoring their real time of occurrence. This avoids anomalies that can occur because of poor clock synchronisation but, more importantly, it helps to elucidate the logical structure of the distributed system, since a partial-order relationship also indicates potential causality.
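The partial-order relationship mentioned above can be made concrete with a small sketch. It assumes vector timestamps as the ordering mechanism, which is one common way to capture the partial order and is not necessarily what the Waterloo tools use; the trace format and the names `stamp_events`, `happened_before`, and `concurrent` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Clock = Tuple[int, ...]
# Hypothetical trace record: ("local", proc), ("send", proc, msg_id),
# or ("recv", proc, msg_id), listed in any order consistent with causality.
Event = Tuple


def stamp_events(n_procs: int, trace: List[Event]) -> List[Clock]:
    """Assign a vector timestamp to every event in the trace."""
    clocks = [[0] * n_procs for _ in range(n_procs)]
    in_flight: Dict[str, Clock] = {}  # msg_id -> timestamp of its send event
    stamps: List[Clock] = []
    for ev in trace:
        kind, p = ev[0], ev[1]
        if kind == "recv":
            # Merge the sender's knowledge into the receiver's clock.
            sent = in_flight.pop(ev[2])
            clocks[p] = [max(x, y) for x, y in zip(clocks[p], sent)]
        clocks[p][p] += 1  # every event advances its own process's entry
        stamp = tuple(clocks[p])
        if kind == "send":
            in_flight[ev[2]] = stamp
        stamps.append(stamp)
    return stamps


def happened_before(a: Clock, b: Clock) -> bool:
    """True if the event stamped a precedes the event stamped b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a: Clock, b: Clock) -> bool:
    """True if neither event precedes the other (no potential causality)."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Process 0 sends message "m"; process 1 does local work, then receives "m".
trace = [("send", 0, "m"), ("local", 1), ("recv", 1, "m")]
send, local, recv = stamp_events(2, trace)
```

In this example the send precedes the receive, while the send and the local event on process 1 are concurrent: no real-time clock is consulted, so poor clock synchronisation cannot produce an anomalous ordering.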

Much of our work has concerned dealing with large execution histories: many processes and many events. As in many other parts of Computer Science, abstraction is an appropriate method to use in dealing with such situations. For example, when the number of processes in an application becomes large, displaying all the processes simultaneously becomes impractical. It is possible, of course, to provide a scrollbar allowing a user to move back and forth in the set of processes, but the ideal is to have everything relevant on the display screen at once. We have therefore developed an abstraction technique that groups processes into "clusters", eliminating currently unwanted detail from the display.
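One way to picture process clustering is the minimal sketch below. It assumes that the detail hidden by a cluster is the communication among its own members, so that only messages crossing a cluster boundary remain visible; this section does not spell out the actual rules, and the function `cluster_view` and its data layout are hypothetical.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (sending process, receiving process)


def cluster_view(messages: List[Message],
                 cluster_of: Dict[str, str]) -> List[Message]:
    """Collapse clustered processes into single display rows.

    Processes absent from cluster_of are shown individually. Messages
    whose endpoints fall in the same cluster are dropped from the view
    (an assumption about what counts as unwanted detail).
    """
    view: List[Message] = []
    for sender, receiver in messages:
        src = cluster_of.get(sender, sender)
        dst = cluster_of.get(receiver, receiver)
        if src != dst:
            view.append((src, dst))
    return view


# p1 and p2 are grouped into cluster "A"; p3 stays visible on its own.
messages = [("p1", "p2"), ("p2", "p3"), ("p3", "p1")]
view = cluster_view(messages, {"p1": "A", "p2": "A"})
```

Here the three process rows shrink to two (cluster A and p3), and the p1-to-p2 message disappears from the display while the traffic between the cluster and p3 is kept.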