Appears in Supercomputing ?92, November 1992, Minneapolis, MN
Optimal Tracing and Replay for Debugging Message-Passing
Robert H. B. Netzer Barton P. Miller
Dept. of Computer Science Computer Sciences Dept.
Brown University University of Wisconsin-Madison
Box 1910 1210 W. Dayton St.
Providence, RI 02912 Madison, WI 53706
A common debugging strategy involves reexecuting a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of variations in message latencies and process scheduling, different runs on the given input may produce different results. This non-repeatability is a serious debugging problem, since an execution cannot always be reproduced to track down bugs. This paper presents a technique for tracing and replaying message-passing programs for debugging. Our technique is optimal in the common case and has good performance in the worst case. By making run-time tracing decisions, we trace only a fraction of the total number of messages, gaining two orders of magnitude reduction over traditional techniques which trace every message. Experiments indicate that only 1% of the messages often need be traced. These traces are sufficient to provide replay, allowing an execution to be reproduced any number of times for debugging. Our work is novel in that we use run-time decisions to detect and trace only those messages that introduce nondeterminacy. With our strategy, large reductions in trace size allow long-running programs to be replayed that were previously unmanageable. In addition, the reduced tracing requirements alleviate tracing bottlenecks, allowing executions to be debugged with substantially lower execution-time overhead.
Parallel programs can be nondeterministic. When processes communicate by passing messages, variations in scheduling and message latencies can cause two executions of the same program (on the same input) to produce different results. Such nondeterminacy may be intended, but it can cause serious problems while debugging: subsequent executions of the program may not reproduce the hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
This work was supported in part by National Science Foundation grants CCR-8815928 and CCR-9100968, Office of Naval Research grant N00014-89-J-1222, and a grant from Sequent Computer Systems Inc.
original bug. This non-repeatability makes it difficult to use traditional sequential debugging techniques that require repeated execution. Therefore, a mechanism for tracing and then replaying a program execution is an essential part of parallel program debugging. A critical cost in such a mechanism is the cost of tracing. In this paper we present a trace and replay mechanism for message-passing parallel programs that is optimal in the common case and has good performance in the worst case. We significantly reduce the number of messages that need be traced, improving by up to two orders of magnitude earlier techniques that trace every message. With such a reduction, long-running programs can now be replayed that could not have been previously replayed. Our tracing works by making decisions at run-time, tracing often the minimal set of messages needed to provide replay. Experiments show that only 1% of the messages usually need be traced.
In a trace and replay scheme, the order in which messages are delivered (but not their contents) is first traced during execution. These traces are then used during re-execution to force each message to be delivered to the same operation as during the traced execution. Tracing the original execution is necessary because some messages may race with others, introducing nondeterminacy into the execution. Two messages race if they are simultaneously in transit and either could arrive first at some receive operation. If the original order in which racing messages are delivered is not recorded, their order cannot always be reproduced during replay. However, by tracing the original message deliveries and then forcing them to occur during replay, the computation and all its messages will be exactly reproduced?. An erroneous execution can then be repeatedly replayed to analyze the execution more carefully and gain information about bugs.
Our main result is a tracing strategy using on-thefly
detection of racing messages that is optimal in most
? Interactions with the external environment must also be reproduced (such as return values from system calls). However, these interactions must be reproduced to replay sequential programs as well.