Characterizing Communication Interactions of Parallel and Sequential
Jobs on Networks of Workstations ?
Xing Du, Yingfei Dong, and Xiaodong Zhang
High Performance Computing and Software Laboratory
Division of Computer Science
University of Texas at San Antonio
San Antonio, Texas 78249, USA
In this paper, we examine and characterize effects of communication interactions of parallel and sequential jobs on a nondedicated network of workstations. We quantitatively model the interaction process and measure communication delays of parallel jobs caused by the interaction. Measurement results on an ATM network of workstations support our analytical models.
The wide availability of workstations and improving speed of networks have made networks of workstations (NOWs) become important platforms for parallel computation. To address the communication issue on NOWs, current research focuses on two main directions: to provide performance insights and to identify overhead sources of communications   ; and to propose and implement models to reduce the communication overhead   .
In addition to relatively low bandwidths of network communications, there is another issue that affects the performance: communication interactions. A NOW is usually not dedicated for parallel computation. Local users may use their workstations to run their sequential jobs and to access resources across networks.
We examine effects of communication interactions of parallel and sequential jobs on a nondedicated network of workstations. TCP/IP is a popular communication protocol widely used. Based on this protocol, we quantitatively model the interaction and verify the model by measuring communication delays of parallel jobs on an ATM network of workstations. The experimental results support our analytical models.
A NOW can be abstracted as a connected graph HN(W;Net), where
ffl W = fW1; W2; :::; WNg is a set of workstations (N is the number of workstations).
ffl Net is an interconnection network.
?This work is supported in part by the National Science Foundation under grants CCR-9102854 and CCR-9400719, by the Air Force Office of Scientific Research under grant AFOSR-95-1-0215, and by the Office of Naval Research under grant ONR-95-1-1239. Xing Du is on leave from Computer Science Department, Nanjing University, Nanjing, P.R.China.
If it consists of a set of identical workstations, a NOW is homogeneous, otherwise it is heterogeneous. In order to focus on communication issues, we choose a homogeneous system as our target NOW, and assume that a switch-based network is used. This is a representative NOW environment.
The parallel jobs we consider in this paper follow the bulk synchronous model. A job consists of one task per workstation on a fixed number of workstations throughout the execution. The size of each task is similar. The task completes after a number of iterations. In each iteration, phases of local computation alternate with phases of communications and synchronizations. The basic structure of such a parallel job is as follows:
simultaneous tasks for local computation;
communications and/or synchronization;
Three representative communication patterns in parallel programs
are considered: (1) some-to-one: Some tasks send messages
to one specific task; (2) one-to-some: A task sends messages
to some other tasks; and (3) each-to-some: Each task
sends messages to a set of tasks.
We define three parameters to model the traffic flows and their effects on each other.
1. Message transmission rate (MTR)
We define the message transmission rate (MTR) as the average number of messages of one traffic flow transmitted per second over a network link. Suppose the network bandwidth is B Mbps, and the average message size is <= bytes. The maximum MTR ( ) over the network is calculated by:
<= ? 8 :
2. Effective communication ratio (ffi)
The effective communication ratio, denoted as ffi, is defined as the ratio between the effective number of messages transmitted when the congestion occurs and the maximum number of messages the physical link can transmit. So the MTR becomes ffi when the congestion occurs. Obviously, ffi is the function of the number of traffic flows.
3. Communication delay ratio (CDR)
Assume t is the communication time of a parallel job without interaction with local user communications, and T is