
Experiments With a Fully Distributed Chess
Program ?
R.Feldmann P.Mysliwietz B.Monien
Department of Mathematics and Computer Science
University of Paderborn
Germany
Extended Abstract
ABSTRACT
We present some of the recent results of the distributed chess program ZUGZWANG. As pointed
out earlier in [FMM90] we use an algorithm with very good load balancing properties. Our
algorithm enables us to use a stronger version of the Young Brothers Wait Concept. This
stronger version is closely related to the idea of the Delayed Branching Scheduling Strategy of
[Hsu90] for a distributed game tree searching program.
In addition we analyze the costs and gains of a distributed transposition table. This analysis
indicates that a distributed transposition table has large advantages compared to a local or
global one as long as fast message routing is guaranteed. We obtained a speedup of 126 running
our distributed algorithm with 256 processors at tournament speed.
1. INTRODUCTION
In this paper we describe an extension of the Young Brothers Wait Concept that was introduced to reduce search overhead in our distributed algorithm ([VM87, FMMV89, FMMV90, FMM90]). We present experimental data showing that this extension decreases the search overhead even better, resulting in a speedup of 126 using a Transputer system with 256 processors. This extension is inspired by the Delayed Branching Scheduling Strategy of Hsu ([Hsu90]). Hsu presented in his thesis a parallel algorithm that strictly dominates the weaker form of with the deep cutoffs disabled. Hsu claims a lower bound for the speedup to be "something like N/c, where c is about 3 to 5", in the range possibly up to 100000 processors. However
?This work was partly supported by grant no. Mo 285/32 from the DFG and by the ESPRIT Basic Research Action No. 7141 (ALCOM II)
there are no experimental results, due to the fact that his algorithm reflects a special hardware environment that is not available yet.
Only a few experiments have been conducted yet with massive parallelism (> 20 processors) : In [OF88] Otto and Felten reported on a speedup of 101 using a 256 Hypercube processors, however this was achieved by parallelizing a suboptimal version of the sequential algorithm. Another interesting result is the speedup of 12 achieved by Ferguson and Korf in [FK88] using 32 processors for an Othello playing program.
We presented a speedup of 25 with 32 processors and a speedup of 34 with 64 processors in [FMM90]. Our recent results indicate that our algorithm is scalable even beyond the 256 processors with expected good efficiency.
Furthermore we give a cost and gain study of our distributed transposition table presented in [FMM90]. This transposition table is a virtually global table that is physically distributed among all the processors in the network. Access to this table is by routing requests and entries from a demanding processor to the processor responsible for the entry. We show that in the processor range available the cost to access this table by routing is much smaller than the gain resulting from the huge table size. Moreover this approach is easily scalable without introducing bottlenecks (as when using a global transposition table located at a single processor) and without the loss of valuable informations (as when using tables with local access only).
In [MP85] Marsland and Popowich compared the use of local and global transposition tables. Schaeffer uses a hybrid version of these methods for his chess program Sun Phoenix ([MOS86, Sch89]). All these approaches suffer heavily from either the communication bottleneck or the information loss.
A short review of our algorithm is given in section 2., in the section 3. we describe the extension of the Young Brothers Wait Concept. The cost and gain analysis of the distributed transposition table in presented in section 4. In section 5. we give the results gained from the experiments concerning the performance of the transposition table and the overall performance of the algorithm.
2. REVIEW OF OUR DISTRIBUTED ALGORITHM
In this section we will briefly review our distributed algorithm described in [FMMV90, FMM90].
The general idea to use parallelism is to dynamically decompose the game tree that has to be searched. This is done in the following way:
ffl At the start of a game tree search all processors but one are idle. The busy one is responsible for the root and starts to search the game tree.
ffl Processors that search a subtree of the whole game tree produce subproblems that are queued for later evaluation.
ffl An idle processor chooses some other processor randomly and sends a request for work to this processor.
ffl If a processor p1 gets a request from processor p2 and has some subproblems queued, then it sends one of his subproblems to p2. A masterslave relationship is established between p1 and p2. If p1 has no subproblems queued then it sends the request to some randomly chosen neighbor in the processor network unless the message has traveled via a certain number of processors. In this case p1 throws away the request and informs p2 that the request has been thrown away. p2 then starts requesting again.
ffl After a processor has finished the evaluation of its subproblem, it returns the value of the root of its subproblem to its master.
ffl A master informs all its slaves about any relevant narrowing of the search windows immediately.
ffl The eldest son of any node must be completely evaluated before younger brothers of that node may be transmitted (Young Brothers Wait Concept).
This algorithm has very good load balancing properties. This is mainly due to the fact that the masterslave relationships are established dynamically and without any centralized control.
3. EXTENSIONS TO THE YOUNGBROTHERSWAIT CON
CEPT
In this section we present an extension of the Young Brothers Wait Concept. This extension can be viewed as an adaption of the Delayed Branching Scheduling Strategy in the environment of a distributed system.
The Delayed Branching Scheduling Strategy is described in [Hsu90]. The general idea is to avoid parallelism in some parts of the tree. The nodes of the minimal game tree are searched in parallel whenever possible. Some nodes of the rest of the game tree, however, are searched in a way that no brother of a node v is searched unless the evaluation of v has finished. Hsu showed that his parallel algorithm dominates the sequential algorithm without deep cutoffs. However, his algorithm needs a very fast central host processor, which controls the evaluation of the above part of the game tree.
In order to avoid this central host processor we use a stronger version of the Young Brothers Wait Concept to reduce search overhead in our distributed system. To do so we assign the same types to the nodes of the game tree as in [Hsu90]:
ffl The root of the game tree has type 1.
ffl The first successor position of a type 1 node has type 1, all other successors have type 2.
ffl The first successor position of a type 2 node has type 3, all other successors have type 2.
ffl All successor positions of a type 3 node have type 2.
We extend the Young Brothers Wait Concept by the following rule:
Parallelism is used at some type 2 node v to evaluate the successor positions of v only, if all promising moves have been evaluated one at a time before.
After these promising successors have been completely evaluated parallelism is allowed as for any other node in the game tree. A move is called promising if it has been found in the transposition table or if it is a nonnegative capture move or if it is a killer move. The idea is that if a cutoff occurs at node v then this cutoff is produced very often by one of the promising moves. This stronger version of the Young Brothers Wait Concept can be considered as a weaker version of the Delayed Branching Scheduling Strategy.
The first difference is that we restrict parallelism at a node v only if there are still some promising successor positions unevaluated. The second difference to the method of Hsu is that we allow different type 2 siblings that have a cutoff failure after the evaluation of the first successor position to be reexpanded in parallel. In the algorithm of Hsu this is controlled by the host processor. In our distributed algorithm processors would have to communicate in order to guarantee that the above rule is not violated.
We will show in section 5. that this reduction of parallelism results in only a slightly weaker processor load but reduces the search overhead significantly.
4. COSTS AND GAINS OF A DISTRIBUTED TRANSPO
SITION TABLE
In [FMM90] we describe how we use a distributed transposition table in our distributed algorithm. What was left open was the question how much the parallel version of our algorithm profits by the larger transposition table it is allowed to use.
In this section we will discuss the use of a distributed transposition table. We will outline the costs in terms of communication costs and delayed transposition table access. On the other hand we will give some insights, how much the distributed algorithm gains by the enlarged transposition table.
Global transposition tables have been shown to decrease the performance of a parallel game tree search algorithm (see [MP85]). All processors want to access the global transposition table placed at one processor. This leads to a bottleneck in the algorithm. Therefore using this approach the alternatives are to restrict the number of accesses to the transposition table or to overload the communication capabilities of the processor that holds the global transposition table. The first alternative results in a large search overhead of the parallel algorithm, the second alternative results in a delayed transposition table access.
The approach of local transposition tables where every processor can access only his own table has been shown to increase search overhead ([MP85]). Hybrid versions of these two approaches did not perform well ([Sch89]).
Our sequential version of the algorithm uses a transposition table with 16 ? 104 entries. The speed of the sequential version is roughly 400 nodes per second. We use a depth oriented transposition table, i.e. entries of nodes higher in the tree are not overwritten by entries of nodes deeper in the tree. Transposition table access during a depth d search is done only for nodes at depth < d. In the distributed version of our algorithm we make use of a transposition table of size p ? 16 ? 104 where p is the number of processors. Logically we look at all the local tables as one global transposition table, which is distributed over the whole system. Therefore we allow a processor to access the transposition table of any other processor.
In [FMM90] we describe how we distribute the elements of a large transposition table over the distributed system. Remote transposition access is done by message routing. Therefore in larger distributed systems almost every transposition access must be organized by routing messages between two processors. In order to read some transposition table entry, the processor that wants to read sends a readrequest to the processor that holds this entry. This processor reads the contents of the entry and sends them back. In order to store a transposition table entry the processor that wants to store the entry sends the data to the processor that holds the entry. The receiver of a store command handles this command as if its transposition table was used in the sequential algorithm. Thus for almost every node, the transposition table is accessed for, three messages (readrequest, readanswer, store) are routed through the network.
This communication causes two kinds of delays.
1. The read access to the transposition table is delayed by the time necessary to route the messages.
2. The routing of messages as well as the serving of remote transposition table accesses delays the processor, i.e. the performance of the distributed algorithm is less than 400 nodes per second.
To keep the first kind of delay small we use the method of preupdating a position. Whenever a processor p starts the evaluation of a node v and wants to read the transposition entry for node v, p updates the hash function of v, sends a readrequest for v to the processor that holds the entry for v and then updates the position v itself. After the position v is updated p waits for the readanswer message that gives possibly some information about node v. In almost all cases the information from the transposition table does not cause the node v to be cut off. Therefore the update of position v would have been done anyway.
We use the deBruijn graph [deB46] as a processor network. The deBruijn graph has diameter log2(n) where n is the number of nodes. This guarantees that the average distance between two nodes is very small. Therefore messages are routed only via a small number of processors to reach their target processor. Thus answers to readrequest messages will arrive quickly and only a small number of processors is interrupted by the message.
Figure 1: Using transposition tables of various size
We will show in section 5. that both of these delays are kept quite small in our algorithm.
To point out the gains the sequential algorithm gets by a larger transposition table we measured the performance of the sequential algorithm for various transposition table sizes. Those sizes which are larger than 16?104 are obtained by simulating the sequential algorithm in a distributed system.
Figure 1 shows the number of leaves that are searched during a 7ply search on the BratkoKopec set of 24 test positions ([BK82]).
Figure 2 shows the percentage of overwrites and expected overwrites relatively to the number of all stores that happened during the search. The percentage of expected overwrites Povw can be calculated by the formula
Povw := 100 ? (
i=24
X
i=1
si)=(
i=24
X
i=1
pi)
Figure 2: Overwrites and expected overwrites
where si is the number of stores that happened during the search of the ith problem. Here
pi := si ? m ? [1 ? (m ? 1
m )si ]
where m is the size of the transposition table is the number of overwrites that are to be expected during the search of the ith problem. We observe that the percentage of overwrites is a little bit smaller than the percentage of expected overwrites. This is due to our hash function which is slightly better than random.
More important is the fact, that for all tests with an percentage of expected overwrites <= 50% the number of leaves remains nearly constant. Using a 16 ? 104 sized transposition table on the average roughly 18 ? 104 nodes required a transposition access. In the next section we will present speedup data for 8ply searches. During an 8ply searches with one processor roughly 125 ? 104 nodes on the average require a transposition access. This results in a percentage of expected overwrites of 87:2% when a 16 ? 104 sized transposition table is used. Therefore the sequential version would need a transposition table of size >= 78 ? 104 to achieve a percentage of expected overwrites <= 50%. On the one side this means that one has to increase the size of the transposition table by more than a factor of 4.5 per processor in order to achieve a transposition
table for the sequential algorithm which guarantees good performance of an 8ply search. This however is impossible due to hardware constraints. On the other side it turns out that the 256 processor system has only very small advantage compared to the 8 processor system, because both systems have enough transposition table entries to guarantee good performance of 8ply searches. It was impossible for us to run all the sequential 8ply searches for various transposition table sizes. However, if the curves for the 8ply searches can be compared to the curves of the 7 ply searches then this indicates that the sequential version of our algorithm could be speeded up by the use of a larger transposition table by roughly 20% if there were no hardware constraints. A sequential algorithm that is 20% faster would decrease the speedups we present in section 5. by 20%.
5. EXPERIMENTAL RESULTS
In this section we show how effective the distributed chess program ZUGZWANG searches game trees using the above described methods. We present results for speedup, work load, search overhead, delay of the transposition table accesses and the delay caused by routing messages. We will show, that the extension of the Young Brothers Wait Concept decreases the search overhead.
All the results we present here are obtained from an 8ply search on the BratkoKopec set of test positions. The nondeterminism of the transposition access causes differences between the runtimes of the parallel algorithm for single positions of the test set. However, we observed that these effects are very small (<= 3% for 256 processors) if the whole test set is considered.
For fixed search depth d we define the following measures for the performance of our distributed algorithm running a dply search: Let Bi be the ith position from the BratkoKopec set of test positions, Let Ti(p) be the total time, Ii(p) the average idle time and Ni(p) the number of nodes visited by the p processor version of our distributed algorithm for a dply search on Bi. Let Wi(p) be the average amount of time a processor spent waiting for all answers from the transposition table during a dply search on Bi after it has finished the update of the position.
Then we define
SPE(p) := (
24
X
i=1
Ti(p))=(
24
X
i=1
Ti(1))
LD(p) := 100 ? [1 ? (
24
X
i=1
Ii(p))=(
24
X
i=1
Ti(p))]
SO(p) := 100 ? [1 ? (
24
X
i=1
Ni(p))=(
24
X
i=1
Ni(1))]
RD(p) := 100 ? [
P24i=1 Ni(p)
P24i=1(Ti(p) ? Ii(p))=
P24i=1 Ni(1)
P24i=1 Ti(1) ]
TD(p) := 100 ? [(
24
X
i=1
Wi(p))=(
24
X
i=1
Ti(p))]
Figure 3: Speedups for 7ply and 8ply search
to be the speedup, work load, search overhead, delay caused by routing messages and delay caused by the transposition access of the p processor version respectively.
The curves for SPE(p), LD(p), SO(p), RD(p) and TD(p) are given in the three diagrams 3, 4, 5 and 6 for an 8ply search and processor numbers 4,8,16,32,64,128 and 256.
It is worth to note that the average runtime of the 256 processor system for an 8ply search of a test position is 300 sec, which is tournament speed.
The speedup of the 256 processor system is 126:36, the work load is 82%. The search overhead is kept quite small (44:5%) by the extension of the Young Brothers Wait Concept.
The curves in figure 6 show the delay caused by message routing and delayed access to the transposition table. The delay caused by delayed transposition access is slightly increasing. However, it is negligible even for the 256 processor system (5%). The second curve shows the loss of performance that is caused by routing messages. This delay is roughly 10% for the 256 processor system.
Figure 4: Processor load for 7ply and 8ply search
In figure 7 we compare the results of three different versions of ZUGZWANG running on 256 processors: The column YBWC+ shows the performance we mentioned above. The column YBWC shows the performance of ZUGZWANG without the extensions of the Young Brothers Wait Concept. The reduction of parallelism in the search decreased the search overhead by 15% On the other side the excellent load balancing properties of our algorithm are the reason for the fact, that the work load of the processors is nearly the same.
The column TR gives the performance of our system when the distributed transposition table has the same size as the sequential one. That is we reduced the transposition table size at every processor from 16 ? 104 to 625. The speedup drops from 126 to 100. This is the same loss of speedup that we expected in section 4. in the case that the sequential algorithm would be allowed to use a transposition table of the same size as the distributed algorithm.
Figure 5: Search Overhead for an 8ply search
6. CONCLUSIONS
We have shown that our distributed chess program benefits by the use of a distributed transposition table. On the other side we pointed out that the distributed transposition table is not the only reason for the good behavior of our algorithm. As stated before in [FMM90] the very good load balancing properties of our algorithm guarantee a good work load of the processors even in very large distributed systems. These good load balancing properties enabled us to reduce the parallelism during the game tree search by extending the Young Brothers Wait Concept. This reduction of parallelism results in smaller search overhead. The work load of the processors, however, remains nearly the same.
References
[BK82] I. Bratko and D. Kopec. A Test for Comparison of Human and Computer Performance in Chess, in Advances in Computer Chess 3, M.R.B. Clarke (editor), pages
Figure 6: Processor delay during 7ply and 8ply search
31{56. Pergamon Press, 1982.
[deB46] N.G. deBruijn. A combinatorial problem. Indagationes Math., 8:pp 461{467, 1946.
[FK88] Ch. Ferguson and R.E. Korf. Distributed tree search and its application to alphabeta pruning. Proceedings AAAI88, Seventh National Conference on Artificial Intelligence, 2:pp 128{132, 1988.
[FMM90] R. Feldmann, B Monien, and P. Mysliwietz. A fully distributed chess program. Advances in Computer Chess VI (D. Beal ed.), pages pp 1{27, 1990.
[FMMV89] R. Feldmann, B. Monien, P. Mysliwietz, and O. Vornberger. Distributed gametree search. ICCA Journal, 12(2):pp 65{73, 1989.
[FMMV90] R. Feldmann, B. Monien, P. Mysliwietz, and O. Vornberger. Distributed Game Tree Seach, in Parallel Algorithms for Machine Intelligence and Pattern Recognition, V. Kumar, L.N. Kanal and P.S. Gopalakrishnan (editors). Springer Verlag, 1990.
[Hsu90] F.H. Hsu. Large Scale Parallelization of AlphaBeta Search: An Algorithmic Architectural Study with Computer Chess. PhD thesis, Carnegie Mellon University, Pittsburgh, USA, 1990.
[MOS86] T.A. Marsland, M. Olafsson, and J. Schaeffer. Multiprocessor TreeSearch Experiments, in Advances in Computer Chess 4 D.F. Beal (editor), pages 37{51. Pergamon Press, 1986.
[MP85] T.A. Marsland and F. Popowich. Parallel game tree search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(4):pp 442{452, 1985.
[OF88] S.W. Otto and E.W. Felten. Chess on a hypercube. The Third Conference on Hypercube Concurrent Computers and Applications, 2:pp 1329{1341, 1988.
[Sch89] J. Schaeffer. Distributed gametree searching. Journal of Parallel and Distributed Computing, 6(2):pp 90{114, 1989.
[VM87] O. Vornberger and B. Monien. Parallel alphabeta versus parallel sss*. Proceedings IFIP Conference on Distributed Processing, Distributed Processing, North Holland, pages 613{625, 1987.
YBWC+ YBWC TR
SPE(p) 126.3 118.1 99.8
LD(p) 82.0 84.9 84.5
SO(p) 44.5 60.5 89.6
Figure 7: Performance of the 256 processor system