TidBITS#472   19990322
 MailBITS/22-Mar-99 Explaining All Those List Headers Power Macintosh G3: The Cannonball Express Foot Notes

## Power Macintosh G3: The Cannonball Express

by Rick Holzgrafe <rick@kagi.com>

The Cannonball Express was the fabled train that was so fast it took three men to say "Here she comes," "Here she is," and "There she goes." Computers are fast too, although unlike trains, most aren't self-propelled. What makes a computer fast, and how much effect does software design have? How much faster are today's computers than yesterday's? Recently I revisited some of these questions, beginning with a trip down memory lane.

Back in the Stone Age -- Twenty years ago, I was teaching myself programming and had access to a DEC PDP-11/60 minicomputer on evenings and weekends. This beast was bigger than a washing machine, and during workdays I shared it with two dozen other technicians and engineers. I found a word puzzle in a magazine and thought it would be fun to program the PDP to solve it. The puzzle was as follows.

Given a phrase and a sheet of graph paper, write the phrase on the graph paper according to these rules:

1. Write one letter per square on the graph paper, like filling in a crossword puzzle. Ignore case and anything that's not one of the 26 letters of the alphabet. The phrase "N. 1st Street" is thus identical to "NSTSTREET".
2. Put the first letter of the phrase in any square you like.
3. After writing any letter, put the next letter of the phrase in any adjacent square. Here, "adjacent" means any of the eight neighboring squares, up, down, left, right, or diagonally. You may reuse a square if it is adjacent and already holds the letter you need; otherwise you must use a blank square. You can't use the same square twice in a row - no "hopping in place" for double letters.

The goal is to write the phrase inside a rectangle of the smallest possible area. (A subtle point: you are not trying to write in a minimal number of squares.) To score your solution, draw the smallest enclosing rectangle you can and take its area. The rectangle may enclose some blank squares; count them, too.

Got it? Tongue-twisters are the most fun because they have lots of opportunities to reuse whole snaky strings of squares. The 37 letters in "Peter Piper picked a peck of pickled peppers" can be packed into a 3 by 5 rectangle, like this (view this in a monospaced font):

```   OFIPT
KCPER
LEDAS
```

In those days I knew computers were "fast" but had no idea how fast. The answer turned out to be "not very." I wrote a program to solve these puzzles and called it Piper after the tongue-twister. I set Piper running on a medium-length phrase on a Friday evening, and came back on Monday to find it still running. It had found several less-than-best solutions but hadn't finished. Way too slow - I found a better solution myself on paper in about half an hour.

Why did it take so long? Piper was a "brute force" program. It tried every possible solution to the problem, one after another. The trouble is that there are too many possible solutions. Exactly how many depends on the phrase, but for any non-trivial phrase the number is astronomical. I realized for the first time that "fast" sometimes isn't "fast enough." This point may be obvious today, when we all use computers and are weary of waiting for them. But in 1979, that PDP-11 was only the second computer I had ever seen!

What Part of Fast Don't You Understand? I saw that I would have to make Piper faster. There are two basic ways to speed up a program. Plan A is to find a better way of solving the problem, but after twenty years I still haven't thought of a better solution. That leaves plan B, the classic efficiency expert's solution: eliminate unnecessary steps. For example, Piper created every possible solution, then calculated the area of each. It built each solution one letter at a time, so instead of taking the area only for completed solutions, I changed Piper to check the area after placing each letter. If placing a letter made the solution-in-progress take up more space than the smallest complete solution found so far, Piper could skip the rest of that solution (and all other solutions that started the same way) and move right on to the next one. This eliminated a huge amount of work and greatly improved Piper's speed. Finding clever ways to track the area of a growing solution helped too, because it was faster than calculating the area from scratch after each letter. I also found a way to calculate a minimum size for the final solution quickly: I couldn't guarantee that the best solution would be that small, but I could guarantee that it wouldn't be smaller. If Piper got lucky and found a solution as small as that calculated minimum, it could stop immediately. Otherwise it would continue on after finding the best solution, vainly seeking a still better one.

Eventually Piper became clever enough to finish that original phrase in a reasonably short period of time. But the holy grail continued to elude me: I wanted a solution for "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" That PDP (and, perhaps, my cleverness) were not up to the task. I had run out of ideas for speeding up Piper, and runs still took longer than a weekend. But if I couldn't improve Piper, I could at least hope to run it on a faster computer.

Big Iron -- People tend to think of processor speed as the speed of a computer, but many factors affect overall performance. Virtual memory lets you work on bigger data sets or on more problems at a time, but it's slow, so adding more physical RAM helps by reducing your reliance on virtual memory. Faster disks and I/O buses load and save data more quickly. RAM disks and disk caching replace slow disk operations with lightning-quick RAM access. Instruction and data caches in special super-fast RAM offer big improvements for some programs. Well-written operating systems and toolboxes can outrun poorly written ones.

But in the end, little of this matters to Piper. Piper has always used only a small amount of data, doesn't read or write the disk after it gets going, and does little I/O of any kind. With its small code and data set Piper can take good advantage of data and instruction caching, but what it mostly needs is "faster hamsters" - a faster processor to make the wheels turn more quickly.

As the years rolled on, I ran versions of Piper on my first Macs, but in the middle 1980's I worked for Apple Computer, and had access to a programmer's dream: Apple's \$15 million Cray Y-MP supercomputer, one of only two dozen in the world and arguably the fastest computer in existence at the time. I figured the Cray would make short work of Piper. But the Cray was not well suited to the problem. It could barrel through parallel-processing floating-point matrix calculations like the Cannonball Express, but Piper was a highly linear, non-mathematical problem. Piper used only one of the Cray's four processors and didn't do the kind of operations at which the Cray excelled. Piper wasn't a fair test of the Cray's power, but the Cray was still the fastest machine I'd ever used. The Cray succeeded where all previous machines (that PDP, my Mac Plus, my Mac II) had failed. It solved "woodchuck" in less than a day, taking only about 20 hours to finish its run. I was awestruck - 20 hours?! I'd no idea that "woodchuck" was that big a problem!

Young Whippersnappers -- I set Piper aside for many years, but recently I began to wonder how a modern desktop box compares to those old minicomputers and mainframes. I rewrote Piper from memory and ran it on my new 400 MHz ice-blue Power Macintosh G3 with "woodchuck." The output is below. Piper first reprints the phrase, then prints solutions and elapsed times as it finds them. Each solution is the best found so far, culminating in the best of all. The times are in seconds from the beginning of the run; the final time is the total run time. (Unfortunately, the best solution for "woodchuck" is larger than Piper's calculated minimum, so Piper continued to run for a bit after finding the best solution.)

Here are the results. Some intermediate solutions have been left out for brevity, but you can see Piper finding ever smaller solutions. In the end, the 57 letters in "woodchuck" are packed into a 4 by 4 rectangle. Have a look at Piper's total run time, and the time needed to find its best solution:

```How much wood would a woodchuck chuck if a woodchuck could chuck wood?

0 seconds:
HDOAIUCOHWUHOD
UCOWFKHDOMCWOW
K   WO

1 seconds:
UCKOHWUHOD
OHDWOMCWOW

2 seconds:
ULCHC
HWUHODKUK
OMCWOWAFI

7 seconds:
HWUHOW
OMCWOD
IKAUCW
FLDHK

9 seconds:
HWUH
OMCW
LUOK
HDOI
CWAF

65 seconds:
HWM
OUC
IKH
FWC
LUO

67 seconds:
HWMU
OOCH
UDWK
LAFI

Total run time: 107 seconds
```

There you have it: a shade over a minute to find the best solution, under two minutes to finish its run. Two minutes! So much for the big iron of the 1980's. My new G3 Mac finished "woodchuck" over 600 times faster (and 5,000 times cheaper) than that 15 megabuck Cray. (For a more realistic comparison, see this description of UCLA's Project Appleseed.)

If you want to check Piper's speed on your Mac, I've placed the code in the public domain; it's a 40K package.

<ftp://ftp.tidbits.com/pub/tidbits/misc/piper.hqx>

The Future -- What's yet to come? 400 MHz already looks a little pokey. It's the best Apple offers today, but I've seen claims of 550 MHz or so from third party accelerators and over-clocking tricks. People are predicting 1 GHz (1,000 MHz) chips for the near future. Buses are getting faster, and caches hold more data in less space and are moving onto the processor chip for still more speed. (Small is fast. Did you know that the speed of light is a serious limiting factor in modern computer design? The closer together the components are, the faster they can signal each other.)

And like the old Cray, multi-processor desktop systems are starting to appear. They gang up on a problem by having separate processors work on different parts of the problem simultaneously. Although I didn't try to use the Cray's extra processors, I've done a little thinking lately. Piper doesn't have to be completely linear. On an eight-processor system, I bet I could come close to making Piper run in one-eighth of the time of a single processor.