page 2  (22 pages)
to previous section1
3to next section

- 2 -

general framework.

In this article, the term pattern is used to mean any sequence of atomic symbols including subsequences within a larger sequence where the symbols in the subsequence are not necessarily contiguous within the larger sequence.

2 MULTIPLE ALIGNMENT PROBLEMS

The term ?multiple alignment? is normally associated with the computational analysis of DNA sequences or sequences of amino acid residues as part of the process of elucidating the structure, functions or evolution of the corresponding molecules.

The general idea is to examine two or more sequences to find one or more alignments of matching bases or amino acid residues which are, in some sense, optimal. An example is presented in Fig. 1.

-------------------------------------------------------------------------
Please insert Fig. 1 about here
-------------------------------------------------------------------------

Fig. 1. An alignment amongst five DNA sequences (adapted from Fig. 6 in [13], with permission from Oxford University Press).

2.1 What is an ?Optimal? Alignment?

Intuitively, an ?optimal? or ?good? alignment amongst two or more sequences of symbols is one which is relatively long, involves a relatively large number of sequences, has few gaps (or no gaps at all) and, where there are gaps, these should be relatively short.

These kinds of measures are used directly in many studies of multiple alignment. But it seems that our intuitions in this area may also be formalised in terms of probabilities and compression. These two kinds of analysis are complementary and not in conflict - they are two sides of the information theory coin. Recent studies in this tradition include [1, 2, 4, 16, 22].

2.2 Searching for Optimal Alignments

In this area of research, it is widely recognised that, when searching for good alignments between two or more sequences, the space of possible alignments is normally too big to be examined exhaustively. For realistically large examples it is necessary to use ?heuristic? techniques (eg ?hill climbing?, ?beam search?) which may be described generically as ?metrics-guided search?. Existing methods are reviewed in [3, 4, 7, 15]. Also relevant is the method for finding alignments between two sequences which is described in [22].

Given the variety of techniques which already exist to find multiple alignments, one might expect to find one that could be incorporated in a model of learning, reasoning and other aspects of computing as discussed in this and other publications in this research programme. However, an examination of the literature has shown that existing techniques do not precisely meet the needs of the SP programme. Given the differences in objectives between biochemistry and the SP programme, this is not very surprising.

Work is now proceeding on a new search method within a new SP model. The new model is not yet sufficiently mature or robust to demonstrate all the points made in this article. However, it is good enough to solve the geometric analogy problem described in Section 5.7 and other simple