page 1  (12 pages)
2to next section

184 CHAPTER 8: CONCLUSIONS

algorithm to the sequence first, and then produce an automaton from the higher-

level non-terminals. This is a compelling symbiosis: SEQUITUR identifies the

significant segments of the sequences, and the automaton captures the branching

and looping relationships between them. The combination is analogous to the

lexical analyser and parsing passes of a compiler, where the first pass groups

individual characters into meaningful tokens for the second pass. The main problem

with this approach is the ease with which SEQUITUR can be misled to recognise

segments that include more than a single token. Investigating solutions to this

problem is also a direction for future work.

Another technique for generalising grammars involves recognising self-similarity in

the hierarchical structure of a sequence (?recognising self-similarity? in Figure?8.1).

Recursive grammars such as the fractal L-system grammars for producing snowflake

and tree-like figures give rise to similar patterns at different levels of detail. With

the improved inference achieved by retrospective reparsing, this self-similarity is

clearly exhibited in SEQUITUR?s hierarchies. Automatically recognising this

recursive structure consists of finding a unification of two rules that produces a

concise recursive grammar capable of reproducing the original sequence. Again, the

algorithm is simple: it consists of only six Prolog clauses.

The final major technique introduced is a way of encoding a grammar to achieve

data compression (?encoding scheme? in Figure?8.1). This transforms the SEQUITUR

algorithm into a compression scheme that can be evaluated alongside a large cohort

of similar schemes. It outperforms all other techniques in its class, and in several

instances achieves results better than any current compressor. Simply recoding the

final grammar for the sequence does not result in the best performance. Instead,

SEQUITUR operates at both the encoding and decoding ends, and enough

information is transmitted to trigger the two constraints operating at the decoder.

8.2 The applications

The problem of inferring concise and correct grammars from sequences precludes

optimal solutions. It is known from data compression that the problem of finding

the smallest hierarchy for a sequence is NP-complete (Storer, 1982). It is also

known from grammatical inference that no algorithm can guarantee to infer a

grammar from the sequences that it produces (Gold, 1967). Furthermore, in many