| ![]() | |||||||||
184 CHAPTER 8: CONCLUSIONS
algorithm to the sequence first, and then produce an automaton from the higher-
level non-terminals. This is a compelling symbiosis: SEQUITUR identifies the
significant segments of the sequences, and the automaton captures the branching
and looping relationships between them. The combination is analogous to the
lexical analyser and parsing passes of a compiler, where the first pass groups
individual characters into meaningful tokens for the second pass. The main problem
with this approach is the ease with which SEQUITUR can be misled to recognise
segments that include more than a single token. Investigating solutions to this
problem is also a direction for future work.
Another technique for generalising grammars involves recognising self-similarity in
the hierarchical structure of a sequence (?recognising self-similarity? in Figure?8.1).
Recursive grammars such as the fractal L-system grammars for producing snowflake
and tree-like figures give rise to similar patterns at different levels of detail. With
the improved inference achieved by retrospective reparsing, this self-similarity is
clearly exhibited in SEQUITUR?s hierarchies. Automatically recognising this
recursive structure consists of finding a unification of two rules that produces a
concise recursive grammar capable of reproducing the original sequence. Again, the
algorithm is simple: it consists of only six Prolog clauses.
The final major technique introduced is a way of encoding a grammar to achieve
data compression (?encoding scheme? in Figure?8.1). This transforms the SEQUITUR
algorithm into a compression scheme that can be evaluated alongside a large cohort
of similar schemes. It outperforms all other techniques in its class, and in several
instances achieves results better than any current compressor. Simply recoding the
final grammar for the sequence does not result in the best performance. Instead,
SEQUITUR operates at both the encoding and decoding ends, and enough
information is transmitted to trigger the two constraints operating at the decoder.
8.2 The applications
The problem of inferring concise and correct grammars from sequences precludes
optimal solutions. It is known from data compression that the problem of finding
the smallest hierarchy for a sequence is NP-complete (Storer, 1982). It is also
known from grammatical inference that no algorithm can guarantee to infer a
grammar from the sequences that it produces (Gold, 1967). Furthermore, in many