4.2 INFERRING L-SYSTEMS FROM EXAMPLE STRINGS 85
Figure?4.6 The first part of the sequence from Figure?4.3b, step three
(a) parsed correctly
(b) parsed incorrectly
(c) an improved parsing using bracket nesting constraint
the substrings that the rules represent. This grammar bears little resemblance to the grammar in 4.3e: it is much larger, and lacks a rule representing the fundamental f[+f]f[?f]f building block. It appears that SEQUITUR is incapable of recognising the structure in the sequence.
The cause of this poor performance is the greedy left-to-right processing. Once SEQUITUR forms a rule, the two symbols in that rule are bound together for the rest of the processing. The rule may be incorporated into a longer rule, but this longer rule will still group the two symbols together. For example, the initial sequence f[+f]f[?f]f should be in a single rule on its own. However, the [ following this sequence means that SEQUITUR takes advantage of the f]f[ repetition and forms a rule for it. Figure?4.6a shows the preferred parse and Figure?4.6b shows the actual parse after SEQUITUR has seen the first 24 symbols. In the preferred parse, one rule covers both instances of f[+f]f[?f]f (only the rules that appear in rule S are shown). In Figure?4.6b the darker grey area denotes the incorrect rule covering f]f[, which spans the end of one rule and the [ from one level above.
The next two sections describe two distinct solutions to this problem: using domain knowledge to aid SEQUITUR?s selection of phrases, and performing retrospective reparsing in order to improve SEQUITUR?s performance.
4.3 Domain knowledge
One solution to the incorrect partitioning of the sequence into rules is to use knowledge about the preferred form of rules to help SEQUITUR choose the correct rules in the first place. In the simple example illustrated in Figure?3.1d, ignoring the first repetition of aa would result in a better overall grammar. If an oracle were