View the PDF document On tag insertion and its complexity

Yeates, S. A., Witten, I. H. (2000) Proc PRICAI 2000: International Workshop on Text and Data Mining,edited by A.-H. Tan and P. Yu, Melbourne, Australia, August, pp 52-63.

Many text mining operations can be recast as tag insertion problems - we illustrate several. The size of the search space of the tag insertion problem is explored, using a number of proofs and heuristics: Viterbi Search, One-Tag-at-a-Time and Automatic Tokenisation. These greatly reduce the size of the search space from approximately 10400 to approximately 1011 nodes. Properties of the SGML standard are also used to reduce the complexity of the search. We work through several examples taken from bibliographies, showing search space size and possible errors.