R. Begleiter, R. El-Yaniv and G. Yona
Volume 22, 2004
Links to Full Text:
Journal of Artiï¬cial Intelligence Research 22 (2004) 385-421 Submitted 05/04; published 12/04 On Prediction Using Variable Order Markov Models Ron Begleiter ronbeg@cs.technion.ac.il Ran El-Yaniv rani@cs.technion.ac.il Department of Computer ScienceTechnion - Israel Institute of TechnologyHaifa 32000, Israel Golan Yona golan@cs.cornell.edu Department of Computer ScienceCornell UniversityIthaca, NY 14853, USA Abstract This paper is concerned with algorithms for prediction of discrete sequences over a ï¬nite alphabet, using variable order Markov models. The class of such algorithms is large and inprinciple includes any lossless compression algorithm. We focus on six prominent predictionalgorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM)and Probabilistic Suï¬x Trees (PSTs). We discuss the properties of these algorithms andcompare their performance using real life sequences from three domains: proteins, Englishtext and music pieces. The comparison is made with respect to prediction quality asmeasured by the average log-loss. We also compare classiï¬cation algorithms based on thesepredictors with respect to a number of large protein classiï¬cation tasks. Our results indicatethat a âdecomposedâ CTW (a variant of the CTW algorithm) and PPM outperform allother algorithms in sequence prediction tasks. Somewhat surprisingly, a diï¬erent algorithm,which is a modiï¬cation of the Lempel-Ziv compression algorithm, signiï¬cantly outperformsall algorithms on the protein classiï¬cation problems. 1. Introduction Learning of sequential data continues to be a fundamental task and a challenge in patternrecognition and machine learning. Applications involving sequential data may require pre-diction of new events, generation of new sequences, or decision making such as classiï¬cationof sequences or sub-sequences. The classic application of discrete sequence prediction algo-rithms is lossless compression (Bell et al., 1990), but there are numerous other applicationsinvolving sequential data, which can be directly solved based on eï¬ective prediction of dis-crete sequences. Examples of such applications are biological sequence analysis (Bejerano& Yona, 2001), speech and language modeling (Sch¨ utze & Singer, 1994; Rabiner & Juang, 1993), text analysis and extraction (McCallum et al., 2000) music generation and classiï¬ca-tion (Pachet, 2002; Dubnov et al., 2003), hand-writing recognition (Singer & Tishby, 1994)and behaviormetric identiï¬cation (Clarkson & Pentland, 2001; Nisenson et al., 2003).1 1. Besides such applications, which directly concern sequences, there are other, less apparent applications, such as image and texture analysis and synthesis that can be solved based on discrete sequence predictionalgorithms; see, e.g., (Bar-Joseph et al., 2001). c 2004 AI Access Foundation. All rights reserved.
Begleiter, El-Yaniv & Yona The literature on prediction algorithms for discrete sequences is rather extensive and oï¬ers various approaches to analyze and predict sequences over ï¬nite alphabets. Perhapsthe most commonly used techniques are based on Hidden Markov Models (HMMs) (Rabiner,1989). HMMs provide ï¬exible structures that can model complex sources of sequential data.However, dealing with HMMs typically requires considerable understanding of and insightinto the problem domain in order to restrict possible model architectures. Also, due totheir ï¬exibility, successful training of HMMs usually requires very large training samples(see hardness results of Abe & Warmuth, 1992, for learning HMMs). In this paper we focus on general-purpose prediction algorithms, based on learning Vari- able order Markov Models (VMMs) over a ï¬nite alphabet Σ. Such algorithms attempt tolearn probabilistic ï¬nite state automata, which can model sequential data of considerablecomplexity. In contrast to N -gram Markov models, which attempt to estimate conditionaldistributions of the form P (Ï|s), with s â ΣN and Ï â Σ, VMM algorithms learn such con-ditional distributions where context lengths |s| vary in response to the available statistics inthe training data. Thus, VMMs provide the means for capturing both large and small orderMarkov dependencies based on the observed data. Although in general less expressive thanHMMs, VMM algorithms have been used to solve many applications with notable success.The simpler nature of VMM methods also makes them amenable for analysis, and someVMM algorithms that we discuss below enjoy tight theoretical performance guarantees,which in general are not possible in learning using HMMs. There is an intimate relation between prediction of discrete sequences and lossless com- pression algorithms, where, in principle, any lossless compression algorithm can be usedfor prediction and vice versa (see, e.g., Feder & Merhav, 1994). Therefore, there exist anabundance of options when choosing a prediction algorithm. This multitude of possibilitiesposes a problem for practitioners seeking a prediction algorithm for the problem at hand. Our goal in this paper is to consider a number of VMM methods and compare their performance with respect to a variety of practical prediction tasks. We aim to provideuseful insights into the choice of a good general-purpose prediction algorithm. To thisend we selected six prominent VMM algorithms and considered sequential data from threediï¬erent domains: molecular biology, text and music. We focus on a setting in which alearner has the start of some sequence (e.g., a musical piece) and the goal is to generatea model capable of predicting the rest of the sequence. We also consider a scenario wherethe learner has a set of training samples from some domain and the goal is to predict, asaccurately as possible, new sequences from the same domain. We measure performanceusing log-loss (see below). This loss function, which is tightly related to compression,measures the quality of probabilistic predictions, which are required in many applications. In addition to these prediction experiments we examine the six VMM algorithms with respect to a number of large protein classiï¬cation problems. Protein classiï¬cation is oneof the most important problems in computational molecular biology. Such classiï¬cation isuseful for functional and structural categorization of proteins, and may direct experimentsto further explore and study the functionality of proteins in the cell and their interactionswith other molecules. In the prediction experiments, two of the VMM algorithms we consider consistently achieve the best performance. These are the âdecomposed context tree weighting (CTW)âand âprediction by partial match (PPM)â algorithms. CTW and PPM are well known in the 386
On Prediction Using Variable Order Markov Models lossless compression arena as outstanding players. Our results thus provide further evidenceof the superiority of these algorithms, with respect to new domains and two diï¬erent pre-diction settings. The results of our protein classiï¬cation experiments are rather surprisingas both of these two excellent predictors are inferior to an algorithm, which is obtained bysimple modiï¬cations of the prediction component of the well-known Lempel-Ziv-78 compres-sion algorithm (Ziv & Lempel, 1978; Langdon, 1983). This rather new algorithm, recentlyproposed by Nisenson et al. (2003), is a consistent winner in the all the protein classiï¬cationexperiments and achieves surprisingly good results that may be of independent interest inprotein analysis. This paper is organized as follows. We begin in Section 2 with some preliminary deï¬- nitions. In Section 3, we present six VMM prediction algorithms. In Section 4, we presentthe experimental setup. In Sections 5-6, we discuss the experimental results. Part of theexperimental related details are discussed in Appendices A-C. In Section 7, we discuss therelated work. In Section 8, we conclude by introducing a number of open problems raisedby this research. Note that the source code of all the algorithms we consider is available athttp://www.cs.technion.ac.il/ ronbeg/vmm. 2. Preliminaries Let Σ be a ï¬nite alphabet. A learner is given a training sequence qn1 = q1q2 · · · qn, where qi â Σ and qiqi+1 is the concatenation of qi and qi+1. Based on qn1, the goal is to learn a model Ë P that provides a probability assignment for any future outcome given some past. Speciï¬cally, for any âcontextâ s â Σâ and symbol Ï â Σ the learner should generate aconditional probability distribution Ë P (Ï|s). Prediction performance is measured via the average log-loss ( Ë P , xT1 ) of Ë P (·|·), with respect to a test sequence xT1 = x1 · · · xT , 1 T ( Ë P , xT1 ) = â log Ë P (x T i|x1 · · · xiâ1), (1) i=1 where the logarithm is taken to base 2. Clearly, the average log-loss is directly related to thelikelihood Ë P (xT Ë 1 ) = T P (x i=1 i|x1 · · · xiâ1) and minimizing the average log-loss is completely equivalent to maximizing a probability assignment for the entire test sequence. Note that the converse is also true. A consistent probability assignment Ë P (xT1 ), for the entire sequence, which satisï¬es Ë P (xtâ1 Ë 1 ) = x P (x1 · · · xtâ1xt), for all t = 1, . . . , T , tâΣ induces conditional probability assignments, Ë P (xt|xtâ1 1 ) = Ë P (xt1)/ Ë P (xtâ1 1 ), t = 1, . . . , T. Therefore, in the rest of the paper we interchangeably consider Ë P as a conditional distribu- tion or as a complete (consistent) distribution of the test sequence xT1 . The log-loss has various meanings. Perhaps the most important one is found in its equivalence to lossless compression. The quantity â log Ë P (xi|x1 · · · xiâ1), which is also called the âself-informationâ, is the ideal compression or âcode lengthâ of xi, in bits per symbol,with respect to the conditional distribution Ë P (X|x1 · · · xiâ1), and can be implemented online (with arbitrarily small redundancy) using arithmetic encoding (Rissanen & Langdon, 1979). 387
Begleiter, El-Yaniv & Yona Thus, the average log-loss also measures the average compression rate of the test sequence,when using the predictions generated by Ë P . In other words, a small average log-loss over the xT1 sequence implies a good compression of this sequence.2 Within a probabilistic setting, the log-loss has the following meaning. Assume that the training and test sequences were emitted from some unknown probabilistic source P . Letthe test sequence be given by the value of the sequence of random variables XT 1 = X1 · · · XT . A well known fact is that the distribution P uniquely minimizes the mean log-loss; that is,3 P = arg min âEP log Ë P (XT 1 ) . Ë P Due to the equivalence of the log-loss and compression, as discussed above, the mean ofthe log-loss of P (under the true distribution P ) achieves the best possible compression, orthe entropy HT (P ) = âE log P (XT1 ). However, the true distribution P is unknown and the learner generates the proxy Ë P using the training sequence. The extra loss (due to the use of Ë P , instead of P ) beyond the entropy, is called the redundancy and is given by DT (P || Ë P ) = EP â log Ë P (XT 1 ) â (â log P (X T 1 )) . (2) It is easy to see that DT (P || Ë P ) is the (T th order) Kullback-Leibler (KL) divergence (see Cover & Thomas, 1991, Sec. 2.3). The normalized redundancy DT (P || Ë P )/T (of a sequence of length T ) gives the extra bits per symbol (over the entropy rate) when compressing asequence using Ë P . This probabilistic setting motivates a desirable goal, when devising a general-purpose prediction algorithm: minimize the redundancy uniformly, with respect to all possible dis-tributions. A prediction algorithm for which we can bound the redundancy uniformly, withrespect to all distributions in some given class, is often called universal with respect tothat class. A lower bound on the redundancy of any universal prediction (and compression)algorithm is â¦(K( log T )), where K is (roughly) the number of parameters of the model 2T encoding the distribution Ë P (Rissanen, 1984).4 Some of the algorithms we discuss below are universal with respect to some classes of sources. For example, when an upper boundon the Markov order is known, the ctw algorithm (see below) is universal (with respect toergodic and stationary sources), and in fact, has bounded redundancy, which is close to theRissannen lower bound. 3. VMM Prediction Algorithms In order to assess the signiï¬cance of our results it is important to understand the sometimessubtle diï¬erences between the diï¬erent algorithms tested in this study and their variations.In this section we describe in detail each one of these six algorithms. We have adhered toa uniï¬ed notation schema in an eï¬ort to make the commonalities and diï¬erences betweenthese algorithms clear. 2. The converse is also true: any compression algorithm can be translated into probability assignment Ë P ; see, e.g., (Rissanen, 1984). 3. Note that very speciï¬c loss functions satisfy this property. (See Miller et al., 1993).4. A related lower bound in terms of channel capacity is given by Merhav and Feder (1995). 388
On Prediction Using Variable Order Markov Models Most algorithms for learning VMMs include three components: counting, smoothing (probabilities of unobserved events) and variable-length modeling. Speciï¬cally, all such al-gorithms base their probability estimates on counts of the number of occurrences of symbolsÏ appearing after contexts s in the training sequence. These counts provide the basis forgenerating the predictor Ë P . The smoothing component deï¬nes how to handle unobserved events (with zero value counts). The existence of such events is also called the âzero fre-quency problemâ. Not handling the zero frequency problem is harmful because the log-lossof an unobserved but possible event, which is assigned a zero probability by Ë P , is inï¬nite. The algorithms we consider handle such events with various techniques. Finally, variable length modeling can be done in many ways. Some of the algorithms discussed here construct a single model and some construct several models and average them.The models themselves can be bounded by a pre-determined constant bound, which meansthat the algorithm does not consider contexts that are longer than the bound. Alternatively,models may not be bounded a-priori, in which case the maximal context size is data-driven. There are a great many VMM prediction algorithms. In fact, any lossless compression algorithm can be used for prediction. For the present study we selected six VMM predictionalgorithms, described below. We attempted to include algorithms that are considered to betop performers in lossless compression. We, therefore, included the âcontext tree weighting(CTW)â and âprediction by partial match (PPM)â algorithms. The âprobabilistic suï¬x tree(PST)â algorithm is well known in the machine learning community. It was successfullyused in a variety of applications and is hence included in our set. To gain some perspectivewe also included the well known lz78 (prediction) algorithm that forms the basis of manycommercial applications for compression. We also included a recent prediction algorithmfrom Nisenson et al. (2003) that is a modiï¬cation of the lz78 prediction algorithm. Thealgorithms we selected are quite diï¬erent in terms of their implementations of the threecomponents described above. In this sense, they represent diï¬erent approaches for VMMprediction.5 3.1 Lempel-Ziv 78 (LZ78) The lz78 algorithm is among the most popular lossless compression algorithms (Ziv & Lem-pel, 1978). It is used as the basis of the Unix compress utility and other popular archivingutilities for PCs. It also has performance guarantees within several analysis models. Thisalgorithm (together with the lz77 compresion method) attracted enormous attention andinspired the area of lossless compression and sequence prediction. The prediction component of this algorithm was ï¬rst discussed by Langdon (1983) and Rissanen (1983). The presentation of this algorithm is simpliï¬ed after the well-known lz78compression algorithm, which works as follows, is understood. Given a sequence qn1 â Σn, lz78 incrementally parses qn1 into non-overlapping adjacent âphrasesâ, which are collected into a phrase âdictionaryâ. The algorithm starts with a dictionary containing the emptyphrase . At each step the algorithm parses a new phrase, which is the shortest phrasethat is not yet in the dictionary. Clearly, the newly parsed phrase s extends an existing 5. We did not include in the present work the prediction algorithm that can be derived from the more recent bzip compression algorithm (see http://www.digistar.com/bzip2), which is based on the successfulBurrows-Wheeler Transform (Burrows & Wheeler, 1994; Manzini, 2001). 389
Begleiter, El-Yaniv & Yona dictionary phrase by one symbol; that is, s = sÏ, where s is already in the dictionary(s can be the empty phrase). For compression, the algorithm encodes the index of s(among all parsed phrases) followed by a ï¬xed code for Ï. Note that coding issues will notconcern us in this paper. Also observe that lz78 compresses sequences without explicitprobabilistic estimates. Here is an example of this lz78 parsing: if q11 1 = abracadabra, then the parsed phrases are a|b|r|ac|ad|ab|ra. Observe that the empty sequence is always in the dictionary and is omitted in our discussions. An lz78-based prediction algorithm was proposed by Langdon (1983) and Rissanen (1983). We describe separately the learning and prediction phases.6 For simplicity we ï¬rstdiscuss the binary case where Σ = 0, 1, but the algorithm can be naturally extended toalphabets of any size (and in the experiments discussed below we do use the multi-alphabetalgorithm). In the learning phase the algorithm constructs from the training sequence qn1 a binary tree (trie) that records the parsed phrases (as discussed above). In the tree wealso maintain counters that hold statistics of qn1. The initial tree contains a root and two (left and right) leaves. The left child of a node corresponds to a parsing of â0â and the rightchild corresponds to a parsing of â1â. Each node maintains a counter. The counter in a leafis always set to 1. The counter in an internal node is always maintained so that it equalsthe sum of its left and right child counters. Given a newly parsed phrase s , we start at theroot and traverse the tree according to s (clearly the tree contains a corresponding path,which ends at a leaf). When reaching a leaf, the tree is expanded by making this leaf aninternal node and adding two leaf-sons to this new internal node. The counters along thepath to the root are updated accordingly. To compute the estimate Ë P (Ï|s) we start from the root and traverse the tree according to s. If we reach a leaf before âconsumingâ s we continue this traversal from the root,etc. Upon completion of this traversal (at some internal node, or a leaf) the prediction forÏ = â0â is the â0â (left) counter divided by the sum of â0â and â1â counters at that node, etc. For larger alphabets, the algorithm is naturally extended such that the phrases are stored in a multi-way tree and each internal node has exactly k = |Σ| children. In addition,each node has k counters, one for each possible symbol. In Figure 1 we depict the resultingtree for the training sequence q11 1 = abracadabra and calculate the probability Ë P (b|ab), assuming Σ = a, b, c, d, r. Several performance guarantees were proven for the lz78 compression (and prediction) algorithm. Within a probabilistic setting (see Section 2), when the unknown source isstationary and ergodic Markov of ï¬nite order, the redundancy is bounded above by (1/ ln n)where n is the length of the training sequence (Savari, 1997). Thus, the lz78 algorithm isa universal prediction algorithm with respect to the large class of stationary and ergodicMarkov sources of ï¬nite order. 3.2 Prediction by Partial Match (PPM) The Prediction by Partial Match (ppm) algorithm (Cleary & Witten, 1984) is considered tobe one of the best lossless compression algorithms.7 The algorithm requires an upper bound 6. These âphasesâ can be combined and operated together online.7. Speciï¬cally, the ppm-ii variant of ppm currently achieves the best compression rates over the standard Calgary Corpus benchmark (Shkarin, 2002). 390
On Prediction Using Variable Order Markov Models (a, 1) (a, 1) (b, 1) (b, 5) (c, 1) (d, 1) (r, 1) (a, 1) (b, 1) (c, 5) (c, 1) (a, 17) (d, 1) (r, 1) (a, 1) (b, 1) (c, 1) (d, 5) (d, 1) (r, 1) (r, 1) (a, 1) (b, 1) (b, 5) (c, 1) (d, 1) (r, 1) (c, 1) (d, 1) (a, 1) (b, 1) (a, 5) (c, 1) (d, 1) (r, 1) (b, 1) (r, 9) (c, 1) (d, 1) (r, 1) Figure 1: The tree constructed by lz78 using the training sequence q11 1 = abracadabra. The algorithm extracts the set a,b,r,ac,ad,ab,ra of âphrasesâ (this set istraditionally called a âdictionaryâ) and constructs the phrase tree accordingly.Namely, there is an internal node for every string in the dictionary, and everyinternal node has all |Σ| children. For calculating Ë P (b|ar) we traverse the tree as follows: âaârâ âb and conclude with Ë P (b|ar) = 5/33 = 0.152. To calculate Ë P (d|ra) we traverse the tree as follow: ârâaâd and conclude with Ë P (d|ra) = 1/5 = 0.2. D on the maximal Markov order of the VMM it constructs. ppm handles the zero frequencyproblem using two mechanisms called escape and exclusion. There are several ppm variantsdistinguished by the implementation of the escape mechanism. In all variants the escapemechanism works as follows. For each context s of length k ⤠D, we allocate a probabilitymass Ë Pk(escape|s) for all symbols that did not appear after the context s (in the training sequence). The remaining mass 1 â Ë Pk(escape|s) is distributed among all other symbols that have non-zero counts for this context. The particular mass allocation for âescapeâ andthe particular mass distribution Ë Pk(Ï|s), over these other symbols Ï, determine the ppm 391
Begleiter, El-Yaniv & Yona variant. The mechanism of all ppm variants satisï¬es the following (recursive) relation,   Ë if sn Ï apeared in nâD+1 Ë P ), P (Ï|sn D(Ï|sn nâD+1 the training sequence; nâD+1) = (3)  Ë PD(escape|sn ) · Ë P (Ï|sn ), nâD+1 nâD+2 otherwise. For the empty context , ppm takes Ë P (Ï| ) = 1/|Σ|.8 The exclusion mechanism is used to enhance the escape estimation. It is based on the observation that if a symbol Ï appears after the context s of length k ⤠D, there is noneed to consider Ï as part of the alphabet when calculating Ë Pk(·|s ) for all s suï¬x of s (see Equation (3)). Therefore, the estimates Ë Pk(·|s) are potentially more accurate since they are based on a smaller (observed) alphabet. The particular ppm variant we consider here is called âMethod Câ (ppm-c) and is deï¬ned as follows. For each sequence s and symbol Ï let N (sÏ) denote the number of occurrencesof sÏ in the training sequence. Let Σs be the set of symbols appearing after the context s(in the training sequence); that is, Σs = Ï : N(sÏ) > 0. For ppm-c we thus have Ë N (sÏ) Pk(Ï|s) = , if Ï â Σ |Σ s; (4) s| + N (sÏ ) Ï âΣs Ë |Σ P s| k(escape|s) = , (5) |Σs| + N (sÏ ) Ï âΣs where we assume that |s| = k. This implementation of the escape mechanism by Moï¬at(1990) is considered to be among the better ppm variants, based on empirical examination(Bunton, 1997). As noted by Witten and Bell (1991) there is no principled justiï¬cation forany of the various ppm escape mechanisms. One implementation of ppm-c is based on a trie data structure. In the learning phase the algorithm constructs a trie T from the training sequence qn1. Similar to the lz78 trie, each node in T is associated with a symbol and has a counter. In contrast to the unbounded lz78trie, the maximal depth of T is D + 1. The algorithm starts with a root node correspondingto the empty sequence and incrementally parses the training sequence, one symbol at a time. Each parsed symbol qi and its D-sized context, xiâ1 , deï¬ne a potential path iâD in T , which is constructed, if it does not yet exist. Note that after parsing the ï¬rst Dsymbols, each newly constructed path is of length D + 1. The counters along this path areincremented. Therefore, the counter of any node, with a corresponding path sÏ (where Ïis the symbol associated with the node) is N (sÏ). Upon completion, the resulting trie induces the probability Ë P (Ï|s) for each symbol Ï and context s with |s| ⤠D. To compute Ë P (Ï|s) we start from the root and traverse the tree according to the longest suï¬x of s, denoted s , such that s Ï corresponds to a completepath from the root to a leaf. We then use the counters N (s Ï ) to compute Σs and theestimates as given in Equations (3), (4) and (5). The above implementation (via a trie) is natural for the learning phase and it is easy to see that the time complexity for learning qn1 is O(n) and the space required for the 8. It also makes sense to use the frequency count of symbols over the training sequence. 392
On Prediction Using Variable Order Markov Models worst case trie is O(D · n). However, a straightforward computation of Ë P (Ï|s) using the trie incurs O(|s|2) time.9 In Figure 2 we depict the resulting tree for the training sequenceq11 1 = abracadabra and calculate the probabilities Ë P (b|ar) and Ë P (d|ra), assuming Σ = a,b,c,d,r. (b, 2) (r, 2) (c, 1) (a, 1) (Ï = a, N (sÏ) = 5) (d, 1) (a, 1) (r, 2) (b, 2) (a, 2) (a, 1) (c, 1) (d, 1) (a, 1) (b, 1) (d, 1) (a, 1) (c, 1) (r, 1) Figure 2: The tree constructed by ppm-c using the training sequence q11 1 = abracadabra. Ë P (d|ra) = Ë P (escape|ra) · Ë P (d|a) = 1 · 1 = 0.07 without using the exclusion 2 4+3 mechanism. When using the exclusion, observe that N (rac) > 0; therefore, wecan exclude c when evaluating Ë P (d|ra) = Ë P (escape|ra) · Ë P (d|a) = 1 · 1 = 0.08. 2 6 3.3 The Context Tree Weighting Method (CTW) The Context Tree Weighting Method (ctw) algorithm (Willems et al., 1995) is a stronglossless compression algorithm that is based on a clever idea for combining exponentiallymany VMMs of bounded order.10 In this section we consider the original ctw algorithmfor binary alphabets. In the next section we discuss extensions for larger alphabets. If the unknown source of the given sample sequences is a âtree sourceâ of bounded order, then the compression redundancy of ctw approaches the Rissanen lower bound atrate 1/n.11 A D-bounded tree source is a pair (M, ÎM ) where M is set of sequences of 9. After the construction of the trie it is possible to generate a corresponding suï¬x tree that allows a rapid computation of Ë P (Ï|s) in O(|s|) time. It is possible to generate this suï¬x tree in time proportional to the sum of all paths in the trie. 10. The paper presenting this idea (Willems et al., 1995) received the 1996 Paper Award of the IEEE Information Theory Society. 11. Speciï¬cally, for Σ = 0, 1, Rissanenâs lower bound on the redundancy when compressing xn1 is â¦(|M |( log n )). The guaranteed ctw redundancy is O(|M |( log n ) + 1 |M | log |M |), where M is the suï¬x 2n 2n n set of the unknown source. 393
Begleiter, El-Yaniv & Yona length ⤠D. This set must be a complete and proper suï¬x set, where âcompleteâ meansthat for any sequence x of length ⥠D, there exists s â M , which is a suï¬x of x. âProperâmeans that no sequence in M has a strict suï¬x in M . The set ÎM contains one probabilitydistribution over Σ for each sequence in M . A proper and complete suï¬x set is called amodel. Given some initial sequence (âstateâ) s â M , the tree source can generate a randomsequence x by continually drawing the next symbol from the distribution associated withthe unique suï¬x of x. Any full binary tree of height ⤠D corresponds to one possible D-bounded model M . Any choice of appropriate probability distributions ÎM deï¬nes, together with M, a uniquetree source. Clearly, all prunings of the perfect binary tree12 of height D that are full binarytrees correspond to the collection of all D-bounded models. For each model M , ctw estimates a set of probability distributions, ÎM = Ë PM (·|s)sâM , such that (M, ÎM ) is a tree source. Each distribution Ë PM (·|s) is a smoothed version of a maximum likelihood estimate based on the training sequence. The core idea of the ctwalgorithm is to generate a predictor that is a mixture of all these tree sources (correspondingto all these D-bounded models). Let MD be the collection of all D-bounded models. Forany sequence xT1 the ctw estimated probability for xT1 is given by Ë Pctw(xT1 ) = w(M ) · Ë PM (xT1 ); (6) M âMD T Ë P Ë M (xT 1 ) = PM (xi|suï¬xM (xiâ1 )), (7) iâD i=1 where suï¬xM (x) is the (unique) suï¬x of x in M and w(M) is an appropriate probabilityweight function. Clearly, the probability Ë PM (xT1 ), given in (7), is a product of independent events, by the deï¬nition of a tree source. Also note that we assume that the sequence xT1 has an âhistoricalâ context deï¬ning the conditional probabilities Ë P (xi|suï¬xM (xiâ1 )) of the iâD ï¬rst symbols. For example, we can assume that the sequence xT1 is a continuation of the training sequence. Alternatively, we can ignore (up to) the ï¬rst D symbols of xT1 . A more sophisticated solution is proposed by Willems (1998); see also Section 7. Note that there is a huge (exponential in D) number of elements in the mixture (6), corresponding to all bounded models. To describe the ctw prediction algorithm we needto describe (i) how to (eï¬ciently) estimate all the distributions Ë PM (Ï|s); (ii) how to select the weights w(M ); and (iii) how to eï¬ciently compute the ctw mixture (6). For each model M , ctw computes the estimate Ë PM (·|s) using the Krichevsky-Troï¬mov (KT) estimator for memoryless binary sources (Krichevsky & Troï¬mov, 1981). The KT-estimator enjoys an optimal proven redundancy bound of 2+log n . This estimator, which is 2n very similar to Laplaceâs law of succession, is also very easy to calculate. Given a trainingsequence q = , the KT-estimator, based on q, is equivalent to Ë N P 0(q) + 1 2 kt(0|q) = ; N0(q) + N1(q) + 1 Ë Pkt(1|q) = 1 â Ë Pkt(0|q), 12. In a perfect binary tree all internal nodes have two sons and all leaves are at the same level. 394
On Prediction Using Variable Order Markov Models where N0(q) = N(0) and N1(q) = N(1). That is, Ë Pkt(0|q) is the KT-estimated probability for â0â, based on a frequency count in q.13 For any sequence s deï¬ne subs(q) to be the unique (non-contiguous) sub-sequence of q consisting of the symbols following the occurrences of s in q. Namely, it is the concatenationof all symbols Ï (in the order of their occurrence in q) such that sÏ is a substring of q. Forexample, if s = 101 and q = 101011010, then sub101(q) = 010. Using this deï¬nition wecan calculate the KT-estimated probability of a test sequence xT1 according to a context s, based on the training sequence q. Letting qs = subs(q) we have, T Ë P s Ë kt(xT 1 |q) = Pkt(xi|qs). i=1 For the empty sequence we have Ë P skt( |q) = 1. One of the main technical ideas in the ctw algorithm is an eï¬cient recursive computa- tion of Equation (6). This is achieved through the following recursive deï¬nition. For any ssuch that |s| ⤠D, deï¬ne 1 Ë Ë Ë P s P 0s P s 2 kt(xT 1 |q) + 1 2 ctw(xT 1 ) Ë P 1s ctw(xT 1 ), if |s| < D; ctw(xT 1 ) = Ë (8) P skt(xT1 |q), otherwise (|s| = D), where q is, as usual, the training sequence. As shown by Willems et al. (1995, Lemma 2),for any training sequence q and test sequence x, Ë Pctw(x) is precisely the ctw mixture as given in Equation (6), where the probability weights w(M ) are14 w(M ) = 2âCD(M); CD(M) = | s â M | â 1 + |s â M : |s| < D|. In the learning phase, the ctw algorithm processes a training sequence qn1 and con- structs a binary context tree T . An outgoing edge to a left son in T is labelled by â0â andan outgoing edge to a right son, by â1â. Each node s â T is associated with the sequencecorresponding to the path from this node to the root. This sequence (also called a âcontextâ)is also denoted by s. Each node maintains two counters N0 and N1 that count the numberof â0âs (resp. â1âs) appearing after the contexts s in the training sequence. We start with aperfect binary tree of height D and set the counters in each node to zero. We process thetraining sequence qn1 by considering all n â D contexts of size D and for each context we update the counters of the nodes along the path deï¬ned by this context. Upon completionwe can estimate the probability of a test sequence xT1 using the relation (8) by computing Ë Pctw(xT1 ). This probability estimate for the sequence xT1 induces the conditional distribu- tions Ë Pctw(xi|xiâ1 ), as noted in Section 2. For example, in Figure 3 we depict the ctw iâD+1 tree with D = 2 constructed according to the training sequence q91 = 101011010. 13. As shown by Tjalkens et al. (1997), the above simple form of the KT estimator is equivalent to the original form, given in terms of a Dirichlet distribution by Krichevsky and Troï¬mov (1981). The (original) KT-estimated probability of a sequence containing N (1âθ)N0 θN1 0 zeros and N1 ones is 1 â dθ. 0 Ï Î¸(1âθ) 14. There is a natural coding for D-bounded models such that the length of the code in bits for a model M is exactly CD(M). 395
Begleiter, El-Yaniv & Yona This straightforward implementation requires O(|Σ|D) space (and time) for learning and O(T · |Σ|D) for predicting xT1 . Clearly, such time and space complexities are hopeless even for moderate D values. We presented the algorithm this way for simplicity. However, asalready mentioned by Willems et al. (1995), it is possible to obtain linear time complexitiesfor both training and prediction. The idea is to only construct tree paths that occur inthe training sequence. The counter values corresponding to all nodes along unexploredpaths remain zero and, therefore, the estimated ctw probabilities for entire unvisited sub-trees can be easily computed (using closed-form formulas). The resulting time (and space)complexity for training is O(Dn) and for prediction, O(T D). More details on this eï¬cientimplementation are nicely explained by Sadakane et al. (2000) (see also Tjalkens & Willems,1997). s N Ë 0 N1 P s kt (q) Ë P s ctw (q) 0 3 4 63/7680 21/1024 0 0 0 3 21/64 21/64 1 3 1 7/160 3/80 1 00 0 0 1 1 0 10 0 3 21/64 21/64 1 01 2 1 1/16 1/16 11 1 0 1/2 1/2 1 Figure 3: A ctw tree (D = 2) corresponding to the training sequence q91 = 101011010, where the ï¬rst two symbols in q (q21) are neglected. We present, for each tree node, the values of the counters N0, N1 and the estimations Ë Pkt(0|·) and Ë Pctw(0|·). For example, for the leaf 10: N0 = 0 since 0 does not appear af-ter 10 in q91 = 101011010; 1 appears (in q91) three times after 10, therefore,N1 = 3; Ë Pkt(0|10) = 0+12 = 1; because 10 is a leaf, according to Equation (8) 0+3+1 8 Ë Pkt(0|10) = Ë Pctw(0|10). 3.4 CTW for Multi-Alphabets The above ctw algorithm for a binary alphabet can be extended in a natural manner tohandle sequences over larger alphabets. One must (i) extend the KT-estimator for largeralphabets; and (ii) extend the recurrence relation (8). While these extensions can be easilyobtained, it is reported that the resulting algorithm preforms poorly (see Volf, 2002, chap. 4).As noted by Volf, the reason is that the extended KT-estimator does not provide eï¬cientsmoothing for large alphabets. Therefore, the problem of extending the ctw algorithm forlarge alphabets is challenging. Several ctw extensions for larger alphabets have been proposed. Here we consider two. The ï¬rst is a naive application of the standard binary ctw algorithm over a binaryrepresentation of the sequence. The binary representation is naturally obtained when thesize of the alphabet is a power of 2. Suppose that k = log2 |Ï| is an integer. In thiscase we generate a binary sequence by concatenating binary words of size k, one for eachalphabet symbol. If log2 |Ï| is not an integer we take k = log2 |Ï| . We denote the resulting 396
On Prediction Using Variable Order Markov Models algorithm by bi-ctw. A more sophisticated binary decomposition of ctw was consideredby Tjalkens et al. (1997). There eight binary machines were simultaneously constructed,one for each of the eight binary digits of the ascii representation of text. This decompositiondoes not achieve the best possible compression rates over the Calgary Corpus. The second method we consider is Volfâs âdecomposed ctwâ, denoted here by de-ctw (Volf, 2002). The de-ctw uses a tree-based hierarchical decomposition of the multi-valuedprediction problem into binary problems. Each of the binary problems is solved via a slightvariation of the binary ctw algorithm. Let Σ be an alphabet with size k = |Σ|. Consider a full binary tree TΣ with k leaves. Each leaf is uniquely associated with a symbol in Σ. Each internal node v of TΣ deï¬nes thebinary problem of predicting whether the next symbol is a leaf on vâs left subtree or a leafon vâs right subtree. For example, for Σ = a,b,c,d,r, Figure 4 depicts a tree TΣ suchthat the root corresponds to the problem of predicting whether the next symbol is a or oneof b,c,d and r. The idea is to learn a binary predictor, based on the ctw algorithm, foreach internal node. In a simplistic implementation of this idea, we construct a binary ctw for each internal node v â TΣ. We project the training sequence over the ârelevantâ symbols (i.e., corre-sponding to the subtree rooted by v) and translate the symbols on vâs left (resp., right)sub-tree to 0s (resp., 1s). After training we predict the next symbol Ï by assigning eachsymbol a probability that is the product of binary predictions along the path from the rootof TΣ to the leaf labeled by Ï. Unfortunately, this simple approach may result in poor performance and a slight mod- iï¬cation is required.15 For each internal node v â TΣ, let ctwv be the associated binarypredictor designed to handle the alphabet Σv â Σ. Instead of translating the projectedsequence (projected over Σv) to a binary sequence, as in the simple approach, we constructctwv based on a |Σv|-ary context tree. Thus, ctwv still generates predictions over a bi-nary alphabet but expands a suï¬x tree using the (not necessarily binary) Σv alphabet. Togenerate binary predictions ctwv utilizes, in each node of the context tree, two countersfor computing binary KT-estimations (as in the standard ctw algorithm). For example, inFigure 4(b) we depict ctw3 whose binary problem is deï¬ned by TΣ of Figure 4(a). Anothermodiï¬cation suggested by Volf, is to use a variant on the KT-estimator. Volf shows thatthe estimator Pe(0|q) = N0(q)+ 12 achieves better results in practice. We used this N0(q)+N1(q)+1/8 estimator for the de-ctw implementation.16 A major open issue when applying the de-ctw algorithm is the construction of an eï¬ective decomposition tree TΣ. Following (Volf, 2002, Chapter 5), we employ the heuristicnow detailed. Given a training sequence qn1, the algorithm takes TΣ to be Hoï¬manâs code- tree of qn1 (see Cover & Thomas, 1991, chapter 5.6) based on the frequency counts of the symbols in qn1. 15. Consider the following example. Let Σ = a,b,z and consider a decomposition TΣ with two internal nodes where the binary problem of the root discriminates between a,b and z (and the other internalnode corresponds to distinguishing between a and b). Using the simplistic approach, if we observe thetraining sequence azaz · · · azazb, we will assign very high probability to z given the context zb, which isnot necessarily supported by the training data. 16. We also used this estimator in the protein classiï¬cation experiment. 397
Begleiter, El-Yaniv & Yona The time and space complexity of the de-ctw algorithm, based on the eï¬cient imple- mentation of the binary ctw, are O(|Σ|Dn) for training and and O(|Σ|DT ) for prediction. ctw4 ctw3 ctw2 predictor training ctw1 sequence DCTW abracadabra c ctw1 abracadabra ctw d 2 brcdbr right ctw r 3 rcdr ctw b 4 cd a left (a) c d CT W3 (using q = rcdr) s N N Ë P s r c,d r kt (q) Ë P s ctw (q) c 1 1 1/2 1/2 c 1 0 1/2 1/2 c d 0 1 1/2 1/2 d r 0 0 1 1 rc 1 0 1/2 1/2 d cd 0 1 1/2 1/2 r cc 0 0 1 1 .. . . . . . .. .. .. .. r c rr 0 0 1 1 d r (b) Figure 4: A de-ctw tree corresponding to the training sequence q11 1 = abracadabra. (a) depicts the decomposition tree T . Each internal node in T uses a ctw predictorto âsolveâ a binary problem. In (b) we depict ctw3 whose binary problem is:âdetermine if Ï â c,d (or Ï = r)â. Tree paths of contexts that do not appearin the training sequence are marked with dashed lines. 3.5 Probabilistic Suï¬x Trees (PST) The Probabilistic Suï¬x Tree (pst) prediction algorithm (Ron et al., 1996) attempts toconstruct the single âbestâ D-bounded VMM according to the training sequence. It isassumed that an upper bound D on the Markov order of the âtrue sourceâ is known to thelearner. A pst over Σ is a non empty rooted tree, where the degree of each node varies between zero (for leaves) and |Σ|. Each edge in the tree is associated with a unique symbol in Σ.These edge labels deï¬ne a unique sequence s for each path from a node v to the root. Thesequence s labels the node v. Any such pst tree induces a âsuï¬x setâ S consisting of the 398
On Prediction Using Variable Order Markov Models labels of all the nodes. The goal of the pst learning algorithm is to identify a good suï¬xset S for a pst tree and to assign a probability distribution Ë P (Ï|s) over Σ, for each s â S. Note that a pst tree may not be a tree source, as deï¬ned in Section 3.3. The reason is thatthe set S is not necessarily proper. We describe the learning phase of the pst algorithm,which can be viewed as a three stage process. 1. First, the algorithm extracts from the training sequence qn1 a set of candidate contexts and forms a candidate âsuï¬x setâ Ë S. Each contiguous sub-sequence s of qn1 of length ⤠D, such that the frequency of s in qn1 is larger than a user threshold will be in Ë S. Note that this construction guarantees that Ë S can be accommodated in a (pst) tree (if a context s appears in Ë S, then all suï¬xes of s must also be in Ë S). For each candidate s we associate the conditional (estimated) probability Ë P (Ï|s), based on a direct maximum likelihood estimate (i.e., a frequency count). 2. In the second stage, each s in the candidate set Ë S is examined using a two-condition test. If the context s passes the test, s and all its suï¬xes are included in the ï¬nal psttree. The test has two conditions that must hold simultaneously: (i) The context s is âmeaningfulâ for some symbol Ï; that is, Ë P (Ï|s) is larger than a user threshold. (ii) The context s contributes additional information in predicting Ï relative to its âparentâ; if s = ÏkÏkâ1 · · · Ï1, then its âparentâ is its longest suï¬x s = Ë Ï P (Ï|s) kâ1 · · · Ï1. We require that the ratio be larger than a user threshold r or Ë P (Ï|s ) smaller than 1/r. Note that if a context s passes the test there is no need to examine any of its suï¬xes(which are all added to the tree as well). 3. In the ï¬nal stage, the probability distributions associated with the tree nodes (con- texts) are smoothed. If Ë P (Ï|s) = 0, then it is assigned a minimum probability (a user-deï¬ned constant) and the resulting conditional distribution is re-normalized. Altogether the pst learning algorithm has ï¬ve user parameters, which should be selectedbased on the learning problem at hand. The pst learning algorithm has a PAC styleperformance guarantee ensuring that with high probability the redundancy approaches zeroat rate 1/n1/c for some positive integer c.17 An example of the pst tree constructed for theabracadabra training sequence is given in Figure 5. The pst learning phase has a time complexity of O(Dn2) and a space complexity of O(Dn). The time complexity for calculating Ë P (Ï|s) is O(D). 3.6 LZ-MS: An Improved Lempel-Ziv Algorithm There are plenty of variations on the classic lz78 compression algorithm (see Bell et al.,1990, for numerous examples). Here we consider a recent variant of the lz78 prediction 17. Speciï¬cally, the result is that, with high probability (⥠1 â δ), the Kullback-Leibler divergence between the pst estimated probability distribution and the true underlying (unknown) VMM distribution is notlarger than ε after observing a training sequence whose length is polynomial in 1 , 1 , D, |Σ| and the ε δ number of states in the unknown VMM model. 399
Begleiter, El-Yaniv & Yona ca (.001, .001, .001, .996, .001) da a (.001, .996, .001, .001, .001) (.001, .498, .25, .25, .001) ra (.001, .001, .996, .001, .001) b (.001, .001, .001, .001, .996) (.45, .183, .092, .092, .183) c (.996, .001, .001, .001, .001) d (.996, .001, .001, .001, .001) r (.996, .001, .001, .001, .001) Figure 5: A pst tree corresponding to the training sequence x11 1 = abracadabra and the pa- rameters (Pmin = 0.001, γmin = 0.001, α = 0.01, r = 1.05, D = 12). Each node islabelled with a sequence s and the distribution ( Ë Ps(a), Ë Ps(b), Ë Ps(c), Ë Ps(d), Ë Ps(r)). algorithm due to Nisenson et al. (2003). The algorithm has two parameters M and S and,therefore, its acronym here is lz-ms. A major advantage of the lz78 algorithm is its speed. This speed is made possible by compromising on the systematic counts of all sub-sequences. For example, in Figure 1, thesub-sequence br is not parsed and, therefore, is not part of the tree. However, this sub-sequence is signiï¬cant for calculating Ë P (a|br). While for a very long training sequence this compromise will not aï¬ect the prediction quality signiï¬cantly, for short training sequencesthe lz78 algorithm yields sparse and noisy statistics. Another disadvantage of the lz78algorithm is its loss of context when calculating Ë P (Ï|s). For example, in Figure 1, when taking Ï = b and s = raa, the lz78 prediction algorithm will compute Ë P (b|raa) by travers- ing the tree along the raa path (starting from the root) and will end on a leaf. Therefore, Ë P (b|raa) = Ë P (b| ) = 5/33. Nevertheless, we could use the suï¬x a of raa to get the more accurate value Ë P (b|a) = 5/17. These are two major deï¬ciencies of lz78. The lz-ms al- gorithm attempts to overcome both these disadvantages by introducing two correspondingheuristics. This algorithm improves the lz78 predictions by extracting more phrases duringlearning and by ensuring a minimal context for the next phrase, whenever possible. The ï¬rst modiï¬cation is termed input shifting. It is controlled by the S parameter and used to extract more phrases from the training sequence. The training sequence qn1 = q1q2 · · · qn is parsed S + 1 times, where in the ith parsing the algorithm learns the sequenceqiqi+1 · · · qn using the standard lz78 learning procedure. The newly extracted phrases andcounts are combined with the previous ones (using the same tree). Note that by taking 400
On Prediction Using Variable Order Markov Models S = 0 we leave the lz78 learning algorithm intact. For example, the second row of Table 1illustrates the eï¬ect of input shifting with S = 1, which clearly enriches the set of parsedphrases. The second modiï¬cation is termed back-shift parsing and is controlled by the M pa- rameter. Back-shift parsing attempts to guarantee a minimal context of M symbols whencalculating probabilities. In the learning phase the algorithm back-shifts M symbols aftereach phrase extraction. To prevent the case where more symbols are back-shifted thanparsed (which can happen when M > 1), the algorithm requires that back shifting remainwithin the last parsed phrase. For example, on the third row of Table 1 we see the resultingphrase set extracted from the training sequence using M = 1. Observe that after extract-ing the ï¬rst phrase (a) lz-ms back-shifts one symbol, therefore, the next parsed symbol is(again) âaâ. This implies that the next extracted phrase is ab. Since lz-ms back-shifts aftereach phrase extraction we conclude with the resulting extracted phrase set. The back-shift parsing mechanism may improve the calculation of conditional proba- bilities, which is also modiï¬ed as follows. Instead of returning to the root after travers-ing to a leaf, the last M -traversed symbols are ï¬rst traced down from the root to somenode v, and then the new traversal begins from v (if v exists). Take, for example, thecalculation of Ë P (b|raa) using the tree in Figure 1. If we take M = 0, then we end with Ë P (b|raa) = Ë P (b| ) = 5/33; on the other hand, if we take M = 1, we end with Ë P (b|raa) = Ë P (b|a) = 5/17. lz78(M, S) Phrases parsed from abracadabra lz78(0, 0) = lz78 a,b,r,ac,ad,ab,ra lz78(0, 1) a,b,r,ac,ad,ab,ra,br,aca,d,abr lz78(1, 0) a,ab,b,br,r,ra,ac,c,ca,ad,d,da,abr lz78(1, 1) a,ab,b,br,r,ra,ac,c,ca,ad,d,da,abr,bra,aca,ada,abra lz78(2, 0) a,ab,abr,b,br,bra,r,ra,rac,ac,aca,c,ca,cad,ad,ada,d,da,dab,abra lz78(2, 1) a,ab,abr,b,br,bra,r,ra,rac,ac,aca,c,ca,cad,ad,ada,d,da,dab,abra,brac,acad,adab lz78(2, 2) a,ab,abr,b,br,bra,r,ra,rac,ac,aca,c,ca,cad,ad,ada,d,da,dab,abra,brac,acad,adab,raca,cada,dabr Table 1: Phrases parsed from q11 1 = abracadabra by lz-ms for diï¬erent values of M and S. Note that the phrases appear in the order of their parsing. Both âinput shiftingâ and âback-shift parsingâ tend to enhance the statistics we extract from the training sequence. In addition, back-shift parsing tends to enhance the utilizationof the extracted statistics. Table 1 shows the phrases parsed from q11 1 = abracadabra by lz-ms for several values of M and S. The phrases appear in the order of their parsing.Observe that each of these heuristics can introduce a slight bias into the extracted statistics. It is interesting to compare the relative advantage of input shifting and back-shift pars- ing. In Appendix C we provide such a comparison using a prediction simulation over theCalgary Corpus. The results indicate that back-shift on its own can provide more power tolz78 than input shifting parsing alone. Nevertheless, the utilization of both mechanisms isalways better than using either alone. 401
Begleiter, El-Yaniv & Yona The time complexity of the lz-ms learning algorithm is at most M S times the complexity of lz78. Prediction using lz-ms can take at most M times the prediction using lz78. 4. Experimental Setup We implemented and tested the six VMM algorithms described in Section 3 using discretesequences from three domains: text, music and proteins (see more details below). Thesource code of Java implementations for the six algorithms is available at http://www.cs.technion.ac.il/ ronbeg/vmm. We considered the following prediction setting. Thelearner is given the ï¬rst half of a sequence (e.g., a song) and is required to predict the rest.We denote the ï¬rst half by qn1 = q1 · · · qn and the second, by xn1 = x1 · · · xn. qn1 is called the training sequence and xn1 is called the test sequence. During this learning stage the learner constructs a predictor, which is given in terms of a conditional distribution Ë P (Ï|s). The predictor is then ï¬xed and tested over xn1. The quality of the predictor is measured via the average log-loss of Ë P (·|·) with respect to xn1 as given in Equation (1).18 For the protein dataset we also considered a diï¬erent, more natural setting in which the learner is providedwith a number of training sequences emanating from some unknown source, representingsome domain. The learner is then required to provide predictions for sequences from thesame domain. During training, the learner should utilize the training sequence for learning a proba- bilistic model, as well as for selecting the best possible hyper-parameters. Each algorithmselects its hyper-parameters from a cross product combination of âfeasibleâ values, usinga cross-validation (CV) scheme over the training sequence. In our experiments we useda ï¬ve-fold CV. The training sequence qn1 was segmented into ï¬ve (roughly) equal sized contiguous non-overlapping segments and each fold used one such segment for testing anda concatenation of rest of the segments for training. Before starting the experiments weselected, for each algorithm, a set of âfeasibleâ values for its vector of hyper-parameters.Each possible vector (from this set) was tested using ï¬ve-fold CV scheme (applied over thetraining sequence!) yielding ï¬ve loss estimates for this vector. The parameter vector withthe best median19 (over the ï¬ve estimates) was selected for training the algorithm over theentire training sequence. See Appendix A for more details on the exact choices of hyper-parameters for each of the algorithms. Note that our setting is diï¬erent from the standardsetup used when testing online lossless compression schemes, where the compressor startsprocessing the test sequence without any prior knowledge. Also, it is usually the case thatcompression algorithms do not optimize their hyper-parameters (e.g., for text compressionppm-c is usually applied with a ï¬xed D = 5). The sequences we consider belong to three diï¬erent domains: English text, music pieces and proteins. 18. Cover and King (1978) considered a very similar prediction game in their well-known work on the estimation of the entropy of English text. 19. We used the median rather than average since the median of a small set of numbers is considerably more robust against outliers; see, e.g., http://standards.nctm.org/document/eexamples/chap6/6.6/ 402
On Prediction Using Variable Order Markov Models Domain Sequences |Σ| Sequence Lengths Min Avg Max Text 18 256 11954 181560 785396 Music 285 256 24 12726 193662 Protein 19676 20 21 187 8599 Table 2: Some essential properties of the datasets. For the protein prediction setup we considered only sequences of size larger than 100. ⢠For the English text we chose the well-known âCalgary Corpusâ (Bell et al., 1990), which is traditionally used for benchmarking lossless compression algorithms.20 ⢠The music set was assembled from MIDI ï¬les of music pieces.21 The musical bench- mark was compiled using a variety of well-known pieces of diï¬erent styles. The styleswe included are: classical, jazz and rock/pop. All the pieces we consider are poly-phonic (played with several instruments simultaneously). Each MIDI ï¬le for a poly-phonic piece consists of several channels (usually one channel per instrument). Weconsidered each channel as an individual sequence; see Appendix B for more details. ⢠The protein set includes proteins (amino-acid sequences) from the well-known Struc- tural Classiï¬cation of Proteins (SCOP) database (Murzin et al., 1995). Here we usedall sequences in release 1.63 of SCOP, after eliminating redundancy (i.e., only onecopy was retained of sequences that are 100% identical). This set is hierarchicallyclassiï¬ed into classes according to biological considerations. We relied on this classiï¬-cation for testing classiï¬ers constructed from the VMM algorithms over a number oflarge classiï¬cation tasks (see Section 6). Table 2 summarizes some statistics of the three benchmark tests. The three datasets canbe obtained at http://www.cs.technion.ac.il/ ronbeg/vmm. 5. Prediction Results Here we present the prediction results of the six VMM algorithms for the three domains. Theperformance is measured via the log-loss as discussed above. Recall that the average log-lossof a predictor (over some sequence) is a slightly optimistic estimate of the compression rate(bits/symbol) that could be achieved by the algorithm over that sequence. Table 3 summarizes the results for the textual domain (Calgary Corpus). We see that de-ctw is the overall winner with an average loss of 3.02. The runner-up is ppm-c withan average loss of 3.03. However, there is no statistically signiï¬cant diï¬erence between the 20. The Calgary Corpus is available at ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus. Note that this corpus contains, in addition to text, also a few binary ï¬les and some source code ofcomputer programs, written in a number of programming languages. 21. MIDI (= Musical Instrument Digital Interface) is a standard protocol and language for electronic musical instruments. A standard MIDI ï¬le includes instructions on which notes to play and how long and loudto play them. 403
Begleiter, El-Yaniv & Yona two, as indicated by standard error of the mean (SEM) values. The worst algorithm is thelz78 with an average loss of 3.76. We see that in most cases de-ctw and ppm-c share theï¬rst and second best results, lz-ms is a runner-up in ï¬ve cases, and lz78 wins one case andis a runner-up in one case. pst is the winner in a single case. It is interesting to compare the loss results of Table 3 to known compression results of (some of) the algorithms for the same corpus, which are 2.14 bps for de-ctw and 2.48bps for ppm-c (see Volf, 2002; Cleary & Teahan, 1997, respectively). While these resultsare dramatically better than the average log-loss we obtain for these algorithms, note thatthe setting is considerably diï¬erent and in the compression applications the algorithmsare continually learning the sequence. A closer inspection of this disparity between thecompression results and our training/test results reveals that the diï¬erence is in fact notthat large. Speciï¬cally, the large average log-loss is mainly due to one of the Calgary Corpusï¬les, namely the obj1 ï¬le for which the log-loss of ppm-c is 6.75, while the compressionresults of Cleary and Teahan (1997) are 3.76 bps. Without obj1 ï¬le the average log-lossobtained by ppm-c on the Calgary Corpus is 2.81. Moreover, the 2.48 result is obtained on asmaller corpus that does not include the four ï¬les paper3 -6. Without these ï¬les the averagelog-loss of ppm-c is 2.65. This small diï¬erence may indicate that the involved sequences arenot governed by a stationary distribution. Notice also that the compression results indicatethat de-ctw is signiï¬cantly better than ppm-c. However, in our setting these algorithmsperform very similarly. Next we present the results for the music MIDI ï¬les, which are summarized in Table 4. The table has four sections corresponding to classical pieces, jazz, 14 improvisations of thesame jazz piece and rock/pop. Individual average losses for these sections are speciï¬edas well. de-ctw is the overall winner and ppm-c is the overall runner-up. Both thesealgorithms are signiï¬cantly better than the rest. The worst algorithm is, again, lz78. Some observations can be drawn about the âcomplexityâ of the various pieces by con- sidering the results of the best predictors with respect to the individual sequences. Takingthe de-ctw average losses, in the classical music domain we see that the Mozart pieces arethe easiest to predict (with an average loss of at most 0.93). The least predictable piecesare Debussyâs Children Corner 6 (1.86) and Beethovenâs Symphony 6(1.78 loss). The mostpredictable piece overall is the rock piece âYou really got meâ by the Kinks. The leastpredictable rock piece is âSunshine of your loveâ by Cream. The least predictable pieceoverall is the jazz piece âSatin doll 2â by Duke Ellington. The most predictable jazz pieceis âThe girl from Ipanemaâ by Jobim. Among the improvisations of âAll of meâ, thereis considerable variability starting with 0.5 loss (All of me 12) and ending with 1.65 loss(All of me 8). We emphasize that such observations may have a qualitative value due tothe great variability among diï¬erent arrangements of the same musical piece (particularlyjazz tunes involving improvisation). The readers are encouraged to listen to these piecesand judge the âcomplexityâ of these pieces for themselves. MIDI ï¬les of all the pieces areavailable at http://www.cs.technion.ac.il/ ronbeg/vmm. The prediction results for the protein corpus appear in Table 5. Here ppm-c is the overall winner and de-ctw is the runner-up. The diï¬erence between them and the other 404
On Prediction Using Variable Order Markov Models Sequence bi-ctw de-ctw lz78 lz-ms ppm-c pst (length·103)bib(111) 2.15 1.8* 3.25 2.26 1.91 2.57 book1(785) 2.29 2.2* 3.3 2.57 2.27 2.5 book2(611) 2.5 2.36* 3.4 2.63 2.38 2.77 geo(102) 4.93 4.47* 4.48 4.48 4.62 4.86 news(377) 3.04 2.81* 3.72 3.07 2.82 3.4 obj1(22) 7 6.82 5.94* 6 6.75 6.37 obj2(247) 3.92 3.56 3.99 3.48 3.45* 3.72 paper1(53) 3.48 2.95* 4.08 3.38 3.08 3.93 paper2(82) 2.75 2.49* 3.63 2.81 2.53 3.19 paper3(47) 2.96 2.59* 3.82 3.09 2.75 3.4 paper4(13) 3.87 3.63 4.13 3.61 3.36* 3.99 paper5(12) 4.57 4.5 4.51 4.16 3.92* 4.6 paper6(38) 3.9 3.21* 4.22 3.6 3.31 4 pic(513) 0.75 0.71 0.78 0.79 0.73 0.7* progc(40) 3.28 2.85* 3.95 3.22 2.94 3.71 progl(72) 3.2 2.85* 3.7 3.26 2.99 3.66 progp(49) 2.88 2.53 3.24 2.65 2.52* 2.95 trans(94) 2.65 2.08* 3.56 2.61 2.29 3.01 Average±SEM 3.34 ± 0.07 3.02 ± 0.07â 3.76 ± 0.05 3.2 ± 0.06 3.03 ± 0.07 3.52 ± 0.06 Table 3: Average log-loss (equivalent to bits/symbol) of the VMM algorithms over the Calgary Corpus. For each sequence (row) the winner is starred and appears inboldface. The runner-up appears in boldface. algorithms is not very large, with the exception of the pst algorithm, which suï¬ered asigniï¬cantly larger average loss.22 The striking observation, however, is that none of the algorithms could beat a trivial prediction based on a (zero-order) âbackgroundâ distribution of amino acid frequencies asmeasured over a large database of proteins. The entropy of this background distributionis 4.19. Moreover, the entropy of a yet more trivial predictor based on the uniform distri-bution over an alphabet of size 20 is 4.32 â log2(20). Thus, not only could the algorithmsnot outperform these two trivial predictors, they generated predictions that are sometimesconsiderably worse. These results indicate that the ï¬rst half of a protein sequence does not contain predictive information about its second half. This is consistent with the general perception that proteinchains are not repetitive 23. Rather, usually they are composed of several diï¬erent elementscalled domains, each one with own speciï¬c function and unique source distribution (Rose, 22. The failure of pst in this setting could be a result of an inadequate set of possible values for its hyper- parameters; see Appendix A for details on the choices made. 23. This is not always true. It is not unusual to observe similar patterns (though not identical) within the same protein. However, this phenomenon is observed for a relatively small number of proteins. 405
Begleiter, El-Yaniv & Yona Sequence bi-ctw de-ctw lz78 lz-ms ppm-c pst Goldberg Variations 1.09 1* 1.59 1.28 1.04 1.15 Toccata and Fuga 1.17 1.08* 1.67 1.34 1.14 1.37 for Elise 1.76 1.57 2.32 1.83 1.57 2.08 Beethoven - Symphony 6 1.91 1.78* 2.2 1.99 1.79 2.26 Chopin - Etude 1 op. 10 1.14 1.07* 1.75 1.36 1.09 1.24 Chopin - Etude 12 op. 10 1.67 1.54* 2.11 1.93 1.73 1.72 Childrenâs Corner - 1 1.28 1.23* 1.53 1.52 1.3 1.39 Childrenâs Corner - 6 2.03 1.86* 2.28 2.13 1.97 2.28 Mozart K.551 1.04 0.85* 1.64 1.18 0.96 1.45 Mozart K.183 1.04 0.93* 1.71 1.2 0.99 1.7 Rachmaninov Piano Concerto 2 1.16 1.06* 1.8 1.35 1.13 1.34 Rachmaninov Piano Concerto 3 1.91 1.75 2.09 1.98 1.82 2.1 Average Classical 1.43 ± 0.02 1.31 ± 0.02 1.89 ± 0.02 1.59 ± 0.02 1.38 ± 0.02 1.67 ± 0.03 Giant Steps 1.51 1.33 2.01 1.29 1.24* 1.82 Satin Doll 1 1.7 1.41* 2.28 1.73 1.47 1.89 Satin Doll 2 2.56 2.27* 2.53 2.54 2.38 3.32 The Girl from Ipanema 1.4 1.11* 1.95 1.5 1.2 1.36 7 Steps to Heaven 1.7 1.42* 2.1 1.87 1.53 1.86 Stolen Moments 1.81 1.24* 2.28 1.59 1.37 1.8 Average Jazz 1.78 ± 0.06 1.46 ± 0.06â 2.19 ± 0.04 1.76 ± 0.04 1.53 ± 0.06 2.01 ± 0.09 All of Me 1 1.97 1.43* 2.33 1.97 1.73 1.88 All of Me 2 2.51 1.05* 2.52 1.9 1.65 1.28 All of Me 3 1.29 1.08* 1.97 1.42 1.17 1.45 All of Me 4 1.83 1.4* 2.12 1.88 1.46 2.32 All of Me 5 0.92 0.78* 1.72 1.15 0.9 0.96 All of Me 6 1.62 1.21* 2.21 1.55 1.28 1.52 All of Me 7 1.97 1.65* 2.35 1.94 1.75 2.44 All of Me 8 1.88 1.61* 2.29 1.97 1.67 2.21 All of Me 9 1.79 1.58* 2.2 1.97 1.7 2.13 All of Me 10 1.32 1.07* 1.96 1.42 1.16 1.4 All of Me 11 1.61 1.54* 2.26 1.93 1.72 1.68 All of Me 12 0.79 0.5* 1.8 0.92 0.54 0.57 All of Me 13 0.97 0.68* 1.78 1.12 0.7 0.83 All of Me 14 1.85 1.63* 2.2 1.89 1.68 2.17 Average Jazz Impro. 1.59 ± 0.17 1.23 ± 0.09â 2.12 ± 0.09 1.65 ± 0.11 1.37 ± 0.12 1.63 ± 0.15 Donât Stop till You Get Enough 1.03 0.64* 1.82 1.07 0.69 0.72 Sir Duke 1.14 0.47 2 0.99 0.59 0.35* Letâs Dance 1.22 0.94* 1.95 1.33 1.06 1.28 Like a Rolling Stone 1.63 1.45* 2.07 1.7 1.48 1.83 Sunshine of Your Love 1.68 1.47* 2.16 1.81 1.58 2.01 You Really Got Me 0.31 0.16* 1 0.49 0.23 0.17 Average Rock/Pop 1.17 ± 0.05 0.85 ± 0.04â 1.83 ± 0.04 1.23 ± 0.04 0.94 ± 0.04 1.06 ± 0.06 Average±SEM 1.49 ± 0.08 1.21 ± 0.05â 2.00 ± 0.05 1.56 ± 0.05 1.30 ± 0.06 1.59 ± 0.08 Table 4: Music MIDI sequences. Average log-loss (equivalent to bits/symbol) of the VMM algorithms over the music corpus. The table is partitioned into four sections: classical pieces, jazz pieces, 14 improvisations of the same jazz piece âAll of meâ and rock pieces. Each number is an average loss corresponding to several MIDI channels (see Appendix B). For each sequence (row) the winner is starred and appears in boldface. The runner-up appears in boldface. 406
On Prediction Using Variable Order Markov Models 1979; Lesk & Rose, 1981; Holm & Sander, 1994). These unique distributions are usuallymore statistically diï¬erent from each other (in terms of the KL-divergence) than from thebackground or uniform distribution. Therefore, a model that is trained on the ï¬rst halfof a protein is expected to perform worse than the trivial predictor - which explains theslight increase in the average code length. A similar ï¬nding was observed by Bejerano andYona (2001) where models that were trained over speciï¬c protein families coded unrelatednon-member proteins using an average code length that was higher than the entropy of thebackground distribution. This is also supported by other papers that suggest that proteinsequences are incompressible (e.g. Nevill-Manning & Witten, 1999). Sequences Class bi-ctw de-ctw lz78 lz-ms ppm-c pst A (1310) 4.85 ± 0.003 4.52 ± 0.007 4.80 ± 0.005 4.79 ± 0.007 4.45 ± 0.005â 8.07 ± 0.01 B (1604) 4.87 ± 0.003 4.55 ± 0.006 4.83 ± 0.005 4.85 ± 0.007 4.47 ± 0.005â 8.06 ± 0.009 C (3991) 4.85 ± 0.002 4.49 ± 0.003 4.89 ± 0.003 4.92 ± 0.004 4.43 ± 0.002â 8.05 ± 0.005 D (2325) 4.85 ± 0.003 4.65 ± 0.006 4.91 ± 0.004 4.98 ± 0.005 4.54 ± 0.004â 8.13 ± 0.008 E (402) 4.88 ± 0.005 4.36 ± 0.006 4.78 ± 0.009 4.83 ± 0.008 4.33 ± 0.004â 7.92 ± 0.01 F (157) 4.94 ± 0.015 4.37 ± 0.017 4.71 ± 0.014 4.78 ± 0.017 4.34 ± 0.013â 8.00 ± 0.023 G (16) 4.79 ± 0.018â 5.00 ± 0.052 5.08 ± 0.075 5.03 ± 0.105 4.82 ± 0.047 8.45 ± 0.13 Average±SEM 4.86 ± 0.007 4.56 ± 0.014 4.85 ± 0.16 4.88 ± 0.022 4.48 ± 0.011â 8.09 ± 0.028 Table 5: Protein sequences average prediction loss. A trivial distribution assignment based on a âbackgroundâ amino-acid frequency (measured over a large protein database)has an entropy of 4.19. Note also that maximal entropy of any distribution isbounded above by the entropy of the uniform distribution 4.32 â log2(20). Foreach sequence (row) the winner is starred and appears in boldface. The runner-upappears in boldface. The last row represents the averages and standard error ofthe means. 6. Protein Classiï¬cation For protein sequences, a more natural setup than the sequence prediction setting is theclassiï¬cation setting. Biologists classify proteins into relational sets such as âfamiliesâ, âsu-perfamiliesâ, âfold familiesâ and âclassesâ; these terms were coined by Dayhoï¬ (1976) andLevitt and Chothia (1976). Being able to classify a new protein in its proper group canhelp to elucidate relationships between novel genes and existing proteins and characterizeunknown genes. Prediction algorithms can be used for classiï¬cation using a winner-takes-all (WTA) approach. Consider a training set of sequences belonging to k classes C1, . . . , Ck. For eachclass Ci we train a sequence predictor Ë Pi. Given a test sequence x we classify it as class c = arg max Ë i Pi(x). We tested the WTA classiï¬ers derived from the above VMM algorithms over several large protein classiï¬cation problems. These problems consider the same protein sequencesmentioned in the above prediction experiment. For classiï¬cation we used the partition intosubsets given in the SCOP database. Speciï¬cally, the SCOP database (Murzin et al., 1995) 407
Begleiter, El-Yaniv & Yona classiï¬es proteins into a hierarchy of âfamiliesâ, âsuperfamiliesâ, âfoldsâ and âclassesâ whereeach âclassâ is composed of several âfoldsâ, each âfoldâ is composed of several âsuperfamiliesâand each âsuperfamilyâ is composed of several âfamiliesâ. This classiï¬cation was done man-ually based on both sequence and structural similarities, resulting in seven major âclassesâ(at the top level), 765 âfoldsâ,24 1231 âsuperfamiliesâ and 2163 âfamiliesâ (release 1.63). The seven major SCOP classes are quite distinct and can be easily distinguished (Ku- marevel et al., 2000; Jin et al., 2003). Here we focused on the next level in this hierarchy,the âfoldâ level. Our task was to classify proteins, that belong to the same âclassâ, intotheir unique âfoldâ. Since there are seven âclassesâ in the SCOP database, we have sevenindependent classiï¬cation problems. The number of âfoldsâ (i.e., machine learning classes)in each âclassâ appear in Table 6. For example, the protein âclassâ D contains 111 âfoldsâ.The classiï¬cation problem corresponding to D is, therefore, a multi-class problem with 111(machine learning) classes. Consider one (of the seven) classiï¬cation problems. For eachâfoldâ, a statistical model was trained from a training subset of sequences that belong tothat âfoldâ. Thus, the number of models we learn is the number of âfoldsâ (111 in the DâClassâ). The resulting classiï¬er is a WTA combination of all these models. The test set forthis classiï¬er contains all the proteins from the same SCOP âclassâ that do not appear inany training set. Of the 765 âfoldsâ, we considered only âfoldsâ with at least 10 members. VMM models were trained (one for each âfoldâ) and tested. The test results are grouped and averaged perSCOP âclassâ. The number of âfoldsâ in these classes varies between 12-111 (with an averageof 58 âfoldsâ). As before, the experiment was repeated ï¬ve times using ï¬ve-fold CV. We measured the performance using the average over the zero-one loss (zero loss for a correct classiï¬ca-tion). Within each fold we ï¬ne-tuned the hyper-parameters of the algorithm using the sameâinternalâ ï¬ve-fold cross validation methodology as described in Section 4. In this setup we used a slightly diï¬erent pst variant (denoted as pstâ) based on (Bejer- ano & Yona, 2001).25 The pstâ variant is basically a standard pst with the following twomodiï¬cations: (i) A context is chosen to be a candidate according to the number of timesit appears in diï¬erent âfoldâ sequences; (ii) A context s is added to the tree if some symbolappears after s at least some predeï¬ned number of times. We present the classiï¬cation-experiment results in Table 6. The consistent winner here is the lz-ms algorithm, which came ï¬rst in all of the classiï¬cation problems, diï¬eringsubstantially from the runner-ups, de-ctw and ppm-c. In general, all the VMM algorithmsâ classiï¬cations had a good average success rate (1â error) that ranges between 76% (for pstâ) to 86% (for lz-ms). The latter result is quitesurprising, considering that the class deï¬nition is based on structural properties that arenot easily recognizable from the sequence. In fact, existing methods of sequence comparisonusually fail to identify relationships between sequences that belong to diï¬erent superfamilies 24. Note the diï¬erence between the SCOP âclassesâ and the term class we use in classiï¬cation problems; similarly, note the diï¬erence between the SCOP âfoldsâ (that are subsets of the SCOP âclassesâ) from thefolds we use in cross-validation. Throughout our discussion, the biological âclassesâ and âfoldsâ are cited. 25. The results of the original pst algorithm were poor (probably due to an inadequate set of values for the hyper-parameters). Recall that the pstâ (and all other algorithms) hyper-parameter values were chosenbefore starting the experiments, following the description given in Section 4. 408
On Prediction Using Variable Order Markov Models within the same fold. The success rate of the lz-ms algorithm in this task suggests thatthe algorithm might be very eï¬ective in other protein classiï¬cation tasks. Protein âfoldsâ bi-ctw de-ctw lz78 lz-ms ppm-c pstâ âClassâ A 86 0.26 ± 0.035 0.19 ± 0.031 0.28 ± 0.035 0.16 ± 0.031 0.21 ± 0.031 0.22 ± 0.009 B 75 0.23 ± 0.031 0.17 ± 0.031 0.24 ± 0.040 0.14 ± 0.031 0.18 ± 0.035 - C 78 0.27 ± 0.031 0.22 ± 0.031 0.27 ± 0.040 0.17 ± 0.035 0.21 ± 0.031 - D 111 0.24 ± 0.040 0.19 ± 0.031 0.26 ± 0.040 0.16 ± 0.031 0.18 ± 0.035 0.20 ± 0.03 E 17 0.12 ± 0.031 0.09 ± 0.031 0.1 ± 0.044 0.05 ± 0.017 0.08 ± 0.017 0.16 ± 0.076 F 12 0.26 ± 0.031 0.21 ± 0.044 0.19 ± 0.031 0.17 ± 0.03 0.24 ± 0.067 0.24 ± 0.062 G 29 0.33 ± 0.044 0.25 ± 0.035 0.31 ± 0.05 0.16 ± 0.017 0.23 ± 0.035 0.36 ± 0.044 Average 0.24 0.18 0.23 0.14 0.19 0.24 W. Average 0.25 0.19 0.25 0.15 0.19 0.28 Table 6: Protein corpus results â classiï¬cation error rate±SEM. ï¬ve-fold cross validation results. Best results appear in boldface. W. Average stands for âweighted averageâwhere the weights are the relative sizes of the protein classes. The success of the algorithms in the classiï¬cation setup contrasts with their poor pro- tein prediction rates (given in Table 5). To investigate this phenomenon, we conductedanother protein prediction experiment based on the classiï¬cation setup. In each fold in thisexperiment (among the ï¬ve-CV), a predictor Ë Pi was trained using the concatenation of all training sequences in class Ci and was tested against each individual sequence in the testof class Ci. The results of this prediction experiment are given in Table 7. Here we can seethat all of the algorithms yield signiï¬cantly smaller losses than the loss of the trivial pro-tein âbackgroundâ distribution. Here again, ppm and de-ctw are favorites with signiï¬cantstatistical diï¬erence between them. 7. Related work In this section we brieï¬y discuss some results that are related to the present work. Werestrict ourselves to comparative studies that include algorithms we consider here and todiscussions of some recent extensions of these algorithms. We also discuss some resultsconcerning compression and prediction of proteins. The ppm idea and its overwhelming success in benchmark tests has attracted con- siderable attention and numerous ppm variants have been proposed and studied. ppm-d(Howard, 1993) is very similar to to ppm-c, with a slight modiï¬cation of the escape mech-anism; see Section 3.2. The ppmâ variant eliminates the need to specify a maximal order(Cleary & Teahan, 1997). For predicting the next symbolâs probability, ppmâ uses âdeter-ministicâ contexts, contexts that give probability one to some symbols. If no such contextexists, then ppmâ use the longest possible context (i.e., the original ppm strategy). One ofthe most successful ppm variants over the Calgary Corpus is the ppm-z variant of Bloom 409
Begleiter, El-Yaniv & Yona Sequences bi-ctw de-ctw lz78 lz-ms ppm-c pstâ Class A 2.74 ± 0.043 1.95 ± 0.033â 3.94 ± 0.015 2.29 ± 0.034 1.96 ± 0.036 2.97 ± 0.049 B 2.57 ± 0.033 2.14 ± 0.024 3.76 ± 0.013 2.18 ± 0.026 1.92 ± 0.028â - C 2.63 ± 0.036 2.07 ± 0.028 4.10 ± 0.009 2.30 ± 0.031 1.93 ± 0.030â - D 2.59 ± 0.038 1.59 ± 0.032â 3.65 ± 0.021 2.15 ± 0.031 1.83 ± 0.031 2.64 ± 0.037 E 1.52 ± 0.105 1.24 ± 0.081 3.47 ± 0.052 1.49 ± 0.075 1.16 ± 0.081â 2.20 ± 0.105 F 2.99 ± 0.155 2.34 ± 0.121 3.92 ± 0.054 2.49 ± 0.13 2.15 ± 0.135â 3.04 ± 0.141 G 3.49 ± 0.057 2.30 ± 0.057â 4.04 ± 0.025 2.88 ± 0.058 2.53 ± 0.057 3.56 ± 0.051 Average±SEM 2.64 ± 0.066 1.94 ± 0.053 3.84 ± 0.027 2.25 ± 0.055 1.92 ± 0.056â 2.88 ± 0.076 Table 7: Protein prediction based on the classiï¬cation setup. The classes B and C results of pst are missing. We terminated these runs after a week of computation. Observethe diï¬erence between these and the results of Table 5. Here all of the algo-rithms beat the two trivial predictors (âbackgroundâ frequency and the uniformdistribution) mentioned in caption of Table 5. (1998). The ppm-z variant introduces two improvements to the ppm algorithm: (i) estimat-ing the best model order, for each encoded symbol; (ii) adding a dynamic escape estimationfor diï¬erent contexts. To date, the most successful ppm variant is the âcomplicated PPMwith information-inheritanceâ (ppm-ii) of Shkarin (2002), which achieves a compression rateof 2.041. Currently, this is the lowest published rate over the Calgary Corpus. The ppm-iialgorithm relies on an improved escape mechanism similar to the one used by ppm-z to-gether with a diï¬erent counting heuristic that aims to reduce the statistical sparseness thatoccurs in large Markovian contexts. Traditionally, the comparison between ppm variants (and other lossless compression al- gorithms) is over the Calgary Corpus. Other corpora for lossless compression were proposedand are available. Two examples are the Canterbury Corpus and the âLarge CanterburyCorpusâ (Arnold & Bell, 1997) and the Silesia Compression Corpus (Deorowicz, 2003), whichcontains signiï¬cantly larger ï¬les than both the Calgary and Canterbury corpora.26 Bunton (1997) provides a (Calgary Corpus) comparison between ppm-c, ppm-d and ppmâ (with its âCâ and âDâ versions). The average compression rates achieved by thesealgorithm are: 2.417, 2.448, 2.339 and 2.374, respectively, with a ï¬xed order bound D = 9for ppm-c and ppm-d. According to Bloom27, his latest version of ppm-z achieves an averagerate of 2.086. Bunton also proposes various improvements for the ppm schemes. Her bestimproved ppm achieves a 2.177 rate.28 The general-purpose ppm scheme turns out to be rather ï¬exible and allows adaptations to particular domains. For example, Nevill-Manning and Witten (1999) achieve a slightlybetter rate than the trivial (log2(20)) one for proteins, using a specialized variant of ppm-d 26. These corpora are available at http://corpus.canterbury.ac.nz/ and http://sun.iinf.polsl. gliwice.pl/ sdeor/corpus.htm, respectively. 27. See http://www.cbloom.com/src/ppmz.html28. At the time (1996) this variant was probably the most successful ppm version. 410
On Prediction Using Variable Order Markov Models that uses prior amino-acids frequencies to enhance the estimation. By combining domainknowledge into ppm using preprocessing and ppm that predicts over dual-alphabets, Drinicand Kirovski (2002) develop a specialized algorithm for compressing computer executableï¬les (.exe). A ppm variant for compressing XML ï¬les was proposed by Cheney (2001). Atext-mining utilization of the ppm algorithm was tested by Witten et al. (1999) and a fewtext classiï¬cation setups were studied by Thaper (2001), Teahan and Harper (2001). Context Tree Weighting (ctw), with its proven redundancy bound, has also attracted much attention. However, unlike the ppm approach, there are not as many empirical studiesof this algorithm. A possible reason is that the original binary ctw (and straightforward ex-tensions to larger alphabets) did not achieve the best lossless compression results (Tjalkenset al., 1997). The work of Volf, which extends the binary ctw to general alphabets, re-sulting in the de-ctw algorithm (see Section 3.4), showed that ctw can achieve excellentcompression rates. Since then, the ctw scheme has been extended in several directions.Perhaps the most important extension is a modiï¬ed ctw scheme that does not requirethe maximal order hyper-parameter (D). The redundancy of this modiï¬ed algorithm isbounded in a similar manner as the original version. This more recent ctw variant is thusuniversal with respect to the class of stationary and ergodic Markov sources of ï¬nite order. Specialized ctw variants were proposed for various domains. For text, Aberg and Shtarkov (1997) proposed to replace the ppm estimator with the KT-estimator in the ctwmodel. They show that this combination of ppm-d and ctw outperforms ppm-d. Thisidea was examined by Sadakane et al. (2000) for both text and biological sequences. Theirresults also support the observation that the combination of a ppm predictor with the ctwmodel outperforms ppm-d compression both for textual data as well as for compressingDNA. We are familiar with a number of extensions of the pst scheme. Singer (1997) extends the prediction scheme for sequence transduction and also utilizes the ctw model averaging idea.Bejerano and Yona (2001) incorporated biological prior knowledge into pst prediction andshowed that the resulting pst-based similarity measure can outperform standard methodssuch as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that istrained from a multiple alignment of the input sequences, while being much faster. Finally,Kermorvant and Dupont (2002) propose to improve the smoothing mechanism used in theoriginal pst algorithm. Experiments with a protein problem indicate that the proposedsmoothing method can help. As mentioned earlier, there are many extensions of the lz78 algorithm. See (Nelson & Gailly, 1996) for more details. It is interesting to note that Volf considered an âensembleâconsisting of both de-ctw and ppm. He shows that the resulting algorithm performs nearlyas well as the ppm-z on the Calgary Corpus while beating the ppm-z on the CanterburyCorpus. Previous studies on protein classiï¬cation used a variety of methods, including generative models such as HMMs (Sonnhammer et al., 1997), PSTs (Bejerano & Yona, 2001), anddiscriminative models such as Support Vector Machines (Ding & Dubchak, 2001; Leslieet al., 2004). Jaakkola et al. (1999) describe a model that combines the SVM discriminativemodel with a generative model (the HMM-based Fischer kernel). These methods can bequite eï¬ective for protein classiï¬cation at the family level but they are often expensive totrain and require some manual intervention to obtain optimal results (e.g., special input 411
Begleiter, El-Yaniv & Yona pre-processing). Other studies used unsupervised learning techniques, such as clustering(Yona et al., 1999; Tatusov et al., 1997) and self-organizing maps (Agraï¬otis, 1997). Whilethese methods can be applied successfully to broader data sets, they are not as accurate asthe supervised-learning algorithms mentioned above. Of all models, the HMM model is perhaps the most popular and successful model for protein classiï¬cation to-date (Park et al., 1998). The speciï¬c class of HMMs that are usedto model protein families (bio-HMM) has a special architecture that resembles the one usedin speech-recognition. These models are ï¬rst-order Markov models. The methods tested inthis paper belong to the general class of HMMs. However, unlike the ï¬rst-order bio-HMMs,the family of models tested here is capable of modeling higher order sources. Moreover,in many cases these models obtain comparable performance to bio-HMMs (see the workof Bejerano & Yona, 2001) and are easier to train than bio-HMMs, which makes them anappealing alternative. It should be noted that the classiï¬cation task that we attempted here has a diï¬erent setting than the typical setting in other studies on protein classiï¬cation. In most studies,the classiï¬cation focuses on the family level and a model is trained for each family. Herewe focused on the structural fold-family level. This is a more diï¬cult task and it hasbeen the subject of strong interest in fold prediction tasks (CASP, 2002). Predicting thestructural fold of a protein sequence is the ï¬rst substantial step toward predicting the three-dimensional structure of a protein, which is considered one of the most diï¬cult problemsin computational biology. The success rate of the VMM algorithms, and speciï¬cally the lz-ms algorithm, is espe- cially interesting since the sequences that make up a fold-family belong to diï¬erent proteinfamilies with diï¬erent properties. These families manifest diï¬erent variations over thesame structural fold; however, detecting these similarities has been proved to be a diï¬cultproblem (Yona & Levitt, 2002), and existing methods that are based on pure sequenceinformation achieve only moderate success in this task. Although we have not tested bio-HMMs, we suspect that such a model would be only moderately successful, since a single model architecture is incapable of describing the di-verge set of protein families that make up a fold-family. In contrast, the VMM modelstested here are capable of transcending the boundaries between diï¬erent protein familiessince they do not limit themselves to a speciï¬c architecture. These models have veryfew hyper-parameters and they can be quite easily trained from the data without furtherpre-processing. Other studies that addressed a somewhat similar problem to ours used se-quence comparison algorithms (McGuï¬n et al., 2001) or SVMs and Neural Networks (Ding& Dubchak, 2001) reporting a maximal accuracy of 56%. It is hard to compare our resultsto the results reported in these studies since they used diï¬erent data sets and a diï¬erentassessment methodology. However, the remarkably high success rate of the lz-ms model inthis task suggests that this model can be very eï¬ective for structural-fold prediction. 8. Concluding Remarks and Open Questions In this paper we studied the empirical performance of a number of prominent predictionalgorithms. We focused on prediction settings that are more closely related to those required 412
On Prediction Using Variable Order Markov Models by machine learning practitioners dealing with discrete sequences. Previous results relatedto these algorithms usually focused on a standard compression setting. While it should not be expected that one algorithm will consistently outperform oth- ers on all tasks, our rather extensive empirical evaluations over the three domains indicatethat there are prominent algorithms, which consistently tend to generate more accuratepredictions than the other algorithms we examine. These algorithms are the âprediction bypartial matchâ (ppm-c) and âdecomposed context tree weightingâ (de-ctw). For classiï¬-cation of proteins we observe that there is also a consistent winner. However, somewhatsurprisingly, the best predictor under the log-loss is not the best classiï¬er. On the contrary,the consistently best protein classiï¬er is based on the mediocre lz-ms predictor! This algo-rithm is a simple modiï¬cation of the well-known Lempel-Ziv-78 (lz78) prediction algorithm,which can capture VMMs with large contexts. The surprisingly good classiï¬cation accuracyachieved by this algorithm may be of independent interest to protein analysis research andclearly deserves further investigation. This relative success of lz-ms is related to the winner-takes-all classiï¬cation scheme we use. A classiï¬er based on this approach can suï¬er from a degraded performance if only oneor a few of the models generated (for some of the classes) are wrongly ï¬tted to other classes(in which case the erroneous model is the winner). Clearly, the superior performance oflz-ms is related to its robustness in this regard and a closer inspection of the behavior ofthe algorithms may be revealing. A possible explanation of this robustness could be relatedto the unbounded memory length of the lz-ms model. In most of the other VMM modelsthe memory length is a hyper-parameter. However, as discussed in Section 7, some of theother algorithms such as ppm can be extended to work without a known bound on thecontext length. An important observation, seen in these results, is that one cannot rely on log-loss (compression) performance when selecting a VMM algorithm for classiï¬cation (using aWTA approach). Since the classiï¬cation of sequences has not been studied as extensivelyas compression and prediction, we believe that much can be gained by focusing on specializedVMM algorithms for classiï¬cation. We hope that our contribution will assist practitioners in selecting suitable prediction algorithms for prediction and classiï¬cation tasks and we also provide the code of all thealgorithms we considered. Finally, we conclude by a number of open questions and directionsfor future research. One of the most successful general-purpose predictors we examined is the de-ctw al- gorithm, with its alphabet decomposition mechanism, as suggested by Volf (2002). As itturns out, this mechanism is crucial for the success of the the ctw scheme. However, cur-rently there are no redundancy bounds (or other type of performance guarantees) for thisdecomposition. It would be interesting to prove such a bound, which will also motivate anoptimality criterion for the decomposition-tree (see Section 3.4). Note that the heuristicproposed by Volf for generating the decomposition-tree relies on Huï¬manâs coding, but weare not familiar with a compelling reason for the success of this approach. Similarly, the ppm scheme (and the ppm-c variant we examined) do not have any known bounds on its log-loss redundancy. Clearly, such bounds could signiï¬cantly contribute tounderstanding this algorithm (and its variants) and may help in designing stronger versionsof this scheme. 413
Begleiter, El-Yaniv & Yona One of the main factors that inï¬uence the success of the ppm algorithm is the details of its escape mechanism, including the allocation of probability mass for the âescapeâ event. Thisprobability allocation is, in essence, a solution to the zero frequency problem (also calledâmissing massâ problem). So far, there are no formal justiï¬cations for the performanceof the more successful escape implementations. It would be interesting to see if some ofthe recent results on the missing mass problem, such as those presented by Orlitsky et al.(2003) and McAllester and Ortiz (2003), can help in devising a formal optimality measurefor the ppm escape implementations. We can view the ctw algorithm as an ensemble of numerous VMMs. However, the weights of the various VMM models are ï¬xed. Various ensemble methods in machine learn-ing such as boosting and online expert advice algorithms attempt to optimize the weightsof the various ensemble members. Could we achieve better performance guarantees andbetter empirical performance using adaptive weight learning approaches? Volf (2002) hassome results (on combining lz78 and de-ctw) indicating that this is a promising direction.Other possibilities would be to employ portfolio selection algorithms to dynamically changethe weights of several models, along the lines suggested by Kalai et al. (1999). 9. Acknowledgments We are very grateful to Gil Bejerano, Charles Bloom and Paul Volf who provided valuableadvice and assistance. We would also like to thank Bob Carpenter for his PPMC compres-sion java implementation, Moti Nisenson for his LZms java implementation and Adiel BenShalom for his MIDI assistance. Finally, we thank the anonymous referees for their goodcomments. The work of Ran El-Yaniv was partially supported by the Technion V.P.R. fundfor the promotion of sponsored research and the partial support of the PASCAL networkof excellence. Appendix A. Algorithm Hyper-Parameters Selection In our experimental protocol each algorithm optimizes its hyper-parameters during thetraining phase. The best found set of parameters is then used for training (see Section 4 fordetails). This appendix speciï¬es the possible hyper-parameter values that were consideredfor each algorithm for both the prediction and classiï¬cation experiments. These values wereselected (based on commonsense) before the start of any experiment. We also provide here,for each algorithm, the average values that were actually selected. We present these hyper-parameter values in two tables: Table 8 for the prediction setup, and Table 9, for the protein classiï¬cation setup. We used two diï¬erent hyper-parametersets for these two setups due to memory complexity issues. Speciï¬cally, in the classiï¬cationsetup, as described in Section 6, the length of training sequences is signiï¬cantly larger thanthe length of training sequences in the prediction experiment of Section 4. For example,in the protein classiï¬cation, class âCâ consists of 4303 sequences from 78 diï¬erent âfoldsâand the average size of a sequence in class âCâ is 265. Therefore, the size of the trainingsequence for this class is 912, 236. In comparison, the maximal training sequence length inthe text prediction experiment is 785396 = 392, 698. Also, recall that in each classiï¬cation 2 414
On Prediction Using Variable Order Markov Models experiment, the number of constructed models is the number of âfoldsâ (e.g., 78 in class âCâ),which also increases the memory overhead. The requirement to accommodate long training sequences (in the classiï¬cation experi- ment) directly inï¬uences the space used by most of the algorithms. The lengths of trainingsequences described above made it impossible for us to complete the protein classiï¬cationexperiment using the set of feasible parameters as described in Table 8. Therefore, forclassiï¬cation we reduced the set of possible hyper-parmeter and, as can be seen, each pa-rameter set in Table 9 is a subset of the corresponding set in Table 8. Finally, note thatthis table refers to a slight variant of the standard pst, denoted by pstâ, which was usedin the protein classiï¬cation experiment (see Section 6). In Tables 8 and 9 we also present the average (overall runs) of the hyper-parameter values selected by the cross-validated optimization. For example, in Table 8 it is interestingto see that diï¬erent values were selected for the lz-ms algorithm, for diï¬erent datasets. Forprotein sequence predictions lz-ms is optimized with values that make it almost identical tothe lz78 algorithm (i.e. M = 0 and S = 0), while on the music predictions these parametersare optimized with signiï¬cantly larger values (M = 6.12 and S = 16.11). Observe thatthe average order selected for ppm-c on âtextâ data is 9.25. When originally introduced,ppm algorithms were applied with a hyper-parameter D = 4 (constrained by the memoryavailable on computers at the time). Since then, it has often been used with a somewhatlarger value, typically 5 or 6 and it was observed that performance begins to deterioratewhen D is increased beyond this. For example, see Cleary and Teahan (1997, Figure 2) for adrawing of compression vs. D for book1. We observe very similar behavior when consideringthe English text ï¬les alone (not shown in our table). For example, the value selected thehyper-parameter D in most text ï¬les (including book1 ) is 5. The pst hyper-parameters presented in Tables 8 and 9 corresponds to the pst algorithm description in Section 3.5. Speciï¬cally, the Pmin (alternatively, hits) hyper-parameter isthe threshold used for ï¬ltering âsuï¬x setâ candidates, α and γ deï¬ne the ï¬rst condition ofthe pst second stage, r (alternatively, Nmin) deï¬nes the second condition of pstâs secondstage and D deï¬nes the maximal order of the pst âsuï¬x setâ. Algorithm Hyper-parameter Possible Values Selected Values (Average) Text Music Protein ppm-c D 1, 3, . . . , 19 9.25 16.12 1.85 bi-ctw D 8, 16, 32, 64 42 57.76 34.3 de-ctw D 2, 4, 8, 16, 32 7.06 27.97 17.9 pst Pmin 0.0001, 0.001, 0.01, 0.1 0.006 0.0003 0.001 α 0 0 0 0 γ 0.0001, 0.001, 0.01, 0.1 0.0006 0.0005 0.0001 r 1.05 1.05 1.05 1.05 D 5, 10, 15, 20 7.33 15.6 20 lz-ms M 0, 2, . . . , 6, 8 2 6.12 1.06 S 0, 2, . . . , 16, 18 8.4 16.11 0.29 Table 8: Prediction setup hyper-parameter values. 415
Begleiter, El-Yaniv & Yona Algorithm Hyper-parameter Possible Values Selected Values (Average) ppm-c D 1, 3, 5, 7, 9 8.5 bi-ctw D 8, 16, 32 31 de-ctw D 4, 8 7.5 pstâ hits 2, 3, 4 5 α 0 0 Nmin 2, 3, 4, 5 4 γ 0.001 0.001 r 1.05 1.05 D 10, 15, 20 20 lz-ms M 0, 2, 4, 6, 8 5.9 S 0, 2, 4, 6, 8 7.9 Table 9: Protein classiï¬cation setup hyper-parameter values. Recall that in this setup we used a slight variation of the pst algorithm. The pstâ variation uses the Nminparameter that deï¬nes the threshold for a context candidate context; and the hitsparameter deï¬nes the threshold for a context to be added to the pst tree. Appendix B. Music Representation Our music data consists of polyphonic music pieces given as MIDI ï¬les. A polyphonicmusic piece is usually represented in a MIDI ï¬le as a collection of channels. Each channelis typically associated with a single musical instrument and includes instructions (calledâeventsâ) regarding which notes to play and how long and loud to play them. We treateach channel as an individual sequence. For example, if a piece is played by ï¬ve instruments,encoded in ï¬ve MIDI channels, we generated ï¬ve individual sequences, one for each channel.Note that each entry in Table 4 is an average of all channels of the corresponding piece. We now describe how we represent each MIDI channel. Each note in a channel is represented as a triple: pitch:volume:duration. Both pitch and volume have, according tothe MIDI protocol, 128 possible values. Duration is measured in milliseconds. We alsoinclude a special note, which we call a âsilence noteâ and allow for negative durations of thissilence note. Such negative assignments are in fact negative time intervals, which make itpossible to represent chords or other simultaneous melodic lines within the same channel. The sequences corresponding to a channel are the ascii representation of the sequence of pitch:volume:duration. For example, the note 102:83:4022 is represented as â102:83:4022:â.29Clearly, this representation is over an alphabet Σ = 0, 1, . . . , 9, :, â, of size is 12. It is important to note that our representation preserves much of the information en- coded in the original MIDI representation and, in reality, allows one to almost exactlyreconstruct the original tunes. Appendix C. On the LZ-MS Hyper-Parameters This appendix provides an indication on the eï¬ectiveness of the M and S hyper-parametersof the lz-ms algorithm. The experimental setup is identical to the prediction setup describedin Section 4. Table 10 shows the log-loss results over the Calgary Corpus of applications of 29. The ï¬rst ten notes of Chopinâs Etude 12 Op. 10 are: â83:120:240: 128:0:-240: 79:120:240: 128:0:-240: 77:120:240: 128:0:-240: 74:120:240: 128:0:-240: 71:120:240: 128:0:-180:â. 416
On Prediction Using Variable Order Markov Models the lz-ms algorithm: in the second column only the M parameter is enabled (and S = 0);and the third column provides the results when only the S parameter is enabled (andM = 0). The losses in the last column correspond to applications where both M and S areenabled (this column is identical to the lz-ms results in Table 3). It is evident that the bestchoice of M alone is consistently better than the best choice of S alone. However, the bestcombination of both M and S is consistently better than any of these parameters alone. Sequence M S lz-ms (length ·103) (param. value) (param. value) bib(111) 2.67 (2) 2.63 (18) 2.26* (2,12) book1(785) 2.76 (2) 3.03 (18) 2.57* (2,12) book2(611) 2.80 (2) 3.06 (18) 2.63*(2,12) geo(102) 4.48 (0) 4.48 (0) 4.48 (0,0) news(377) 3.19 (2) 3.39 (12) 3.07* (2,4) obj1(22) 5.75* (2) 6.24 (10) 6 (2,2) obj2(247) 3.54 (2) 3.74 (18) 3.48* (2,4) paper1(53) 3.54 (2) 3.72 (14) 3.38* (2,12) paper2(82) 3.10 (2) 3.23 (18) 2.81* (2,12) paper3(47) 3.32 (2) 3.44 (18) 3.09* (2,12) paper4(13) 3.68 (2) 3.79 (10) 3.61* (2,14) paper5(12) 4.26 (2) 4.42 (18) 4.16* (2,8) paper6(38) 3.70 (2) 3.91 (10) 3.6* (2,12) pic(513) 0.79 (2) 0.79 (4) 0.79 (2,0) progc(40) 3.34 (2) 3.57 (10) 3.22* (2,4) progl(72) 3.26 (2) 3.50 (18) 3.26 (2,16) progp(49) 2.75 (2) 2.93 (18) 2.65* (2,18) trans(94) 2.85 (2) 2.89 (16) 2.61* (4,18) Average 3.32 ± 0.06 3.48 ± 0.06 3.2 ± 0.06â Table 10: Comparing eï¬ectiveness of the lz-ms hyper-parameters. Second column: M is assigned the best value from the âfeasibleâ set 0, 2, 4, 6, 8 and S = 0. Thirdcolumn: S is assigned the best value from 0, 2, . . . , 16, 18 and M = 0. Lastcolumn: the best combination of both M and S is selected from the two feasiblesets, respectively. Best values are determined using a ï¬ve-fold CV. Each entryin this table provides the average log-loss and the corresponding value of thehyper-parameter that was used for obtaining this log-loss. For example, the 2.63loss for the bib ï¬le when optimizing S alone is obtained with S = 18. References Abe, N., & Warmuth, M. (1992). On the computational complexity of approximating probability distributions by probabilistic automata. Machine Learning, 9, 205â260. Aberg, J., & Shtarkov, Y. (1997). Text compression by context tree weighting. In Proceedings Data Compression Conference (DCC), pp. 377â386. IEEE Computer Society Press. Agraï¬otis, D. (1997). A new method for analyzing protein sequence relationships based on sammon maps. Protein Science, 6, 287â293. Arnold, R., & Bell, T. (1997). A corpus for the evaluation of lossless compression algorithms. In Designs, Codes and Cryptography, pp. 201â210. 417
Begleiter, El-Yaniv & Yona Bar-Joseph, Z., El-Yaniv, R., Lischinski, D., & Werman, M. (2001). Texture mixing and texture movie synthesis using statistical learning. IEEE Transactions on Visualization and ComputerGraphics, 7 (2), 120â135. Bejerano, G., & Yona, G. (2001). Variations on probabilistic suï¬x trees: Statistical modeling and the prediction of protein families.. Bioinformatics, 17 (1), 23â43. Bell, T., Cleary, J., & Witten, I. (1990). Text Compression. Prentice-Hall, Inc.Bloom, C. (1998). Solving problems of context modeling. http://www.cbloom.com/papers/index.html. Bunton, S. (1997). Semantically motivated improvements for PPM variants. The Computer Journal, 40 (2/3), 76â92. Burrows, M., & Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm.. Tech. rep. 124, Digital Equipement Corporation. CASP (2002). Protein structure prediction center. http://predictioncenter.llnl.gov/.Cheney, J. (2001). Compressing XML with multiplexed hierarchical PPM models. In Data Com- pression Conference, pp. 163â172. Clarkson, B., & Pentland, A. (2001). Predicting daily behavior via wearable sensors. Tech. rep. Vismod TR540, MIT. Cleary, J., & Teahan, W. (1997). Unbounded length contexts for PPM. Computer Journal, 40, 67â75. Cleary, J., & Witten, I. (1984). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, COM-32 (4), 396â402. Cover, T., & King, R. (1978). A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24 (4), 413â421. Cover, T., & Thomas, J. (1991). Elements of Information Theory. John Wiley and Sons, Inc.Dayhoï¬, M. (1976). The origin and evolution of protein superfamilies. In Federation Proceedings, pp. 2132â2138. Deorowicz, S. (2003). Universal lossless data compression algorithms. Ph.D. thesis, Silesian Univer- sity of Technology. Ding, C., & Dubchak, I. (2001). Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349â358. Drinic, M., & Kirovski, D. (2002). PPMexe: PPM for compressing software. In Data Compression Conference, pp. 192â201. Dubnov, S., Assayag, G., Lartillot, O., & Bejerano, G. (2003). Using machine-learning methods for musical style modeling. IEEE Computer, 36 (10), 73â80. Feder, M., & Merhav, N. (1994). Relations between entropy and error probability. IEEE Transactions on Information Theory, 40 (1), 259â266. Holm, L., & Sander, C. (1994). Parser for protein folding units. Proteins, 19, 256â268.Howard, P. (1993). The design and analysis of eï¬cient lossless data compression systems. Ph.D. thesis, Brown University. Jaakkola, T., Diekhans, M., & Haussler., D. (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on IntelligentSystems for Molecular Biology, pp. 149â158. Jin, L., Fang, W., & Tang, H. (2003). Prediction of protein structural classes by a new measure of information discrepancy. Computational Biology and Chemistry, 27, 373â380. 418
On Prediction Using Variable Order Markov Models Kalai, A., Chen, S., Blum, A., & Rosenfeld, R. (1999). On-line algorithms for combining language models. In Proceedings of the International Conference on Accoustics, Speech, and SignalProcessing. Kermorvant, C., & Dupont, P. (2002). Improved smoothing for probabilistic suï¬x trees seen as variable order Markov chains. In European Conference on Machine Learning (ECML), pp.185â194. Krichevsky, R., & Troï¬mov, V. (1981). The performance of universal encoding. IEEE Transactions on Information Theory, 27, 199â207. Kumarevel, T., Gromiha, M., & Ponnuswamy, M. (2000). Structural class prediction: An application of residue distribution along the sequence. Biophys Chem., 88, 81â101. Langdon, G. (1983). A note on the Ziv-Lempel model for compressing individual sequences. IEEE Transactions on Information Theory, 29, 284â 287. Lesk, A., & Rose, G. (1981). Folding units in globular proteins. In Proceedings of the National Academy of Sciences, No. 78, pp. 4304â4308. Leslie, C., Eskin, E., Cohen, A., Weston, J., & Noble, W. (2004). Mismatch string kernels for discriminative protein classiï¬cation. Bioinformatics, 20, 467â476. Levitt, M., & Chothia, C. (1976). Structural patterns in globular proteins. Nature, 261, 552â558.Manzini, G. (2001). An analysis of the burrows-wheeler transform. Journal of the ACM, 48 (3), 407â430. McAllester, D., & Ortiz, L. (2003). Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, 4, 895â911. McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In Proc. 17th International Conf. on Machine Learning, pp.591â598. Morgan Kaufmann, San Francisco, CA. McGuï¬n, L., Bryson, K., & Jones, D. (2001). What are the baselines for protein fold recognition?. Bioinformatics, 17, 63â72. Merhav, N., & Feder, M. (1995). A strong version of the redundancy-capacity theorem of universal coding. IEEE Transactions on Information Theory, 41 (3), 714â722. Miller, J., Goodman, R., & Smyth, P. (1993). On loss functions which minimize to conditional expected values and posterior probabilities. IEEE Transactions on Information Theory, 39 (4),1404â1408. Moï¬at, A. (1990). Implementing the PPM data compression scheme. IEEE Transactions on Com- munications, 38 (11), 1917â1921. Murzin, A., Brenner, S., Hubbard, T., & Chothia, C. (1995). SCOP: A structural classiï¬cation of proteins database for the investigation of sequences and structures. Journal of MolecularBiology, 247, 536â540. Nelson, M., & Gailly, J. (1996). The Data Compression Book (2nd ed.). MIS:Press.Nevill-Manning, C., & Witten, I. (1999). Protein is incompressible. In Data Compression Conference, pp. 257â266. Nisenson, M., Yariv, I., El-Yaniv, R., & Meir, R. (2003). Towards behaviometric security systems: Learning to identify a typist. In The 7th European Conference on Principles and Practice ofKnowledge Discovery in Databases. Orlitsky, A., Santhanam, N., & Zhang, J. (2003). Always Good Turing: Asymptotically optimal probability estimation. Science, 302 (5644), 427â431. Pachet, F. (2002). Playing with virtual musicians: The continuator in practice. IEEE MultiMedia, 9 (3), 77â82. 419
Begleiter, El-Yaniv & Yona Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., & Chothia, C. (1998). Sequence comparisons using multiple sequences detect three times as many remote homologuesas pairwise methods. Journal of Molecular Biology, 284, 1201â1210. Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recog- nition. Proceedings of the IEEE, 77 (3), 257â286. Rabiner, L., & Juang, B. (1993). Fundamentals of Speech Recognition. Prentice-Hall, Inc.Rissanen, J. (1983). A universal data compression system. IEEE Transactions on Information Theory, 29 (5), 656â664. Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30 (4), 629â636. Rissanen, J., & Langdon, G. (1979). Arithmetic coding. IBM Journal of Research and Development, 23 (2), 149â162. Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25 (2â3), 117â149. Rose, G. (1979). Hierarchic organization of domains in globular proteins. Journal of Molecular Biology, 134, 447â470. Sadakane, K., Okazaki, T., & Imai, H. (2000). Implementing the context tree weighting method for text compression. In Data Compression Conference, pp. 123â132. Savari, S. (1997). Redundancy of the Lempel-Ziv incremental parsing rule. IEEE Transactions on Information Theory, 43, 9â21. Sch¨ utze, H., & Singer, Y. (1994). Part-of-speech tagging using a variable memory Markov model. In Proceedings of the 32nd Conference on Association for Computational Linguistics, pp. 181â187. Association for Computational Linguistics. Shkarin, D. (2002). PPM: One step to practicality. In Data Compression Conference, pp. 202â212.Singer, Y. (1997). Adaptive mixtures of probabilistic transducers. Neural Computation, 9 (8), 1711â 1733. Singer, Y., & Tishby, N. (1994). Dynamical encoding of cursive handwriting. Biological Cybernetics, 71 (3), 227â237. Sonnhammer, E., Eddy, S., & Durbin, R. (1997). Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins, 28, 405â420. Tatusov, R., Eugene, V., & David, J. (1997). A genomic perspective on protein families. Science, 278, 631â637. Teahan, W., & Harper, D. (2001). Using compression based language models for text categoriza- tion. In Workshop on Language Modeling and Information Retrieval, ARDA, Carnegie MellonUniversity, pp. 83â88. Thaper, N. (2001). Using compression for source based classiï¬cation of text. Masterâs thesis, Massachusetts Institute of Technology. Tjalkens, T., Volf, P., & Willems, F. (1997). A context-tree weighting method for text generating sources. In Data Compression Conference, p. 472. Tjalkens, T., & Willems, F. (1997). Implementing the context-tree weighting method: Arithmetic coding. In International Conference on Combinatorics, Information Theory and Statistics,p. 83. Volf, P. (2002). Weighting Techniques in Data Compression Theory and Algorithms. Ph.D. thesis, Technische Universiteit Eindhoven. Willems, F. (1998). The context-tree weighting method: Extensions. IEEE Transactions on Infor- mation Theory, 44 (2), 792â798. 420
On Prediction Using Variable Order Markov Models Willems, F., Shtarkov, Y., & Tjalkens, T. (1995). The context-tree weighting method: Basic prop- erties. IEEE Transactions on Information Theory, 653â664. Witten, I., & Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37 (4), 1085â1094. Witten, I., Bray, Z., Mahoui, M., & Teahan, B. (1999). Text mining: A new frontier for lossless compression. In Proceedings of the Conference on Data Compression, pp. 198â207. IEEEComputer Society. Yona, G., & Levitt, M. (2002). Within the twilight zone: A sensitive proï¬le-proï¬le comparison tool based on information theory. Journal of Molecular Biology, 315, 1257â1275. Yona, G., Linial, N., & Linial, M. (1999). Protomap: Automatic classiï¬cation of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins, 37, 360â378. Ziv, J., & Lempel, A. (1978). Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24, 530â536. 421