K. Kersting, L. De Raedt and T. Raiko
Volume 25, 2006
Links to Full Text:Journal of Artificial Intelligence Research 25 {2006} 425{456 Submitted 12/04;
published4/06Logical Hidden Marko v Mo dels
Kristian Kersting kersting@informatik.uni-freiburg.de
Luc De Raedt deraedt@informatik.uni-freiburg.de
Institute for Computer Science Albert-Ludwigs-Universitªat Freibur g
Ge or ges-Ko ehler-Al lee 079
D-79110 Freibur g, Germany Tapani Raiko tap ani.raiko@hut.fi
Lab or atory of Computer and Information Science
Helsinki University of Technolo gy
P.O. Box 5400 FIN-02015 HUT, Finland
Abstract Logical hidden Marko v mo dels {LOHMMs} upgrade traditional hidden
Marko v mo dels
to deal with sequences of structured symb ols in the form of logical atoms,
rather than flat
characters.
This note formally intro duces LOHMMs and presents solutions to the three
central in-
ference problems for LOHMMs: evaluation, most likely hidden state sequence
and param- eter estimation. The resulting representation and algorithms are
exp erimentally evaluated
on problems from the domain of bioinformatics.
1. Intro duction
Hidden Marko v mo dels {HMMs} {Rabiner & Juang, 1986} are extremely p opular
for an-
alyzing sequential data. Application areas include computational biology,
user mo delling, sp eech recognition, empirical natural language pro cessing,
and rob otics. Despite their suc-
cesses, HMMs hav e a ma jor weakness: they handle only sequences of flat,
i.e., unstruc-
tured symb ols. Yet, in many applications the symb ols o ccurring in sequences
are struc- tured. Consider, e.g., sequences of UNIX commands, which may hav
e parameters such
as emacs lohmms:tex; ls; latex lohmms:tex; : : :Thus, commands are essentially
structured. Tasks that hav e b een considered for UNIX command sequences
include the prediction of
the next command in the sequence {Davison & Hirsh, 1998}, the classification
of a command
sequence in a user category {Korvemak er & Greiner, 2000; Jacobs & Blo ck
eel, 2001}, and anomaly detection {Lane, 1999}. Traditional HMMs cannot easily
deal with this typ e of
structured sequences. Indeed, applying HMMs requires either 1} ignoring
the structure of the commands {i.e., the parameters}, or 2} taking all p
ossible parameters explicitly into
accoun t. The former approach results in a serious information loss; the
latter leads to a combinatorial explosion in the num b er of symb ols and
parameters of the HMM and as a consequence inhibits generalization. The ab
ov e sketc hed problem with HMMs is akin to the problem of dealing with struc-
tured examples in traditional machine learning algorithms as studied in
the fields of in- ductive logic programming {Muggleton & De Raedt, 1994}
and multi-relational learn-c
fl2006 AI Access Foundation. All rights reserved.Kersting, De Raedt, & Raikoing
{D024zeroski & Lavra 024c, 2001}. In this pap er, we prop ose an {inductive}
logic programming
framework, LogicalHMMs {LOHMMs}, that upgrades HMMs to deal with structure.
The
key idea underlying LOHMMs is to employ logical atoms as structured {output
and state}
symb ols. Using logical atoms, the ab ov e UNIX command sequence can b e
represented
as emacs{lohmms:tex} ; ls; latex{lohmms:tex} ; : : : There are tw o imp
ortant motivations for
using logical atoms at the symb ol level. First, variables in the atoms
allow one to make
abstraction of sp ecific symb ols. E.g., the logical atom emacs{X; tex}
represents all files X
that a L
A
T
E
X user tex could edit using emacs. Second, unification allows one to share
in-
formation among states. E.g., the sequence emacs{X; tex}; latex{X; tex}
denotes that the same file is used as an argument for b oth Emacs and L A
T
E X.
The pap er is organized as follows. After reviewing the logical preliminaries,
we intro duce
LOHMMs and define their semantics in Section 3; in Section 4, we upgrade
the basic
HMM inference algorithms for use in LOHMMs; we inv estigate the b enefits
of LOHMMs in
Section 5: we show that LOHMMs are strictly more expressive than HMMs, that
they can
b e - by design - an order of magnitude smaller than their corresp onding
prop ositional
instantiations, and that unification can yield mo dels, which b etter fit
the data. In Section 6,
we empirically inv estigate the b enefits of LOHMMs on real world data.
Before concluding,
we discuss related work in Section 7. Pro ofs of all theorems can b e found
in the App endix.
2. Logical Preliminaries
A first-order alphabet 006 is a set of relation symb ols r with arity m
025 0, written r=m, and a
set of functor symb ols f with arity n 025 0, written f=n. If n = 0 then
f is called a constant,
if m = 0 then p is called a prop ositional variable. {We assume that at
least one constant
is given.} An atom r{t
1 ; : : : ; t
n
} is a relation symb ol r follow ed by a brack eted n-tuple of terms t
i
. A term t is a variable V or a functor symb ol f{t
1 ; : : : ; t
k
} immediately follow ed by
a brack eted k -tuple of terms t
i
. Variables will b e written in upp er-case, and constant, func- tor and
predicate symb ols low er-case. The symb olwill denote anonymous variables
which are read and treated as distinct, new variables each time they are
encountered. An iterative clause is a formula of the form H B where H {called
head} and B {called bo dy} are logical
atoms. A substitution 022 = fV
1
=t
1
; : : : ; V
n
=t n
g, e.g. fX=texg, is an assignment of terms t i
to variables V
i
. Applying a substitution 033 to a term, atom or clause e yields the instanti-
ated term, atom, or clause e033 where all o ccurrences of the variables
V
i
are simultaneously
replaced by the term t
i
, e.g. ls{X} emacs{F; X}fX=texg yields ls{tex} emacs{F; tex}.
A substitution 033 is called a unifier for a finite set S of atoms if S
033 is singleton. A unifier 022
for S is called a most general unifier {MGU} for S if, for each unifier
033 of S , there exists a substitution fl such that 033 = 022 fl . A term,
atom or clause E is called ground when it contains
no variables, i.e., vars{E} = ;. The Herbrand base of 006, denoted as hb
006
, is the set of all
ground atoms constructed with the predicate and functor symb ols in 006.
The set G
006
{A} of
an atom A consists of all ground atoms A022 that b elong to hb
006
.
3. Logical Hidden Marko v Mo dels
The logical comp onent of a traditional HMM corresp onds to a Mealy machine
{Hop croft
& Ullman, 1979}, i.e., a finite state machine where the output symb ols
are asso ciated with426Logical Hidden Marko v Modelstransitions. This is
essentially a prop ositional representation b ecause the symb ols used to
represent states and output symb ols are flat, i.e. not structured. The
key idea underlying
LOHMMs is to replace these flat symb ols by abstract symb ols. An abstract
symb ol A is -
by definition - a logical atom. It is abstract in that it represents the
set of all ground, i.e., variable-free atoms of A ov er the alphab et 006,
denoted by G
006 {A}. Ground atoms then play
the role of the traditional symb ols used in a HMMs.Example 1Consider the
alphabet 006
1 which has as constant symbols tex, dvi, hmm1,
and lohmm1, and as relation symbols emacs=2, ls=1, xdvi=1, latex=2. Then
the atom
emacs{File; tex} repr esents the set femacs{hmm1; tex} ; emacs{lohmm1; tex}
g. We assume
that the alphabet is type d to avoid useless instantiations such as emacs{tex;
tex}}.
The use of atoms instead of flat symb ols allows us to analyze logical and
structured sequences
such as emacs{hmm1; tex} ; latex{hmm1; tex}; xdvi{hmm1; dvi} .Definition
1Abstract transition are expressions of the form p : H
O
000 B where p 2 [0; 1], and H, B and O are atoms. Al l variables are implicitly
assumed to be universal ly quantified,
i.e., the scop e of variables is a single abstract transition.
The atoms H and B represent abstract states and O represents an abstract
output symb ol. The semantics of an abstract transition p : H
O 000 B is that if one is in one of the states in
G 006
{B}, say B022
B , one will go with probability p to one of the states in G
006
{H022
B
}, say H022
B
022
H
, while emitting a symb ol in G
006
{O022
B
022 H
}, say O022
B
022 H
022
O .Example 2Consider c 021 0:8 : xdvi{File; dvi}
latex{File}
000000 000 000 000 000000 latex{File; tex} . In general
H, B and O do not have to share the same pre dic ate. This is only due to
the na-
ture of our running example. Assume now that we are in state latex{hmm1;
tex} , i.e.
022
B = fFile=hmm1g. Then c spe cifies that there is a prob ability of 0.8 that
the next state
wil l be in G
006 1
{xdvi{hmm1; dvi} } = fxdvi{hmm1; dvi}g { i.e., the prob ability is 0:8 that
the
next state wil l be xdvi{hmm1; dvi}}, and that one of the symbols in G
006
1
{latex{hmm1}} =
flatex{hmm1}g { i.e., latex{hmm1}} wil l be emitted. Abstract states might
also be more
c omplex such as latex{file{FileStem; FileExtension}; User}
The ab ov e example was simple b ecause 022
H
and 022
O
were b oth empty. The situation b e-
comes more complicated when these substitutions are not empty. Then, the
resulting state and output symb ol sets are not necessarily singletons. Indeed,
for the transi-
tion 0:8 : emacs{File 0
; dvi} latex{File}
000000 000 000 000 000000 latex{File; tex} theresulting state set
would b e
G
006
1
{emacs{File
0
; dvi}} = femacs{hmm1; tex}; emacs{lohmm1; tex} g. Thus the transition
is non-deterministic b ecause there are tw o p ossible resulting states.
We therefore need a
mechanism to assign probabilities to these p ossible alternatives.Definition
2The selection distribution 026 spe cifies for each abstract state and observation
symbol A over the alphabet 006 a distribution026{001 j A} over G
006
{A}.
To contin ue our example, let 026{emacs{hmm1; tex} j emacs{File
0
; tex}} = 0:4 and 026{emacs{lohmm1; tex} j emacs{File
0
; tex}} = 0:6. Then there would b e a probabil-
ity of 0:4 002 0:8 = 0:32 that the next state is emacs{hmm1; tex} and of
0:48 that it is
emacs{lohmm1; tex} .427Kersting, De Raedt, & RaikoTaking 026 into account,
the meaning of an abstract transition p : H
O 000 B can b e sum-
marized as follows. Let B022
B 2 G
006
{B}, H022
B
022
H
2 G
006
{H022
B
} and O022
B
022
H
022 O
2 G
006
{O022
B
022
H
}. Then the
mo del makes a transition from state B022
B
to H022
B
022 H
and emits symb ol O022
B
022
H
022 O
with probability p 001 026{H022
B
022 H
j H022 B
} 001 026{O022
B
022
H
022 O
j O022 B
022
H }: {1}
To represent 026, any probabilistic representation can - in principle -
b e used, e.g. a Bay esian
net w ork or a Marko v chain. Throughout the remainder of the present pap
er, how ev er, w e will use a naª020ve Bayes approach. More precisely, we
asso ciate to each argument of a relation r=m a finite domain D
r=m
i
of constants and a probability distribution P
r=m i
ov er
D
r=m
i
. Let vars{A} = fV
1
; : : : ; V
l
g b e the variables o ccurring in an atom A ov er r=m, and
let 033 = fV 1
=s
1 ; : : : V
l =s
l
g b e a substitution grounding A. Each V
j
is then considered a random variable ov er the domain D
r=m arg {V
j }
of the argument arg {V
j } it app ears first in. Then,
026{A033 j A} =
Q l
j =1
P
r=m
arg {V
j
} {s
j
}. E.g. 026{emacs{hmm1; tex} j emacs{F; E}}, is computed as the
pro duct of P
emacs=2 1
{hmm1} and P
emacs=2
2
{tex}.
Thus far the semantics of a single abstract transition has b een defined.
A LOHMM
usually consists of multiple abstract transitions and this creates a further
complication.Example 3Consider 0:8 : latex{File; tex}
emacs{File}
000000 000 000 000 000000 emacs{File; tex} and 0:4 : dvi{File}
emacs{File}
000000 000 000 000 000000 emacs{File; User} . These two abstract
transitions make
conflicting statements about the state resulting from emacs{hmm1; tex} .
Indee d, acc or ding
to the first transition, the prob ability is 0:8 that the resulting state
is latex{hmm1; tex} and
acc or ding to the sec ond one it assigns 0:4 to xdvi{hmm1}. There are essentially
tw o wa ys to deal with this situation. On the one hand, one might wan t
to combine and normalize the tw o transitions and assign a probability of
23
resp ectively
13
.
On the other hand, one might wan t to hav e only one rule firing. In this
pap er, we chose the
latter option b ecause it allows us to consider transitions more indep endently,
it simplifies learning, and it yields lo cally interpretable mo dels. We
employ the subsumption {or gen-
erality} relation among the B-parts of the tw o abstract transitions. Indeed,
the B-part of the first transition B
1
= emacs{File; tex} is more sp ecific than that of the second transi- tion
B
2
= emacs{File; User} b ecause there exists a substitution 022 = fUser=texg
such that B
2
022 = B
1
, i.e., B
2
subsumes B
1
. Therefore G
006
1 {B
1
} 022 G
006 1
{B
2
} and the first transition can
b e regarded as more informative than the second one. It should therefore
b e preferred ov er
the second one when starting from emacs{hmm1; tex} . We will also say that
the first tran-
sition is more spe cific than the second one. Remark that this generality
relation imp oses a
partial order on the set of all transitions. These considerations lead to
the strategy of only
considering the maximally sp ecific transitions that apply to a state in
order to determine
the successor states. This implements a kind of exception handling or default
reasoning and is akin to Katz's {1987} back-off n-gram mo dels. In back-off
n-gram mo dels, the most
detailed mo del that is deemed to provide sufficiently reliable information
ab out the current
con text is used. That is, if one encounters an n-gram that is not sufficiently
reliable, then
back-off to use an {n 000 1}-gram; if that is not reliable either then
back-off to level n 000 2, etc.
The conflict resolution strategy will work prop erly provided that the b
o dies of all max-
imally sp ecific transitions {matching a given state} represent the same
abstract state. This428Logical Hidden Marko v ModelsSTARTcc
EMACS1cc
EMACS2ccEMACS3cc
LATEXcc
LScc
EMACS1O1ccEMACS1O2cc
EMACS3O1cc
EMACS3O2cc
EMACS3O3ccLATEXO1cc
LATEXO2cc
LATEXO3cc
LSO1ccLSO2cc
STARTO1cc
STARTO2cc
Kristian KerstingEMACS1EMACS2EMACS3LATEXLSSTARTLATEXO1EMACS1O1EMACS1O2STARTO1STARTO2LATEXO2LSO1LATEXO3EMACS3O1EMACS3O2EMACS3O3LSO2PSfrag
replacementsstartemacs{F; U}emacs{F
0
; U}emacs{F; tex}latex{F; tex}ls{U
0
}emacs{F} : 0:7emacs{F} : 0:3emacs{F} : 0:3emacs{F} : 0:1emacs{F} : 0:6latex{F}
: 0:6latex{F} : 0:2latex{F} : 0:2ls : 0:6ls : 0:40:550:45Figure 1: A logical
hidden Marko v mo del.can b e enforced by requiring the generality relation
ov er the B-parts to b e closed under the
gre atest lower bound {glb} for each predicate, i.e., for each pair B 1
; B
2
of b o dies, such that
022 = mgu{B
1
; B
2
} exists, there is another b o dy B {called low er b ound} which subsumes
B
1 022
{therefore also B
2
022 } and is subsumed by B
1
; B
2 , and if there is any other low er b ound then it is subsumed by B. E.g.,
if the b o dy of the second abstract transition in our example is
emacs{hmm1; User} thenthe set of abstract transitions would not b e closed
under glb. Finally, in order to sp ecify a prior distribution ov er states,
we assume a finite set 007 of
clauses of the form p : H start using a distinguished start symb ol such
that p is the
probability of the LOHMM to start in a state of G
006
{H}.
By now we are able to formally define logic al hidden Markov models.Definition
3A logic al hidden Markov model {LOHMM} is a tuple {006; 026; 001; 007}
where 006 is
a logic al alphabet, 026 a selection prob ability over 006, 001 is a
set of abstract transitions, and007
is a set of abstract transitions enco ding a prior distribution. Let B be
the set of al l atoms
that oc cur as bo dy parts of transitions in 001. We assume B to be closed
under glb and re quir e
8B 2 B :
X
p:H
O
000B2001
p = 1:0 {2}
and that the prob abilities p of clauses in 007 sum up to 1:0 .
HMMs are a sp ecial cases of LOHMMs in which 006 contains only relation
symb ols of arity
zero and the selection probability is irrelevant. Thus, LOHMMs directly
generalize HMMs.
LOHMMs can also b e represented graphically. Figure 1 contains an example.
The under-
lying language 006 2
consists of 006 1
together with the constant symb ol other which denotes a
user that do es not employ L
A
T
E
X. In this graphical notation, no des represent abstract states
and black tippe d arrows denote abstract transitions. White tippe d arrows
are used to repre-
sent meta knowledge. More precisely, white tippe d, dashed arrows represent
the generality or
subsumption ordering b etw een abstract states. If we follow a transition
to an abstract state
with an outgoing white tippe d, dotted arrow then this dotted arrow will
alwa ys b e follow ed.
Dotted arrows are needed b ecause the same abstract state can o ccur under
different cir-
cumstances. Consider the transition p : latex{File
0
; User 0
}
latex{File}
000000 000 000 000 000000 latex{File; User} .429Kersting, De Raedt,
& RaikoemacsFUcc
emacsltcc
emacsFtccemacslcc
latexFtcc
latexltcc
latexlccemacsFpUcc
emacsoocc
emacsocc
lsUpcclstcc
lscc
mucc
mupccmuppcc
startcc
atomcc
stateccKristian Kerstingemacsl0.6atomemacsFt1.0latexltmupemacsFpUatomemacsFUstart0.45emacsFUatommuemacsltstateemacsoostateatommuppls0.4lsUpatomlatexFtatomstatelststateemacso0.70.6latexlPSfrag
replacementsem{F; U}em{f
1
; t}em{F; t}em{f
1
}la{F; t}la{f
1
; t}la{f 1
}em{F
0 ; U}em{f
2
; o}em{f
2
}ls{U 0
}ls{t}ls026026026startabstract statestateFigure 2:Generating the observation
sequence emacs{hmm1}; latex{hmm1};
emacs{lohmm1}; ls by the LOHMM in Figure 1. The command emacs is
abbreviated by em, f
1
denotes the filename hmm1, f 2
represents lohmm1, t denotes
a tex user, and o some other user. White tipp ed solid arrows indicate selections.Even
though the atoms in the head and b o dy of the transition are syntactically
different they represent the same abstract state. To accurately represent
the meaning of this transition we cannot use a black tipp ed arrow from latex{File;
User} to itself, b ecause this would actu-
ally represent the abstract transition p : latex{File; User}
latex{File}
000000 000 000 000 000000 latex{File; User} .
Furthermore, the graphical representation clarifies that LOHMMs are generative
mo d-
els. Let us explain how the mo del in Figure 1 would generate the observation
sequence
emacs{hmm1}; latex{hmm1}; emacs{lohmm1} ; ls {cf. Figure 2}. It cho oses
an initial ab- stract state, say emacs{F; U}. Since b oth variables F and
U are uninstantiated, the mo del
samples the state emacs{hmm1; tex} from G
006
2
using 026. As indicated by the dashed ar- row, emacs{F; tex} ismore sp
ecific than emacs{F; U}. Moreov er, emacs{hmm1; tex} matches
emacs{F; tex}. Thus, the mo del enters emacs{F; tex}. Since the value of
F was already
instantiated in the previous abstract state, emacs{hmm1; tex} issampled
with probability
1:0. Now, the mo del go es ov er to latex{F; tex}, emitting emacs{hmm1}
b ecause the abstract observation emacs{F} is already fully instantiated.
Again, since F was already instantiated,
latex{hmm1; tex} is sampled with probability 1:0. Next, we mov e on to emacs{F
0 ; U}, emit-
ting latex{hmm1}. Variables F
0 and U in emacs{F 0
; U} were not yet b ound; so, values, say
lohmm1 and others, are sampled from 026. The dotted arrow brings us back
to emacs{F; U}.
Because variables are implicitly universally quantified in abstract transitions,
the scop e of
variables is restricted to single abstract transitions. In turn, F is treated
as a distinct,
new variable, and is automatically unified with F
0
, which is b ound to lohmm1. In contrast,
variable U is already instantiated. Emitting emacs{lohmm1} , the mo del
makes a transition to ls{U
0 }. Assume that it samples tex for U
0
. Then, it remains in ls{U
0
} with probability
0:4 . Considering all p ossible samples, allows one to prov e the following
theorem.Theorem 1 {Semantics}A logic al hidden Markov model over a language
006 defines a discrete time stochastic pro c ess, i.e., a sequenc e of random
variables hX t
i
t=1;2;:::
, where the
domain of X t
is hb{006} 002 hb{006}. The induce d prob ability measur e over the Cartesian
pro duct N
t
hb{006} 002 hb{006} exists and is unique for each t > 0 and in the limit
t ! 1.
Before concluding this section, let us address some design choices underlying
LOHMMs.
First, LOHMMs hav e b een intro duced as Mealy machines, i.e., output symb
ols are
asso ciated with transitions. Mealy machines fit our logical setting quite
intuitiv ely as they
directly enco de the conditional probability P {O; S 0
jS} of making a transition from S to S
0430Logical Hidden Marko v Modelsemitting an observation O. Logical hidden
Marko v mo dels define this distribution as P {O; S 0
jS} = X
p:H
O
0
000B
p 001 026{S
0
j H033
B
} 001 026{O j O
0
033
B
033
H
}
where the sum runs ov er all abstract transitions H
O
0 000 B such that B is most sp ecific for S.
Observations corresp ond to {partially} observed pro of steps and, hence,
provide information
shared among heads and b o dies of abstract transitions. In contrast, HMMs
are usually intro duced as Moor e machines. Here, output symb ols are asso
ciated with states implicitly
assumingO and S 0
to b e indep endent. Thus, P {O; S
0 j S} factorizes into P {O j S} 001 P {S
0
j S}.
This makes it more difficult to observe information shared among heads and
b o dies. In
turn, Mo ore-LOHMMs are less intuitiv e and harder to understand. For a
more detailed discussion of the issue, we refer to App endix B where we essentially
show that { as in the
prop ositional case { Mealy- and Mo ore-LOHMMs are equivalent.
Second, the naª020ve Bay es approach for the selection distribution reduces
the mo del com-
plexity at the exp ense of a low er expressivity: functors are neglected
and variables are
treated indep endently. Adapting more expressive approaches is an interesting
future line of
research. For instance, Bayesian networks allow one to represent factorial
HMMs {Ghahra-
mani & Jordan, 1997}. Factorial HMMs can b e viewed as LOHMMs, where the
hidden
states are summarized by a 2 001 k -ary abstract state. The first k arguments
enco de the k
state variables, and the last k arguments serve as a memory of the previous
joint state. 026
of the i-th argument is conditioned on the i + k -th argument. Markov chains
allow one to
sample comp ound terms of finite depth such as s{s{s{0})} and to mo del
e.g. missp elled filenames. This is akin to generalize d HMMs {Kulp, Haussler,
Reese, & Eeckman, 1996}, in which each no de may output a finite sequence
of symb ols rather than a single symb ol.
Finally, LOHMMs { as intro duced in the present pap er { sp ecify a probability
distri-
bution ov er all sequences of a given length. Reconsider the LOHMM in Figure
1. Al-
ready the probabilities of all observation sequences of length 1, i.e.,
ls, emacs{hmm1}, and emacs{lohmm1} } sum up to 1. More precisely, for each
t > 0 it holds that
P
x
1
;:::;x t
P {X 1
=
x 1
; : : : ; X t
= x
t
} = 1:0 . In order to mo del a distribution ov er sequences of variable
length, i.e.,
P
t>0
P
x
1
;:::;x
t P {X
1 = x
1
; : : : ; X
t = x
t
} = 1:0 we may add a distinguished end state. The end state is absorbing
in that whenever the mo del makes a transition into this state, it terminates
the observation sequence generated.
4. Three Inference Problems for LOHMMs
As for HMMs, three inference problems are of interest. Let M b e a LOHMM
and let
O = O
1
; O 2
; : : : ; O T
, T > 0, b e a finite sequence of ground observations:{1} Evaluation:Determine
the probability P {O j M } that sequence O was generated by
the mo del M .{2} Most likely state sequence:Determine the hidden state
sequence S
003
that has most likely pro duced the observation sequence O, i.e. S 003
= arg max S
P {S j O; M } .{3} Parameter estimation:Given a set O O
O = fO 1
; : : : ; O
k
g of observation sequences, de-
termine the most likely parameters 025 003
for the abstract transitions and the selection
distribution of M , i.e. 025 003
= arg max 025
P {O
O
O j 025} .431Kersting, De Raedt, & Raikos0cc
s1cc
s2cc
startccsc1cc
sc2cc
scycc
hc1cchc2cc
hcxcc
sczcc
o1cco2cc
o3cc
atomcc
stateccKristian Kerstingls{t}ls{U'}ls{o}em{F',U}em{F',o}latex{f1,t}startem{F,U}ls{U'}ls{o}ls{t}em{f1,o}em{f1,t}latex{f1,t}ls{o}ls{t}ls{o}ls{t}ls{U'}em{F',U}em{f2,o}latex{f2,t}em{F',o}em{f2,t}...abstractselectionabstractselectionabstracttransitionselectiontransitiontransitions2s0s1PSfrag
replacementsS 0S 1S
2startsc{1}sc{2}sc{Y}hc{1}hc{2}hc{X}sc{Z}O
1O
2O
2abstract statestatesFigure 3:Trellis induced by the LOHMM in Figure 1.
The sets of reachable states at time
0; 1; : : : are denoted by S
0
; S 1
; : : : In contrast with HMMs, thereis an additional lay er where the states
are sampled from abstract states.We will now address each of these problems
in turn by upgrading the existing solutions for
HMMs. This will b e realized by computing a grounded trellis as in Figure
3. The p ossible
ground successor states of any given state are computed by first selecting
the applicable abstract transitions and then applying the selection probabilities
{while taking into account
the substitutions} to ground the resulting states. This tw o-step factorization
is coalesced into one step for HMMs. To evaluate O, consider the probability
of the partial observation sequence O
1
; O 2
; : : : ; O t
and {ground} state S at time t, 0 < t 024 T , given the mo del M = {006;
026; 001; 007}
ff t
{S} := P {O
1
; O
2
; : : : ; O
t
; q t
= S j M }
where q
t = S denotes that the system is in state S at time t. As for HMMs, ff
t {S} can b e com-
puted using a dynamic programming approach. For t = 0, we set ff
0
{S} = P {q
0
= S j M } ,
i.e., ff
0
{S} is the probability of starting in state S and, for t > 0, we compute
ff
t
{S} based on ff
t0001
{S
0 }:1: S 0
:= fstartg /* initialize the set of re achable states*/
2: for t = 1; 2; : : : ; T do
3:S
t = ; /* initialize the set of re achable states at clock t*/
4:foreach S 2 S t0001
do 5:foreach maximally sp ecific p : H
O
000 B 2 001 [ 007 s.t. 033
B
= mgu{S; B} exists do6:foreach S
0
= H033 B
033
H 2 G
006
{H033
B
} s.t. O
t0001
unifies with O033 B
033
H do7:if S 0
62 S
t
then
8:S
t
:= S
t
[ fS
0
g 9:ff
t
{S
0 } := 0:0
10:ff
t
{S
0
} := ff t
{S
0
} + ff
t0001
{S} 001 p 001026{S
0
j H033
B
} 001 026{O
t0001
j O033 B
033
H }11: return P {O j M } =
P
S2S
T
ff
T
{S}432Logical Hidden Marko v Modelswhere we assume for the sake of simplicity
O 021 start for each abstract transition p : H
start 2 007. Furthermore, the b oxed parts sp ecify all the differences
to the HMM formula: unification and 026 are taken into account.
Clearly, as for HMMs P {O j M } = P
S2S
T
ff
T
{S} holds. The computational complexity
of this forward pro c e dur e is O {T 001 s 001 {jBj + o 001 g }) =
O {T 001 s
2
} where s = max
t=1;2;:::;T
jS t
j ,
o is the maximal num b er of outgoing abstract transitions with regard to
an abstract state,
and g is the maximal num b er of ground instances of an abstract state.
In a completely
analogous manner, one can devise a backwar d pro c e dur e to compute fi
t
{S} = P {O t+1
; O t+2
; : : : ; O
T
j q t
= S; M } :
This will b e useful for solving Problem {3}.
Having a forward pro cedure, it is straightforw ard to adapt the Viterbi
algorithm as a solution to Problem {2}, i.e., for computing the most likely
state sequence. Let ffi t
{S} denote the highest probability along a single path at time t which accounts
for the first t
observations and ends in state S, i.e.,
ffi
t
{S} = max
S
0
;S
1
;:::;S
t0001
P {S
0
; S
1
; : : : ; S
t0001
; S
t
= S; O
1
; : : : ; O
t0001
jM } : The pro cedure for finding the most likely state sequence basically
follows the forward pro-
cedure. Instead of summing ov er all ground transition probabilities in
line 10, we maximize
ov er them. More precisely, we pro ceed as follows:1:S
0 := fstartg /* initialize the set of re achable states*/
2: for t = 1; 2; : : : ; T do
3:S
t
= ; /* initialize the set of re achable states at clock t*/ 4:foreach S
2 S
t0001
do
5:foreach maximally sp ecific p : H
O
000 B 2 001 [ 007 s.t. 033
B
= mgu{S; B} exists do 6:foreach S
0
= H033 B
033
H 2 G
006
{H033
B
} s.t. O
t0001
unifies with O033 B
033
H do
7:if S 0
62 S
t
then
8:S
t
:= S
t
[ fS
0
g 9:ffi
t
{S; S 0
} := 0:0
10:ffi
t {S; S
0
} := ffi
t
{S; S 0
} + ffi t0001
{S} 001 p 001 026{S
0
j H033
B
} 001 026{O
t0001
j O033 B
033
H }
11:foreach S
0 2 S
t
do
12:ffi
t
{S
0
} = max S2S
t0001
ffi
t
{S; S
0 }
13:
t
{S
0
} = arg max
S2S
t0001
t
{S; S 0
}
Here, ffi
t
{S; S
0
} stores the probability of making a transition from S to S
0 and
t
{S
0
} {with
1
{S } = start for all states S } keeps track of the state maximizing the
probability along
a single path at time t which accounts for the first t observations and
ends in state S
0
. The most likely hidden state sequence S
003
can now b e computed as
S
003
T +1 = arg max
S2S
T +1
ffi
T +1
{S}
and S 003
t
=
t
{S 003
t+1
} for t = T ; T 000 1; : : : ; 1 :
One can also consider problem {2} on a more abstract level. Instead of considering
all contributions of different abstract transitions T to a single ground
transition from state S433Kersting, De Raedt, & Raikoto state S
0
in line 10, one might also consider the most likely abstract transition
only. This
is realized by replacing line 10 in the forward pro cedure with
ff t
{S
0 } := max{ff
t
{S
0 }; ff
t0001
{S} 001 p 001 026{S
0
j H033
B
} 001 026{O
t0001
j O033
B
033
H
}) : This solves the problem of finding the {2
0
} most likely state and abstract transition
sequence:Determine the sequence of states and abstract transitions GT
003 =
S
0 ; T
0
; S
1
; T 1
; S
2
; : : : ; S T
; T
T
; S
T+1
where there exists substitutions 022
i
with S i+1
S
i
021 T i
022
i that has most likely pro duced the observation sequence O, i.e.
GT
003
= arg max GT
P {GT j O; M } : Thus, logical hidden Marko v mo dels also p ose new typ
es of inference problems. For parameter estimation, we hav e to estimate
the maximum likeliho o d transition
probabilities and selection distributions. To estimate the former, we upgrade
the well-kno wn Baum-Welch algorithm {Baum, 1972} for estimating the maximum
likeliho o d parameters
of HMMs and probabilistic context-free grammars.
For HMMs, the Baum-Welch algorithm computes the improv ed estimatep of the
tran-
sition probability of some {ground} transition T 021 p : H
O 000 B by taking the ratiop =
030 {T}P H
0
O 0
000B2001[007
030 {T
0
} {3}
b etw een the exp ected num b er 030 {T} of times of making the transitions
T at any time given
the mo del M and an observation sequence O, and the total num b er of times
a transitions
is made from B at any time given M and O.
Basically the same applies when T is an abstract transition. How ev er,
we hav e to b e a little bit more careful b ecause we hav e no direct access
to 030 {T}. Let 030
t
{gcl; T} b e the
probability of following the abstract transition T via its ground instance
gcl 021 p : GH GO
000 000 GB
at time t, i.e., 030
t
{gcl; T} =
ff t
{GB} 001 p 001 fi
t+1
{GH}P {O j M }
001026{GH j H033
B
} 001 026{O
t0001 j O033
B 033
H
}; {4}
where 033
B
; 033
H are as in the forward pro cedure {see ab ov e} and P {O j M } is the probability
that the mo del generated the sequence O. Again, the b oxed terms constitute
the main
difference to the corresp onding HMM formula. In order to apply Equation
{3} to compute
improv ed estimates of probabilities asso ciated with abstract transitions,
we set 030 {T} = T
X
t=1
030
t {T} =
T X
t=1
X
gcl
030 t
{gcl; T}
where the inner sum runs ov er all ground instances of T.
This leads to the following re-estimation metho d, where we assume that
the sets S i
of
reachable states are reused from the computations of the ff- and fi -values:434Logical
Hidden Marko v Models1: /* initialization of expe cte d counts */
2: foreach T 2 001 [ 007 do
3:030 {T} := m /* or 0 if not using pseudoc ounts */ 4: /* compute expe
cte d counts */
5: for t = 0; 1; : : : ; T do
6:foreach S 2 S
t
do
7:foreach max. sp ecific T 021 p : H
O
000 B 2 001 [ 007 s.t. 033 B
= mgu{S; B} exists do8:foreach S
0
= H033
B
033 H
2 G
006
{H033
B
} s.t. S 0
2 S
t+1
^ mgu{O
t
; O033
B
033 H
} exists do9:030 {T} := 030 {T} + ff t
{S} 001 p 001 fi
t+1 {S
0
}
ffi
P {O j M }001026{S
0
j H033
B } 001 026{O t0001
j O033
B
033 H
}Here, equation {4} can b e found in line 9. In line 3, we set pseudo counts
as small sample-
size regularizers. Other metho ds to av oid a biased underestimate of probabilities
and even
zero probabilities such as m-estimates {see e.g., Mitchell, 1997} can b
e easily adapted.
To estimate the selection probabilities, recall that 026 follows a naª020ve
Bay es scheme. There-
fore, the estimated probability for a domain element d 2 D for some domain
D is the ratio
b etw een the num b er of times d is selected and the num b er of times
any d
0
2 D is selected.
The pro cedure for computing the 030 -values can thus b e reused. Altogether,
the Baum-Welch algorithm works as follows: While not conv erged, {1} es-
timatethe abstract transition probabilities, and {2} the selection probabilities.
Since it is
an instance of the EM algorithm, it increases the likeliho o d of the data
with every up date,
and according to McLachlan and Krishnan {1997}, it is guaranteed to reach
a stationary
p oint. All standard techniques to ov ercome limitations of EM algorithms
are applicable.
The computational complexity {p er iteration} is O {k 001 {ff + d}) =
O {k 001 T 001 s 2
+ k 001 d} where
k is the num b er of sequences, ff is the complexity of computing the ff-values
{see ab ov e},
and d is the sum ov er the sizes of domains asso ciated to predicates. Recently,
Kersting
andRaiko {2005} combined the Baum-Welch algorithm with structure search
for mo del
selection of logical hidden Marko v mo dels using inductive logic pro gr
amming {Muggleton &De Raedt, 1994} refinement op erators. The refinement
op erators account for different
abstraction levels which hav e to b e explored.
5. Advantages of LOHMMs
In this section, we will inv estigate the b enefits of LOHMMs: {1} LOHMMs
are strictly
more expressive than HMMs, and {2}, using abstraction, logical variables
and unification
can b e b eneficial. More sp ecifically, with {2}, we will show that{B1}LOHMMs
can b e - by design - smaller than their prop ositional instantiations,
and{B2}unification can yield b etter log-likeliho o d estimates.
5.1 On the Expressivity of LOHMMs
Whereas HMMs sp ecify probability distributions ov er regular languages,
LOHMMs sp ecify
probability distributions ov er more expressive languages.435Kersting, De
Raedt, & RaikoTheorem 2For any {consistent} prob abilistic context-fr e e
grammar {PCFG} G for some
language L there exists a LOHMM M s.t. P
G {w } = P
M
{w } for al l w 2 L.
The pro of {see App endix C} makes use of abstract states of unb ounded
'depth'. More
precisely, functors are used to implement a stack. Without functors, LOHMMs
cannot
enco de PCFGs and, b ecause the Herbrand base is finite, it can b e prov
en that there alwa ys
exists an equivalent HMM. Furthermore, if functors are allow ed, LOHMMs
are strictly more expressive than PCFGs.
They can sp ecify probability distributions ov er some languages that are
context-sensitiv e:
1:0 : stack{s{0}; s{0}) start
0:8 : stack{s{X}; s{X})
a 000 stack{X; X}
0:2 : unstack{s{X}; s{X}) a
000 stack{X; X}
1:0 : unstack{X; Y}
b
000 unstack{s{X}; Y} 1:0 : unstack{s{0}; Y} c
000 unstack{s{0}; s{Y})
1:0 : end
end
000 000 unstack{s{0}; s{0})
The LOHMM defines a distribution ov er fa
n b
n
c n
j n > 0g.
Finally, the use of logical variables also enables one to deal with identifiers.
Identifiers are sp ecial typ es of constants that denote ob jects. Indeed,
recall the UNIX command sequence emacs lohmms:tex; ls; latex lohmms:tex;
: : : from the intro duction. The filename
lohmms:tex is an identifier. Usually, the sp ecific identifiers do not matter
but rather the
fact that the same ob ject o ccurs multiple times in the sequence. LOHMMs
can easily deal with identifiers by setting the selection probability 026
to a constant for the arguments in
which identifiers can o ccur. Unification then takes care of the necessary
variable bindings.
5.2 Benefits of Abstraction through Variables and Unification
Reconsider the domain of UNIX command sequences. Unix users oftenly reuse
a newly cre- ated directory in subsequent commands such as in mkdir{vt100x};
cd{vt100x}, ls{vt100x} .
Unification should allow us to elegantly employ this information b ecause
it allows us to sp ec-
ify that, after observing the created directory, the mo del makes a transition
into a state
where the newly created directory is used: p
1
: cd{Dir; mkdir} mkdir{Dir; com} andp
2
: cd{; mkdir} mkdir{Dir; com}
If the first transition is follow ed, the cd command will mov e to the newly
created directory; if the second transition is follow ed, it is not sp ecified
which directory cd will mov e to. Thus,
the LOHMM captures the reuse of created directories as an argument of future
commands. Moreov er, the LOHMM enco des the simplest p ossible case to show
the b enefits of unifica- tion. At any time, the observation sequence uniquely
determines the state sequence, and
functors are not used. Therefore, we left out the abstract output symb ols
asso ciated with
abstract transitions. In total, the LOHMM U , mo delling the reuse of directories,
consists
of 542 parameters only but still cov ers more than 451000 {ground} states,
see App endix D for the complete mo del. The compression in the num b er
of parameters supp orts {B1}.
To empirically inv estigate the b enefits of unification, we compare U with
the variant N
of U where no variables are shared, i.e., no unification is used such that
for instance the436Logical Hidden Marko v Modelsfirst transition ab ov e
is not allow ed, see App endix D. N has 164 parameters less than U .
We computed the following zero-one win function f {O} = {
1 if
002
log P
U {O} 000 log P
N
{O}
003
> 0 0 otherwise
leav e-one-out cross-validated on Unix shell logs collected by Greenb erg
{1988}. Overall,
the data consists of 168 users of four groups: computer scientists, nonprogrammers,
novices
and others. Ab out 300000 commands hav e b een logged with an av erage of
110 sessions p er user. We present here results for a subset of the data.
We considered all computer
scientist sessions in which at least a single mkdir command app ears. These
yield 283 logical sequences ov er in total 3286 ground atoms. The LOO win
was 81:63045. Other LOO statistics are also in fav or of U :trainingtestlog
P {O
O
O}log
P
U
{O O
O}P
N
{ O
O
O}log P {O}log P
U
{O}P N
{O}U00011361:01795:300042:87:91N00013157:000050:7Thus, although U has
164 parameters more than N , it shows a b etter generalization p er-
formance. This result supp orts {B2}. A pattern often found in U was
1
0:15 : cd{Dir; mkdir} mkdir{Dir; com} and0:08 : cd{; mkdir} mkdir{Dir;
com}
fav oring changing to the directory just made. This knowledge cannot b e
captured in N
0:25 : cd{; mkdir} mkdir{Dir; com}: The results clearly show that abstraction
through variables and unification can b e b eneficial for some applications,
i.e., {B1} and {B2} hold.
6. Real World Applications
Our inten tions here are to inv estigate whether LOHMMs can b e applied
to real world
domains. More precisely, we will inv estigate whether b enefits {B1} and
{B2} can also b e
exploited in real world application domains. Additionally, we will inv estigate
whether{B3}LOHMMs are comp etitive with ILP algorithms that can also utilize
unification and
abstraction through variables, and{B4}LOHMMs can handle tree-structured
data similar to PCFGs. To this aim, we conducted exp eriments on tw o bioinformatics
application domains: protein
fold recognition {Kersting, Raiko, Kramer, & De Raedt, 2003} and mRNA signal
structure
detection {Horvath, Wrob el, & Bohneb eck, 2001}. Both application domains
are multiclass
problems with five different classes each.1. The sum of probabilities is
not the same {0:15 + 0:08 = 0:23 6= 0:25} b ecause of the use of pseudo counts
and b ecause of the subliminal non-determinism {w.r.t. abstract states}
in U , i.e., in case that the first transition fires, the second one also
fires.437Kersting, De Raedt, & Raiko6.1 Metho dology
In order to tackle the multiclass problem with LOHMMs, we follow ed a plug-in
estimate
approach. Let fc 1
; c
2 ; : : : ; c
k g b e the set of p ossible classes. Given a finite set of training
examples f{x
i
; y i
}g
n i=1
022 X 002 fc
1
; c 2
; : : : ; c n
g, one tries to find f : X ! fc 1
; c
2 ; : : : ; c
k g
f {x} = arg max
c2fc 1
;c
2 ;:::;c
k
g
P {x j M ; 025
003
c
} 001 P {c} : {5} with low approximation error on the training data as
well as on unseen examples. In
Equation {5}, M denotes the mo del structure which is the same for all classes,
025
003 c
denotes
the maximum likeliho o d parameters of M for class c estimated on the training
examples
with y
i
= c only, and P {c} is the prior class distribution.
We implemented the Baum-Welch algorithm {with pseudo counts m, see line
3} for maxi-
mum likeliho o d parameter estimation using the Prolog system Yap-4:4:4.
In all exp eriments,
w e set m = 1 and let the Baum-Welch algorithm stop if the change in log-likeliho
o d was
less than 0:1 from one iteration to the next. The exp eriments were ran
on a Pen tium-IV
3:2 GHz Linux machine. 6.2 Protein Fold Recognition Protein fold recognition
is concerned with how proteins fold in nature, i.e., their three-
dimensional structures. This is an imp ortant problem as the biological
functions of proteins
dep end on the wa y they fold. A common approach is to use database searches
to find pro-
teins {of known fold} similar to a newly discov ered protein {of unknown
fold}. To facilitate protein fold recognition, several exp ert-based classification
schemes of proteins hav e b een
develop ed that group the current set of known protein structures according
to the similarity of their folds. For instance, the structural classification
of proteins {Hubbard, Murzin, Bren-
ner, & Chotia, 1997} {SCOP} database hierarchically organizes proteins according
to their structures and evolutionary origin. From a machine learning p ersp
ective, SCOP induces a classification problem: given a protein of unknown
fold, assign it to the b est matching group
of the classification scheme. This protein fold classification problem has
b een inv estigated
b y Turcotte, Muggleton, and Sternb erg {2001} based on the inductive logic
programming
{ILP} system PROGOL and by Kersting et al. {2003} based on LOHMMs.
The secondary structure of protein domains 2
can elegantly b e represented as logical se- quences. For example, the secondary
structure of the Rib osomal protein L4 is represented as
st{null; 2}; he{right; alpha; 6} ; st{plus; 2}; he{right; alpha; 4} ; st{plus;
2};
he{right; alpha; 4} ; st{plus; 3}; he{right; alpha; 4}; st{plus; 1}; he{hright;
alpha; 6}
Helices of a certain typ e, orientation and length he{HelixType ; HelixOrientation
; Length },
and strands of a certain orientation and length st{StrandOrientation; Length
} are atoms ov er
logical predicates. The application of traditional HMMs to such sequences
requires one to either ignore the structure of helices and strands, which
results in a loss of information, or to
take all p ossible combinations {of arguments such as orientation and length}
into account,
whic h leads to a combinatorial explosion in the num b er of parameters2.
A domain can b e viewed as a sub-section of a protein which app ears in a
num b er of distantly related proteins and which can fold indep endently
of the rest of the protein.438Logical Hidden Marko v ModelsKristian Kerstingblock{B;
0}block{s{B}; 0}block{B; s{s{s{0})})block{B; s{P})block{B; P}block{s{B};
P}block{s{B}; s{P})Blo ck B of length 3Blo ck s{B} of length 2block{s{B};
s{0})Transition to next blo ckDynamics within blo ckDynamics within blo
ckTransition to next blo ckendFigure 4: Scheme of a left-to-right LOHMM blo
ck mo del.The results rep orted by Kersting et al. {2003} indicate that LOHMMs
are well-suited for protein fold classification: the num b er of parameters
of a LOHMM can by an order of
magnitude b e smaller than the num b er of a corresp onding HMM {120 versus
approximately 62000} and the generalization p erformance, a 74045 accuracy,
is comparable to Turcotte
et al.'s {2001} result based on the ILP system Progol, a 75045 accuracy.
Kersting et al.
{2003}, how ev er, do not cross-validate their results nor inv estigate
{ as it is common in
bioinformatics { the impact of primary sequence similarity on the classification
accuracy. For
instance, the tw o most commonly requested ASTRAL subsets are the subset
of sequences
with less than 95045 identit y to each other {95 cut} and with less than
40045 identit y to each
other {40 cut}. Motivated by this, we conducted the following new exp eriments.
The data consists of logical sequences of the secondary structure of protein
domains. As
in the work of Kersting et al. {2003}, the task is to predict one of the
five most p opulated SCOP folds of alpha and b eta proteins {a/b}: TIM b
eta/alpha-barrel {fold 1}, NAD{P}-
binding Rossmann-fold domains {fold 2}, Rib osomal protein L4 {fold 23},
Cysteine hydrolase
{fold 37}, and Phosphotyrosine protein phosphatases I-like {fold 55}. The
class of a/b
proteins consists of proteins with mainly parallel b eta sheets {b eta-alpha-b
eta units}. The
data hav e b een extracted automatically from the ASTRAL dataset version
1:65 {Chandonia, Hon, Walker, Lo Conte, P.Ko ehl, & Brenner, 2004} for the
95 cut and for the 40 cut. As
in the work of Kersting et al. {2003}, we consider strands and helices only,
i.e., coils and
isolated strands are discarded. For the 95 cut, this yields 816 logical
sequences consisting ofin total 22210 ground atoms. The num b er of sequences
in the classes are listed as 293,
151, 87, 195, and 90. For the 40 cut, this yields 523 logical sequences
consisting of in total
14986 ground atoms. The num b er of sequences in the classes are listed
as 182, 100, 66, 122, and 53.
LOHMM structure: The used LOHMM structure follows a left-to-right blo ck
top ology,
see Figure 4, to mo del blo cks of consecutive helices {resp. strands}.
Being in a Block of
some size s, say 3, the mo del will remain in the same blo ck for s = 3
time steps. A similar
idea has b een used to mo del haplotyp es {Koivisto, Perola, Varilo, Hennah,
Ekelund, Lukk,
Peltonen, Ukkonen, & Mannila, 2002; Koivisto, Kivio ja, Mannila, Rastas,
& Ukkonen, 2004}. In contrast to common HMM blo ck mo dels {Won, Pr ªugel-Bennett,
& Krogh, 2004},439Kersting, De Raedt, & Raikothe transition parameters are
shared within each blo ck and one can ensure that the mo del
makes a transition to the next state s{Block } only at the end of a blo
ck; in our example
after exactly 3 intra-blo ck transitions. Furthermore, there are sp ecific
abstract transitions
for all helix typ es and strand orientations to mo del the priori distribution,
the intra- and
the inter-blo ck transitions. The num b er of blo cks and their sizes were
chosen according
to the empirical distribution ov er sequence lengths in the data so that
the b eginning and the ending of protein domains was likely captured in detail.
This yield the following blo ck
structureKristian Kersting12414647616276771920272840......where the num
b ers denote the p ositions within protein domains. Furthermore, note that
the last blo ck gathers all remaining transitions. The blo cks themselves
are mo delled using hidden abstract states ov er hc{HelixType ; HelixOrientation
; Length ; Block } and sc{StrandOrientation ; Length ; Block } :
Here, Length denotes the num b er of consecutive bases the structure element
consists of. The length was discretized into 10 bins such that the original
lengths were uniformally
distributed. In total, the LOHMM has 295 parameters. The corresp onding
HMM without parameter sharing has more than 65200 parameters. This clearly
confirms {B1}.
Results: We p erformed a 10-fold cross-validation. On the 95 cut dataset,
the accuracy was
76045 and to ok approx. 25 minutes p er cross-validation iteration; on
the 40 cut, the accuracy
was 73045 and to ok approx. 12 minutes p er cross-validation iteration.
The results validate
Kersting et al.'s {2003} results and, in turn, clearly show that {B3} holds.
Moreov er, the nov el results on the 40 cut dataset indicate that the similarities
detected by the LOHMMs
b etw een the protein domain structures were not accompanied by high sequence
similarity.
6.3 mRNA Signal Structure Detection
mRNA sequences consist of bases {guanine, adenine, uracil, cytosine} and
fold intramolec-
ularly to form a num b er of short base-paired stems {Durbin, Eddy, Krogh,
& Mitchison, 1998}. This base-paired structure is called the sec ondary structure,
cf. Figures 5 and 6. The secondary structure contains sp ecial subsequences
called signal structures that are resp onsi-
ble for sp ecial biological functions, such as RNA-protein interactions
and cellular transp ort.
The function of each signal structure class is based on the common characteristic
binding
site of all class elements. The elements are not necessarily identical but
very similar. They
can vary in top ology {tree structure}, in size {num b er of constituting
bases}, and in base
sequence.
The goal of our exp eriments was to recognize instances of signal structures
classes in
mRNA molecules. The first application of relational learning to recognize
the signal struc-
ture class of mRNA molecules was describ ed in the works of Bohneb eck,
Horvath, and Wrob el {1998} and of Horvath et al. {2001}, where the relational
instance-based learner
RIBL was applied. The dataset 3
we used was similar to the one describ ed by Horvath3. The dataset is not
the same as describ ed in the work by Horvath et al. {2001} b ecause we
could not obtain
the original dataset. We will compare to the smaller data set used by Horvath
et al., which consisted of440Logical Hidden Marko v Modelset al. {2001}.
It consisted of 93 mRNA secondary structure sequences. More precisely, it
was
comp osed of 15 and 5 SECIS {Seleno cysteine Insertion Sequence}, 27 IRE
{Iron Resp onsive
Elemen t}, 36 TAR {Trans Activating Region} and 10 histone stem lo ops constituting
five
classes. The secondary structure is comp osed of different building blo
cks such as stacking region,
hairpin lo ops, interior lo ops etc. In contrast to the secondary structure
of proteins that forms
chains, the secondary structure of mRNA forms a tree. As trees can not easily
b e handled
using HMMs, mRNA secondary structure data is more challenging than that
of proteins.
Moreov er, Horvath et al. {2001} rep ort that making the tree structure
available to RIBL
as background knowledge had an influence on the classification accuracy.
More precisely,
using a simple chain representation RIBL achiev ed a 77:2045 leav e-one-out
cross-validation
{LOO} accuracy whereas using the tree structure as background knowledge
RIBL achiev ed
a 95:4045 LOO accuracy.
W e follow ed Horvath et al.'s exp erimental setup, that is, we adapted
their data repre-
sentations to LOHMMs and compared a chain mo del with a tree mo del.
Chain Representation: In the chain representation {see also Figure 5},
signal structures are describ ed by single{TypeSingle ; Position ; Acid
} or
helical{TypeHelic al ; Position ; Acid ; Acid }. Dep ending on its typ e,
a structure el-
ement is represented by either single=3 or helical=4. Their first argument
TypeSingle {resp. TypeHelic al } sp ecifies the typ e of the structure element,
i.e.,
single; bulge3; bulge5; hairpin {resp. stem}. The argument Position is the
p osi- tion of the sequence element within the corresp onding structure element
counted down,
i.e.
4
, fn
13
{0}; n
12
{0}; : : : ; n
1
{0}g. The maximal p osition was set to 13 as this was the maximal p osition
observed in the data. The last argument enco des the observed nucleotide
{pair}.
The used LOHMM structure follows again the left-to-right blo ck structure
shown in
Figure 4. Its underlying idea is to mo del blo cks of consecutive helical
structure ele-
ments. The hidden states are mo delled using single{TypeSingle ; Position
; Acid ; Block }
and helical{TypeHelic al ; Position ; Acid ; Acid ; Block }. Being in a
Block of consecutive he-
lical {resp. single} structure elements, the mo del will remain in the Block
or transition to a single element. The transition to a single {resp. helical}
element only o ccurs at Position
n{0}. At all other p ositions n{Position }, there were transitions from
helical {resp. single}
structure elements to helical {resp. single} structure elements at Position
capturing the dy- namics of the nucleotide pairs {resp. nucleotides} within
structure elements. For instance,66 signal structures and is very close to
our data set. On a larger data set {with 400 structures} Horvath
et al. rep ort an error rate of 3:8045 .
4. n m
{0} is shorthand for the recursive application of the functor n on 0 m times,
i.e., for p osition m.441Kersting, De Raedt, & Raiko1.0rr
1.1rr
1.2rr
1.3rr1.4rr
1.5rr
1.6rr
2.0rr3.0rr
3.1rr
4.0rr
4.1rr4.2rr
5.0rr
5.1rr
5.2rr6.0ll
6.1ll
6.2ll
7.0llKristian Kerstingauuacauauguaaaccggcgcgcgauuaagaa 4.06.26.07.01.11.21.31.41.51.61.02.03.13.04.14.25.05.15.26.1PSfrag
replacementshelical{stem; n{n{n{n{n{n{n{0})})})}; a; u}:helical{stem;
n{n{n{n{n{n{0})})}); u; a}:helical{stem; n{n{n{n{n{0})})}; c; a}:helical{stem;
n{n{n{n{0})}); u; a}:helical{stem; n{n{n{0})}; u; g}:helical{stem; n{n{0});
u; a}:helical{stem; n{0}; a; a}:single{bulge5; n{0}; a}:helical{stem; n{n{0});
c; g}:helical{stem; n{0}; c; g}:single{bulge5; n{n{n{0})}; g}:single{bulge5;
n{n{0}); a}:single{bulge5; n{0}; a}:helical{stem; n{n{n{0})}; c; g}:helical{stem;
n{n{0}); c; g}:helical{stem; n{0}; c; g}:single{hairpin; n{n{n{0})}; a}:single{hairpin;
n{n{0}); u}:single{hairpin; n{0}; u}:single{bulge3; n{0}; a}:Figure 5:The
chain representation of a SECIS signal structure. The ground atoms are
ordered clo ckwise starting with helical{stem; n{n{n{n{n{n{n{0})})})};
a; u} at the
low er left-hand side corner.the transitions for blo ck n{0} at p osition
n{n{0}) were
a : he{stem; n{0}; X; Y; n{0}) p
a
:he{stem;n{0};X;Y}
000000 000 000 000 000 000 000 000 000 000000 he{stem; n{n{0});
X; Y; n{0})}
b : he{stem; n{0}; Y; X; n{0})
p
b
:he{stem;n{0};X;Y}
000000 000 000 000 000 000 000 000 000 000000 he{stem; n{n{0});
X; Y; n{0})} c : he{stem; n{0}; X;; n{0})
p
c :he{stem;n{0};X;Y} 000000 000 000 000 000 000 000 000 000 000000
he{stem; n{n{0}); X; Y; n{0})}
d : he{stem; n{0};; Y; n{0})
p
d
:he{stem;n{0};X;Y}
000000 000 000 000 000 000 000 000 000 000000 he{stem; n{n{0});
X; Y; n{0})}
e : he{stem; n{0};;; n{0})
p e
:he{stem;n{0};X;Y}
000000 000 000 000 000 000 000 000 000 000000 he{stem; n{n{0});
X; Y; n{0})} In total, there were 5 p ossible blo cks as this was the maximal
num b er of blo cks of consecutive
helical structure elements observed in the data. Overall, the LOHMM has
702 parameters. In contrast, the corresp onding HMM has more than 16600 transitions
validating {B1}.
Results: The LOO test log-likeliho o d was 00063:7, and an EM iteration
to ok on av erage
26 seconds.
Without the unification-based transitions b-d, i.e., using only the abstract
transitions
a : he{stem; n{0}; X; Y; n{0})
p
a
:he{stem;n{0};X;Y} 000000 000 000 000 000 000 000 000 000 000000
he{stem; n{n{0}); X; Y; n{0})}
e : he{stem; n{0};;; n{0}) p
e
:he{stem;n{0};X;Y}
000000 000 000 000 000 000 000 000 000 000000 he{stem; n{n{0});
X; Y; n{0})} ;
the mo del has 506 parameters. The LOO test log-likeliho o d was 00064:21,
and an EM iter-
ation to ok on av erage 20 seconds. The difference in LOO test log-likeliho
o d is statistically
significant {paired t-test, p = 0:01}.
Omitting even transition a, the LOO test log-likeliho o d dropp ed to 00066:06,
and the
av erage time p er EM iteration was 18 seconds. The mo del has 341 parameters.
The
difference in av erage LOO log-likeliho o d is statistically significant
{paired t-test, p = 0:001}.
The results clearly show that unification can yield b etter LOO test log-likeliho
o ds, i.e., {B2} holds.442Logical Hidden Marko v Models0rr
1.0rr
1.1rr
1.2rr1.3rr
1.4rr
1.5rr
1.6rr1.7rr
2.0rr
2.1rr
3.0rr3.1rr
3.2rr
4.0rr
4.1rr4.2rr
4.3rr
5.0rr
5.1rr5.2rr
5.3rr
6.0ll
6.1ll6.2ll
6.3ll
6.4ll
7.0ll7.1ll
7.2ll
T0cc
T1ccT2cc
T3cc
T4rr
T5ccT6cc
T7ll
Kristian Kerstingauuacauauguaaa5.05.15.25.34.14.24.34.03.13.23.02.02.11.11.21.31.41.51.61.71.06.46.36.26.06.17.07.27.1T0T1T5T2T3T6T7T4ccggcgcgcgauuaagaa
0PSfrag replacementsroot{0; root; [c]}:helical{s{0}; 0; [c; c]; stem; n{n{n{n{n{n{n{0})})})}):nucleotidepair{(a;
u}):nucleotidepair{(u; a}):nucleotidepair{(c; a}):nucleotidepair{(u;
a}):nucleotidepair{(u; g}):nucleotidepair{(u; a}):nucleotidepair{(a;
a}):single{s{s{0}); s{0}; []; bulge5; n{0}):nucleotide{a}:helical{s{s{s{0})};
s{0}; [c; c; c]; stem; n{n{0})}:nucleotidepair{(c; g}):nucleotidepair{(c;
g}):single{s{s{s{s{0})}); s{s{s{0})}; []; bulge5; n{n{n{0})}):nucleotide{g}:nucleotide{a}:nucleotide{a}:helical{s{s{s{s{s{0})})};
s{s{s{0})}; [c]; stem; n{n{n{0})}):nucleotidepair{(c; g}):nucleotidepair{(c;
g}):nucleotidepair{(c; g}):single{s{s{s{s{s{s{0})})}); s{s{s{s{s{0})})};[];
hairpin; n{n{n{0})}):nucleotide{a}:nucleotide{u}:nucleotide{u}:single{s{s{s{s{s{s{s{0})})})};s{s{s{0})};[];
bulge3; n{0}):nucleotide{a}:0s{0}s{s{0})s{s{s{0})}s{s{s{s{0})})s{s{s{s{s{0})})s{s{s{s{s{s{0})})})s{s{s{s{s{s{s{0})})})}Figure
6:The tre e representation of a SECIS signal structure. {a} The logical sequence,
i.e., the sequence of ground atoms representing the SECIS signal structure.
The ground atoms are ordered clo ckwise starting with root{0; root; [c]}
in the low er
left-hand side corner. {b} The tree formed by the secondary structure elements.Tree
Representation: In the tre e representation {see Figure 6 {a}}, the idea
is to capture the tree structure formed by the secondary structure elements,
see Figure 6 {b}. Each
training instance is describ ed as a sequence of ground facts ov er
root{0; root; #Children };
helical{ID ; ParentID ; #Children ; Type ; Size };
nucleotidepair{BasePair };
single{ID ; ParentID ; #Children ; Type ; Size };
nucleotide{Base } : Here, ID and ParentID are natural num b ers 0; s{0};
s{s{0}); : : : enco ding the child-
paren t relation, #Children denotes the num b er
5
of children []; [c]; [c; c]; : : :, Type is the typ e of the structure element
such as stem; hairpin; : : :, and Size is a natural num b er
0; n{0}; n{n{0}); : : : Atoms root{0; root; #Children } are used to ro
ot the top ology. The
maximal #Children was 9 and the maximal Size was 13 as this was the maximal
value
observed in the data. As trees can not easily b e handled using HMMs, we
used a LOHMM which basically
enco des a PCFG. Due to Theorem 2, this is p ossible. The used LOHMM structure
can b e
found in App endix E. It pro cesses the mRNA trees in in-order. Unification
is only used for parsing the tree. As for the chain representation, we used
a Position argument in the hidden
states to enco de the dynamics of nucleotides {nucleotide pairs} within
secondary structure5. Here, we use the Prolog short hand notation [001]
for lists. A list either is the constant [] representing the empty list,
oris a comp ound term with functor :=2 and tw o arguments, which are resp
ectively the head and tail of the list. Thus [a; b; c] is the comp ound term
:{a; :{b; :{c; []})}.443Kersting, De Raedt, & Raikoelements. The maximal
Position was again 13. In contrast to the chain representation,
n ucleotide pairs such as {a; u} are treated as constants. Thus, the argument
BasePair consists of 16 elements. Results: The LOO test log-likeliho o d
was 00055:56. Thus, exploiting the tree structure
yields b etter probabilistic mo dels. On av erage, an EM iteration to ok
14 seconds. Overall,
the result shows that {B4} holds.
Although the Baum-Welch algorithm attempts to maximize a different ob jective
func-
tion, namely the likeliho o d of the data, it is interesting to compare
LOHMMs and RIBL in
terms of classification accuracy.
Classification Accuracy: On the chain representation, the LOO accuracies
of all
LOHMMs were 99045 {92=93}. This is a considerable improv emen t on RIBL's
77:2045 {51=66}
LOO accuracy for this representation. On the tre e representation, the LOHMM
also achiev ed a LOO accuracy of 99045 {92=93}. This is comparable to RIBL's
LOO accuracy of
97045 {64=66} on this kind of representation. Th us, already the chain
LOHMMs show marked increases in LOO accuracy when com-
pared to RIBL {Horvath et al., 2001}. In order to achiev e similar LOO
accuracies, Horvath
et al. {2001} had to make the tree structure available to RIBL as background
knowledge.
For LOHMMs, this had a significant influence on the LOO test log-likeliho
o d, but not on
the LOO accuracies. This clearly supp orts {B3}. Moreov er, according to
Horvath et al., the mRNA application can also b e considered a success in
terms of the application domain,
although this was not the primary goal of our exp eriments. There exist
also alternative
parameter estimation techniques and other mo dels, such as covariance mo
dels {Eddy &
Durbin, 1994} or pair hidden Marko v mo dels {Sakakibara, 2003}, that might
hav e b een
used as well as a basis for comparison. How ev er, as LOHMMs employ {inductive}
logic pro- gramming principles, it is appropriate to compare with other systems
within this paradigm such as RIBL.
7. Related Work
LOHMMs combine tw o different research directions. On the one hand, theyare
related to several extensions of HMMs and probabilistic grammars. On the
other hand, they are also
related to the recent interest in combining inductive logic programming
principles with probability theory {De Raedt & Kersting, 2003, 2004}.
In the first typ e of approaches, the underlying idea is to upgr ade HMMs
and probabilistic
grammars to represent more structured state spaces. Hierarchical HMMs {Fine,
Singer, & Tishb y, 1998}, factorial HMMs {Ghahramani &
Jordan, 1997}, and HMMs based on tree automata {Frasconi, So da, &Vullo,
2002} decom-
p ose the state variables into smaller units. In hierarchical HMMs states
themselves can b e
HMMs, in factorial HMMs they can b e factored into k state variables which
dep end on one another only through the observation, and in tree based HMMs
the represented probability
distributions are defined ov er tree structures. The key difference with
LOHMMs is that
these approaches do not employ the logical concept of unification. Unification
is essential444Logical Hidden Marko v Modelsb ecause it allows us to intro
duce abstract transitions, which do not consist of more detailed
states. As our exp erimental evidence shows, sharing information among abstract
states by
means of unification can lead to more accurate mo del estimation. The same
holds for re-
lational Markov models {RMMs} {Anderson, Domingos, &Weld, 2002} to which
LOHMMs
are most closely related. In RMMs, states can b e of different typ es, with
each typ e describ ed
by a different set of variables. The domain of each variable can b e hierarchically
structured. The main differences b etw een LOHMMs and RMMs are that RMMs
do not either supp ort
variable binding nor unification nor hidden states. The equivalent of HMMs
for context-free languages are prob abilistic context-fr e e gram-
mars {PCFGs}. Like HMMs, they do not consider sequences of logical atoms
and do not
employ unification. Nevertheless, there is a formal resemblance b etw een
the Baum-Welch
algorithms for LOHMMs and for PCFGs. In case that a LOHMM enco des a PCFG
b oth
algorithms are identical from a theoretical p oint of view. They re-estimate
the parameters
as the ratio of the exp ected num b er of times a transition {resp. pro
duction} is used and the exp ected num b er of times a transition {resp.
pro duction} might hav e b een used. The pro of of Theorem 2 assumes that
the PCFG is given in Greibach normal form
6
{GNF} and uses a
pushdown automaton to parse sentences. For grammars in GNF, pushdown automata
are
common for parsing. In contrast, the actual computations of the Baum-Welch
algorithm
for PCFGs, the so called Inside-Outside algorithm {Baker, 1979; Lari & Young,
1990}, is
usually formulated for grammars in Chomsky normal form 7
. The Inside-Outside algorithm
can make use of the efficient CYK algorithm {Hop croft & Ullman, 1979} for
parsing strings. An alternative to learning PCFGs from strings only is to
learn from more structured data
such as skeletons, which are derivation trees with the nonterminal no des
remov ed {Levy & Joshi, 1978}. Skeletons are exactly the set of trees accepted
by skeletal tre e automata {STA}.
Informally , anSTA, when given a tree as input, pro cesses the tree b ottom
up, assigning a
state to each no de based on the states of that no de's children. The STA
accepts a tree iff
it assigns a final state to the ro ot of the tree. Due to this automata-based
characterization
of the skeletons of derivation trees, the learning problem of {P}CFGs can
b e reduced to
the problem of an STA. In particular, STA techniques hav e b een adapted
to learning tree grammars and {P}CFGs {Sakakibara, 1992; Sakakibara et al.,
1994} efficiently.
PCFGs hav e b een extended in several wa ys. Most closely related to LOHMMs
are
unification-b ase d grammars which hav e b een extensively studied in computational
linguis-
tics. Examples are {sto chastic} attribute-value grammars {Abney, 1997},
probabilistic fea-
ture grammars {Go o dman, 1997}, head-driven phrase structure grammars {Pollard
& Sag,
1994}, and lexical-functional grammars {Bresnan, 2001}. For learning within
such frame-
works, metho ds from undirected graphical mo dels are used; see the work
of Johnson {2003} for a description of some recent work. The key difference
to LOHMMs is that only nonter-
minals are replaced with structured, more complex entities. Thus, observation
sequences of flat symb ols and not of atoms are mo delled. Go o dman's prob
abilistic featur e grammars are
an exception. They treat terminals and nonterminals as vectors of features.
No abstraction
is made, i.e., the feature vectors are ground instances, and no unification
can b e employ ed.6. A grammar is in GNF iff all pro ductions are of the
form A aV where A is a variable, a is exactly one
terminal and V is a string of none or more variables.
7. A grammar is in CNF iff every pro duction is of the form A B; C or
A a where A; B and C are variables,
and a is a terminal.445Kersting, De Raedt, & RaikoKristian Kerstingmkdirvt100xmvnew003vt100xlsvt100xcdvt100xmkdirvt100xlsvt100xmvnew003vt100xcdvt100x{b}{a}conconconFigure
7:{a} Each atom in the logical sequence mkdir{vt100x}, mv{new003; vt100x},
ls{vt100x}, cd{vt100x} formsa tree. The shaded no des denote shared lab els
among the trees. {b} The same sequence represented as a single tree. The
pred-
icate con=2 represents the concatenation op erator.Therefore, the num b
er of parameters that needs to b e estimated b ecomes easily very large,
data sparsity is a serious problem. Go o dman applied smo othing to ov ercome
the problem.
LOHMMs are generally related to {sto chastic} tree automata {see e.g., Car-
rasco, Oncina, and Calera-Rubio, 2001}. Reconsider the Unix command sequence
mkdir{vt100x} ; mv{new003; vt100x}; ls{vt100x}; cd{vt100x} . Each atom
forms a tree, see
Figure 7 {a}, and, indeed, the whole sequence of atoms also forms a {degenerated}
tree,
see Figure 7 {b}. Tree automata pro cess single trees vertically, e.g.,
b ottom-up. A state in
the automaton is assigned to every no de in the tree. The state dep ends
on the no de lab el
and on the states asso ciated to the siblings of the no de. They do not
fo cus on sequential
domains. In contrast, LOHMMs are intended for learning in sequential domains.
They
pro cess sequences of trees horizontally, i.e., from left to right. Furthermore,
unification is used to share information b etw een consecutive sequence elements.
As Figure 7 {b}
illustrates, tree automata can only employ this information when allowing
higher-order
transitions, i.e., states dep end on their no de lab els and on the states
asso ciated to
predecessors 1; 2; : : : levels down the tree.
In the second typ e of approaches, most attention has b een devoted to developing
highly expressive formalisms, such as e.g. PCUP {Eisele, 1994}, PCLP {Riezler,
1998}, SLPs {Mug-
gleton, 1996}, PLPs {Ngo & Haddawy, 1997}, RBNs {Jaeger, 1997}, PRMs {Friedman,
Geto or, Koller, & Pfeffer, 1999}, PRISM {Sato & Kameya, 2001}, BLPs {Kersting
& De
Raedt, 2001b, 2001a}, and DPRMs {Sanghai, Domingos, & Weld, 2003}. LOHMMs
can b e
seen as an attempt tow ards downgrading such highly expressive frameworks.
Indeed, apply-
ing the main idea underlying LOHMMs to non-regular probabilistic grammar,
i.e., replacing flat symb ols with atoms, yields { in principle { sto chastic
logic programs {Muggleton, 1996}. As a consequence, LOHMMs represent an interesting
p osition on the expressiveness scale. Whereas they retain most of the essential
logical features of the more expressive formalisms,
they seem easier to understand, adapt and learn. This is akin to many contemp
orary consid-446Logical Hidden Marko v Modelserations in inductive logic
pro gr amming {Muggleton & De Raedt, 1994} and multi-relational data mining
{D024zeroski & Lavra 024c, 2001}. 8. Conclusions
Logical hidden Marko v mo dels, a new formalism for representing probability
distributions
ov er sequences of logical atoms, hav e b een intro duced and solutions
to the three central
inference problems {evaluation, most likely state sequence and parameter
estimation} hav e
b een provided. Exp eriments hav e demonstrated that unification can improv
e generalization
accuracy, that the num b er of parameters of a LOHMM can b e an order of
magnitude smaller than the num b er of parameters of the corresp onding HMM,
that the solutions presented
p erform well in practice and also that LOHMMs p ossess several advantages
ov er traditional
HMMs for applications inv olving structured sequences.
Ackno wledgmen ts The authors thank Andreas Karwath and Johannes Horstmann
for
interesting collab orations on the protein data; Ingo Thon for interesting
collab oration on
analyzing the Unix command sequences; and Saul Greenb erg for providing
the Unix com-
mand sequence data. The authors would also like to thank the anonymous reviewers
for com-
ments which considerably improv ed the pap er. This research was partly
supp orted by the
Europ ean Union IST programme under contract num b ers IST-2001-33053 and
FP6-508861
{Application of Probabilistic Inductive Logic Programming {APrIL} I and
I I}. Tapani Raiko
w as supp orted by a Marie Curie fellowship at DAISY, HPMT-CT-2001-00251.
App endix A. Pro of of Theorem 1
Let M = {006; 026; 001; 007} b e a LOHMM. To show that M sp ecifies
a time discrete sto chastic
pro cess, i.e., a sequence of random variables hX
t
i
t=1;2;:::
, where the domains of the random
variable X t
is hb{006}, the Herbrand base ov er 006, we define the immediate state
oper ator
T
M
-op erator and the current emission oper ator E
M
-op erator.Definition 4{T
M
-Oper ator, E
M
-Oper ator } The oper ators T
M
: 2 hb
006
! 2
hb
006
and E
M
:
2
hb
006
! 2
hb
006 are
T
M
{I } = fH 033
B
033 H
j 9{p : H
O
000 B } 2 M : B 033
B
2 I ; H 033
B
033
H 2 G
006
{H }g
E M
{I } = fO 033
B
033
H
033 O
j 9{p : H
O
000 B } 2 M : B 033
B
2 I ; H 033
B
033
G 2 G
006
{H }
and O 033
B
033 H
033
O 2 G
006
{O }g
For each i = 1; 2; 3; : : :, the set T
i+1 M
{fstartg} := T
M {T
i
M {fstartg}) with
T
1
M
{fstartg} := T
M
{fstartg} sp ecifies the state set at clo ck i which forms a random vari-
able Y
i . The set U
i
M
{fstartg} sp ecifies the p ossible symb ols emitted when transitioning from
i to i + 1. It forms the variable U i
. Each Y
i
{resp. U
i
} can b e extended to a random
variable Z
i
{resp. U
i
} ov er hb
006 :
P {Z i
= z } =
032
0:0 : z 62 T
i
M
{fstartg}
P {Y i
= z } : otherwise447Kersting, De Raedt, & RaikoZ1cc
Z2cc
Z3cc
E1ccE2cc
E3cc
Kristian KerstingZ1Z2Z3E3E1E2...PSfrag replacementsZ 1Z
2Z
3U
1U 2U
3Figure 8:Discrete time sto chastic pro cess induced by a LOHMM. The no
des Z
i
and U
i
represent random variables ov er hb
006
.Figure 8 depicts the influence relation among Z
i
and U
i
. Using standard arguments from probability theory and noting that P {U
i = U
i
j Z
i+1
= z
i+1
; Z
i
= z
i
} = P {Z
i+1
= z
i+1
; U
i = u
i
j Z
i
}P
u
i
P {Z
i+1
; u
i
j Z i
}
and P {Z
i+1 j Z
i
} =
X
u i
P {Z i+1
; u i
j Z
i
}
where the probability distributions are due to equation {1}, it is easy
to show that Kol- mogorov's extension theorem {see Bauer, 1991; Fristedt
and Gray, 1997} holds. Thus, M sp ecifies a unique probability distribution
ov er
N
t
i=1 {Z
i
002 U
i
} for each t > 0 and in the limit t ! 1. 003
App endix B. Mo ore Representations of LOHMMs For HMMs, Moor e representations,
i.e., output symb ols are asso ciated with states and Mealy
representations, i.e., output symb ols are asso ciated with transitions,
are equivalent. In this app endix, we will inv estigate to which extend this
also holds for LOHMMs.
Let L b e a Mealy-LOHMM according to definition 3. In the following, we
will derive
the notation of an equivalent LOHMM L
0
in Mo ore representation where there are abstract
transitions and abstract emissions {see b elow}. Each predicate b=n in L
is extended to b=n +
1 in L 0
. The domains of the first n arguments are the same as for b=n. The last
argument
will store the observation to b e emitted. More precisely, for each abstract
transition
p : h{w
1
; : : : ; w
l
}
o{v
1
;:::;v
k
}
000000 000 000 000000 b{u
1 ; : : : ; u
n
}
in L, there is an abstract transition p : h{w
1
; : : : ; w l
; o{v 0
1
; : : : ; v
0
k
}) b{u
1
; : : : ; u n
;}
in L
0
. The primes in o{v
0
1
; : : : ; v 0
k
} denote that we replaced each free
8
variables o{v
1 ; : : : ; v
k
}
by some distinguished constant symb ol, say #. Due to this, it holds that
026{h{w
1
; : : : ; w
l
}) = 026{h{w
1
; : : : ; w
l
; o{v
0
1
; : : : ; v
0 k
})} ; {6}8. A variable X 2 vars{o{v
1
; : : : ; v k
}) is free iff X 62 vars{h{w
1
; : : : ; w
l
}) [ vars{b{u
1
; : : : ; u n
}).448Logical Hidden Marko v Modelsand L
0
's output distribution can b e sp ecified using abstract emissions which
are expressions
of the form
1:0 : o{v
1
; : : : ; v
k
} h{w
1 ; : : : ; w
l ; o{v
0 1
; : : : ; v 0
k
}) : {7}
The semantics of an abstract transition in L
0
is that b eing in some
state S 0
t
2 G
006
0
{b{u
1 ; : : : ; u
n
;}} the system will make a transition into state
S
0
t +1
2 G
006 0
{h{w 1
; : : : ; w
l
; o{v
0
1
; : : : ; v
0 k
})} with probability
p 001 026{S
0
t +1
j h{w 1
; : : : ; w
l
; o{v
0
1
; : : : ; v
0 k
}) j 033 S
0
t } {8}
where 033
S
0
t
= mgu{S 0
t
; b{u
1
; : : : ; u
n ;}}. Due to Equation {6}, Equation {8} can b e rewritten
as
p 001 026{S
0 t +1
j h{w
1
; : : : ; w
l } j 033
S 0
t
} :
Due to equation {7}, the system will emit the output symb ol o
t +1
2 G
006 0
{o{v 1
; : : : ; v
k
}} in
state S
0 t +1
with probability
026{o
t +1
j o{v
1 ; : : : ; v
k
}033
S 0
t +1 033
S
0 t
}
where 033
S
0
t +1
= mgu{h{w
1
; : : : ; w
l ; o{v
0 1
; : : : ; v 0
k
}); S
0
t +1
}. Due to the construction of L
0
, there
exists a triple {S
t
; S t +1
; O
t+1
} in L for each triple {S
0
t
; S
0
t +1
; O
t +1
}, t > 0, in L
0 {and vise
versa}. Hence,b oth LOHMMs assign the same ov erall transition probability.
L and L
0
differ only in the wa y the initialize sequences h{S
0
t
; S
0
t +1
; O
t +1
i
t=0;2:::;T
{resp.
h{S
t
; S
t +1
; O
t+1
i
t=0;2:::;T
}. Whereas L starts in some state S
0
and makes a transition to S
1 emitting O
1 , the Mo ore-LOHMM L 0
is supp osed to emit a symb ol O 0
in S
0
0
b efore making a
transition to S 0
1
. We comp ensate for this using the prior distribution. The existence of
the correct prior distribution for L 0
can b e seen as follows. In L, there are only finitely many
states reachable at time t = 1, i.e, P
L
{q
0
= S} > 0 holds for only a finite set of ground
states S. The probability P
L {q
0
= s} can b e computed similar to ff
1
{S}. We set t = 1 in line
6, neglecting the condition on O
t0001
in line 10, and dropping 026{O
t0001
j O033
B
033
H } from line 14.
Completely listing all states S 2 S 1
together with P L
{q
0
= S}, i.e., P
L
{q 0
= S} : S start ,
constitutes the prior distribution of L
0
.
The argumentation basically follow ed the approach to transform a Mealy
machine into
a Mo ore machine {see e.g., Hop croft and Ullman, 1979}. Furthermore, the
mapping of a
Mo ore-LOHMM { as intro duced in the present section { into a Mealy-LOHMM
is straight-
forw ard.
App endix C. Pro of of Theorem 2 Let T b e a terminal alphab et and N a
nonterminal alphab et. A prob abilistic context-fr e e gr ammar {PCFG} G
consists of a distinguished start symb ol S 2 N plus a finite set of
pro ductions of the form p : X ! ff, where X 2 N , ff 2{N [ T }
003
and p 2 [0; 1]. For all X 2 N ,
P
:X !ff p = 1. A PCFG defines a sto chastic pro cess with senten tial forms
as states,
and leftmost rewriting steps as transitions. We denote a single rewriting
op eration of the
grammar by a single arrow !. If as a result of one ore more rewriting op
erations we are able to rewrite fi 2 {N [ T }
003
as a sequence fl 2 {N [ T } 003
of nonterminals and terminals,
then we write fi }
003 fl . The probability of this rewriting is the pro duct of all probability449Kersting,
De Raedt, & Raikovalues asso ciated to pro ductions used in the derivation.
We assume G to b e consistent, i.e.,
that the sum of all probabilities of derivations S }
003 fi such that fi 2 T
003
sum to 1:0.
We can assume that the PCFG G is in Greibach normal form. This follows from
Abney
et al.'s {1999} Theorem 6 b ecause G is consistent. Thus, every pro duction
P 2 G is of
the form p : X ! aY
1
: : : Y n
for some n 025 0. In order to enco de G as a LOHMM M , we
in tro duce {1} for each non-terminal symb ol X in G a constant symb ol
nX and {2} for each
terminal symb ol t in G a constant symb ol t. For each pro duction P 2 G,
we include an abstract transition of the form p : stack{[nY
1
; : : : ; nY n
jS]} a
000 stack{[nXjS]}, if n > 0, and
p : stack{S}
a
000 stack{[nXjS]}, if n = 0. Furthermore, we include 1:0 : stack{[s]}
start
and 1:0 : end
end 000 000 stack{[]}. It is now straightforw ard to prov e by induction
that M and G
are equivalent. 003
App endix D. Logical Hidden Marko v Mo del for Unix Command Sequences
The LOHMMs describ ed b elow mo del Unix command sequences triggered by
mkdir. To
this aim, we transformed the original Greenb erg data into a sequence of
logical atoms ov er com; mkdir{Dir; LastCom} ; ls{Dir; LastCom}; cd{Dir;
Dir; LastCom} ; cp{Dir; Dir; LastCom}
and mv{Dir; Dir; LastCom} . The domain of LastCom was fstart; com; mkdir;
ls; cd; cp; mvg.
The domain of Dir consisted of all argument entries for mkdir; ls; cd; cp;
mv in the original
dataset. Switches, pip es, etc. were neglected, and paths were made absolute.
This yields
212 constants in the domain of Dir. All original commands, which were not
mkdir; ls; cd;
cp; or mv, were represented as com. If mkdir did not app ear within 10 time
steps b efore a
command C 2 fls; cd; cp;mvg, C was represented as com. Overall, this yields
more than 451000 ground states that hav e to b e cov ered by a Marko v mo
del. The \unification" LOHMM U basically implements a second order Marko
v mo del, i.e.,
the probability of making a transition dep ends up on the current state
and the previous
state. It has 542 parameters and the following structure:
com start:
mkdir{Dir; start} start:
com com: mkdir{Dir; com} com:
end com:
Furthermore, for each C 2 fstart; comg there are
mkdir{Dir; com} mkdir{Dir;C }:
mkdir{; com} mkdir{Dir;C }:
com mkdir{Dir;C }:
end mkdir{Dir;C }:
ls{Dir; mkdir} mkdir{Dir;C }:
ls{; mkdir} mkdir{Dir;C }:
cd{Dir; mkdir} mkdir{Dir;C }:
cd{; mkdir} mkdir{Dir;C }:
cp{; Dir; mkdir} mkdir{Dir;C }:
cp{Dir;; mkdir} mkdir{Dir;C }:
cp{;; mkdir} mkdir{Dir;C }:
mv{; Dir; mkdir} mkdir{Dir;C }:
mv{Dir;; mkdir} mkdir{Dir;C }: mv{;; mkdir} mkdir{Dir;C }:450Logical
Hidden Marko v Modelstogether with for each C 2 fmkdir; ls; cd; cp; mvg and
for each C
1
2 fcd; lsg {resp.
C
2 2 fcp; mvg}
mkdir{Dir; com} C
1 {Dir;C }: mkdir{; com} C
1 {Dir;C }: com C
1 {Dir;C }: end C
1 {Dir;C }: ls{Dir;C
1
} C
1
{Dir;C }:
ls{;C
1
} C
1
{Dir;C }:
cd{Dir;C
1
} C
1
{Dir;C }:
cd{;C
1
} C
1
{Dir;C }:
cp{; Dir;C
1 } C
1 {Dir;C }: cp{Dir;;C
1
} C
1
{Dir;C }:
cp{;;C
1
} C
1
{Dir;C }:
mv{; Dir;C 1
} C 1
{Dir;C }:
mv{Dir;;C
1 } C
1 {Dir;C }: mv{;;C
1
} C
1
{Dir;C }:
mkdir{; com} C
2
{From; To;C }: com C
2 {From; To;C }:
end C 2
{From; To;C }:
ls{From;C
2
} C
2
{From; To;C }:
ls{To;C 2
} C 2
{From; To;C }:
ls{;C
2 } C
2 {From; To;C }:
cd{From;C
2
} C
2
{From; To;C }: cd{To;C
2
} C
2
{From; To;C }:
cd{;C
2
} C
2
{From; To;C }:
cp{From;;C
2 } C
2 {From; To;C }:
cp{; To;C
2 } C
2 {From; To;C }:
cp{;;C
2 } C
2 {From; To;C }:
mv{From;;C
2 } C
2 {From; To;C }:
mv{; To;C
2 } C
2 {From; To;C }:
mv{;;C
2 } C
2 {From; To;C }:
Because all states are fully observable, we omitted the output symb ols
asso ciated with
clauses, and, for the sake of simplicity, we omitted asso ciated probability
values.
The \no unification" LOHMM N is the variant of U where no variables were
shared
such as
mkdir{; com} cp{From; To;C }: com cp{From; To;C }:
end cp{From; To;C }:
ls{; cp} cp{From; To;C }:
cd{; cp} cp{From; To;C }:
cp{;; cp} cp{From; To;C }:
mv{;; cp} cp{From; To;C }:
Because only transitions are affected, N has 164 parameters less than U
, i.e., 378.
App endix E. Tree-based LOHMMfor mRNA Sequences
The LOHMM pro cesses the no des of mRNA trees in in-order. The structure
of the LOHMM
is shown at the end of the section. There are copies of the shaded parts.
Terms are
abbreviated using their starting alphanumerical; tr stands for tree, he
for helical, si for
single, nuc for nucleotide, and nucp for nucleotidepair.
The domain of #Children cov ers the maximal branching factor found in the
data, i.e.,
f[c]; [c; c]; : : : ; [c; c; c; c; c; c; c; c; c]g; the domain of Type consists
of all typ es o ccurring in
the data, i.e., fstem; single; bulge3; bulge5; hairping; and for Size ,
the domain cov ers the maximal length of a secondary structure element in
the data, i.e., the longest sequence
of consecutive bases resp ectively base pairs constituting a secondary structure
element.
The length was enco ded as fn
1
{0}; n
2 {0}; : : : ; n
13
{0}g where n
m {0} denotes the recursive
application of the functor n m times. For Base and BasePair , the domains
were the 4 bases
resp ectively the 16 base pairs. In total, there are 491 parameters.451Kersting,
De Raedt, & RaikoKristian Kerstingse{T; L; s{Id}; B; [s{Id} 000 B; Pa 000
[C2jCs]jR]}0:25 : he{s{Id}; Pa; B; T; L}0:25 : si{s{Id}; Pa; []; T; L}se{hairpin;
A; Id; B; S}se{hairpin; n{A}; Id; B; S}se{hairpin; n{0}; Id; B; S}se{stem;
A; Id; B; S}se{stem; n{0}; s{};; []}se{stem; n{0}; Id; B; S}0:0625 : nucp{a;
a}Copies for nuc{g}, nuc{c}, and nuc{u}0:25 : nuc{a}Copies for each length
of sequence n{n{0}), n{n{n{0})}, n{n{n{n{0})})0:0625 : nucp{a; a}0:25
: nuc{a}0:0625 : nucp{a; a}0:25 : nuc{a}Copies for n{n{0}) and n{n{n{0})}Copies
for each typ e single, bulge3, bulge5endCopies for nucp{a; g}; : : : ; nucp{u;
u}0:25 : he{s{Id}; Pa; []; T; L}tr{Id;; [Pa 000 [C1; C2jCs]jR]}tr{0; X;
[0 000 X]}0:25 : si{s{Id}; Pa; B; T; L}se{T; L; s{Id}; []; [Pa 000 [C2jCs]jR]}mystartstart1:01:0
: root{0; root; X}and tr{Id; [c; c; c]; [Pa 000 [C]jR]}Copies for tr{Id;
[c]; [Pa 000 [C]jR]}, tr{Id; [c; c]; [Pa 000 [C]jR]},Copies for tr{Id;
[c]; [Pa 000 [C1; C2jCs]jR]}, tr{Id; [c; c]; [Pa 000 [C1; C2jCs]jR]},and
tr{Id; [c; c; c]; [Pa 000 [C1; C2jCs]jR]}tr{Id;; [Pa 000 [C]jR]}0:25 :
he{s{Id}; Pa; B; T; L}0:25 : si{s{Id}; Pa; B; T; L}0:25 : he{s{Id}; Pa; [];
T; L}se{T; L; s{Id}; []; R}se{T; L; s{Id}; B; [s{Id} 000 BjR]}0:25 : si{s{Id};
Pa; []; T; L}se{hairpin; n{0}; s{};; []}Copies for nuc{g}, nuc{c}, and nuc{u}tr{Id;
B; S}se{stem; n{A}; Id; B; S}Copies for nucp{a; g}; : : : ; nucp{u; u}mo
deltreesequencemo delFigure 9:The mRNA LOHMM structure. The symb oldenotes
anonymous variables which are read and treated as distinct, new variables
each time they are encountered.
There are copies of the shaded part. Terms are abbreviated using their starting
alphanumerical; tr stands for tree, se for structureelement, he for helical,
si for single, nuc for nucleotide, and nucp for nucleotidepair.ReferencesAbney,
S. {1997}. Sto chastic Attribute-Value Grammars. Computational Linguistics,
23 {4},
597{618.452Logical Hidden Marko v ModelsAbney, S., McAllester, D., & Pereira,
F. {1999}. Relating probabilistic grammars and au-
tomata. In Pro c e e dings of 37th Annual Meeting of the Association for
Computational
Linguistics {ACL-1999}, pp. 542{549. Morgan Kaufmann.Anderson, C., Domingos,
P., & Weld, D. {2002}. Relational Marko v Mo dels and their Ap-
plication to Adaptive Web Navigation. In Pro c e e dings of the Eighth International
Conferenc e on Know ledge Discovery and Data Mining {KDD-2002}, pp. 143{152
Ed-
monton, Canada. ACM Press.Baker, J. {1979}. Trainable grammars for sp eech
recognition. In Spe e ch communic ation
p ap er presente d at th 97th Meeting of the Acoustic al Society of America,
pp. 547{550 Boston, MA.Bauer, H. {1991}. Wahrscheinlichkeitstheorie {4. edition}.
Walter de Gruyter, Berlin, New
York.Baum, L. {1972}. An inequality and asso ciated maximization technique
in statistical esti- mation for probabilistic functions of marko v pro cesses.
Inequalities, 3, 1{8.Bohneb eck, U., Horvath, T., & Wrob el, S. {1998}.
Term comparison in first-order similarity
measures. In Pro c e e dings of the Eigth International Conferenc e on Inductive
Lo gic Pr o gr amming {ILP-98}, Vol. 1446 of LNCS, pp. 65{79. Springer.Bresnan,
J. {2001}. Lexical-Functional Syntax. Blackw ell, Malden, MA.Carrasco, R.,
Oncina, J., & Calera-Rubio, J. {2001}. Sto chastic inference of regular tree
languages. Machine Le arning, 44 {1/2}, 185{197.Chandonia, J., Hon, G., Walker,
N., Lo Conte, L., P.Ko ehl, & Brenner, S. {2004}. The
ASTRAL comp endium in 2004. Nucleic Acids Rese ar ch, 32, D189{D192.Davison,
B., & Hirsh, H. {1998}. Predicting Sequences of User Actions. In Pre dicting
the
Future: AI Appro aches to Time-Series Analysis, pp. 5{12. AAAI Press.De
Raedt, L., & Kersting, K. {2003}. Probabilistic Logic Learning. ACM-SIGKDD
Explo-
rations: Spe cial issue on Multi-Relational Data Mining, 5 {1}, 31{48.De
Raedt, L., & Kersting, K. {2004}. Probabilistic Inductive Logic Programming.
In
Ben-David, S., Case, J., & Maruoka, A. {Eds.}, Pro c e e dings of the 15th
International
Conferenc e on Algorithmic Le arning Theory {ALT-2004}, Vol. 3244 of LNCS,
pp.
19{36 Pado va, Italy. Springer.Durbin, R., Eddy, S., Krogh, A., & Mitchison,
G. {1998}. Biologic al sequenc e analysis:
Prob abilistic models of proteins and nucleic acids. Cambridge Universit
y Press.D024zeroski, S., & Lavra 024c, N. {Eds.}. {2001}. Relational data
mining. Springer-Verlag, Berlin.Eddy, S., & Durbin, R. {1994}. RNA sequence
analysis using covariance mo dels. Nucleic
Acids Res., 22 {11}, 2079{2088.453Kersting, De Raedt, & RaikoEisele, A.
{1994}. Tow ards probabilistic extensions of contrain t-based grammars. In
Dªorne, J. {Ed.}, Computational Aspe cts of Constraint-Base d Linguistics
Decription-II.
DYNA-2 deliverable R1.2.B.Fine, S., Singer, Y., & Tishb y, N. {1998}. The
hierarchical hidden marko v mo del: analysis
and applications. Machine Le arning, 32, 41{62.Frasconi, P., So da, G.,
& Vullo, A. {2002}. Hidden marko v mo dels for text categorization in multi-page
do cuments. Journal of Intel ligent Information Systems, 18, 195{217.Friedman,
N., Geto or, L., Koller, D., & Pfeffer, A. {1999}. Learning probabilistic
relational
mo dels. In Pro c e e dings of Sixteenth International Joint Conferenc e
on Artificial In-
tel ligence {IJCAI-1999}, pp. 1300{1307. Morgan Kaufmann.Fristedt, B., &
Gray, L. {1997}. A Modern Appro ach to Prob ability Theory. Probability and
its applications. Birkhªauser Boston.Ghahramani, Z., & Jordan, M. {1997}.
Factorial hidden Marko v mo dels. Machine Le arning,
29, 245{273.Go o dman, J. {1997}. Probabilistic feature grammars. In Pro
c e e dings of the Fifth Interna-
tional Workshop on Parsing Technolo gies {IWPT-97} Boston, MA, USA.Greenb
erg, S. {1988}. Using Unix: collected traces of 168 users. Tech. rep., Dept.
of
Computer Science, Universit y of Calgary, Alb erta.Hop croft, J., & Ullman,
J. {1979}. Intro duction to Automata Theory, Languages, and
Computation. Addison-Wesley Publishing Company.Horvath, T., Wrob el, S.,
& Bohneb eck, U.{2001}. Relational Instance-Based learning with
Lists and Terms. Machine Le arning, 43 {1/2}, 53{80.Hubbard, T., Murzin,
A., Brenner, S., & Chotia, C. {1997}. SCOP : a structural classification
of proteins database. NAR, 27 {1}, 236{239.Jacobs, N., & Blo ck eel, H.
{2001}. The Learning Shell: Automated Macro Construction. In User Modeling
2001, pp. 34{43.Jaeger, M. {1997}. Relational Bay esian netw orks. In Pro
c e e dings of the Thirteenth Confer- ence on Uncertainty in Artificial Intel
ligence {UAI}, pp. 266{273. Morgan Kaufmann.Katz, S. {1987}. Estimation of
probabilities from sparse data for hte language mo del com- p onent of a
sp eech recognizer. IEEE Transactions on Ac oustics, Spe e ch, and Signal
Pro c essing {ASSP}, 35, 400{401.Kersting, K., & De Raedt, L. {2001a}. Adaptive
Bay esian Logic Programs. In Rouveirol,
C., & Sebag, M. {Eds.}, Pro c e e dings of the 11th International Conferenc
e on Inductive
Lo gic Pro gr amming {ILP-01}, Vol. 2157 of LNAI, pp. 118{131. Springer.454Logical
Hidden Marko v ModelsKersting, K., & De Raedt, L. {2001b}. Tow ards Combining
Inductive Logic Programming
with Bay esian Netw orks. In Rouveirol, C., & Sebag, M. {Eds.}, Pro c e
e dings of the
11th International Conferenc e on Inductive Lo gic Pro gr amming {ILP-01},
Vol. 2157
of LNAI, pp. 118{131. Springer.Kersting, K., & Raiko, T. {2005}. 'Say EM'
for Selecting Probabilistic Mo dels for Logical Sequences. In Bacch us, F.,
& Jaakkola, T. {Eds.}, Pro c e e dings of the 21st Conferenc e
on Uncertainty in Artificial Intel ligence, UAI 2005, pp. 300{307 Edinburgh,
Scotland.Kersting, K., Raiko, T., Kramer, S., & De Raedt, L. {2003}. Tow
ards discov ering struc- tural signatures of protein folds based on logical
hidden marko v mo dels. In Altman,
R., Dunker, A., Hunter, L., Jung, T., & Klein, T. {Eds.}, Pro c e e dings
of the Pa-
cific Symposium on Bioc omputing {PSB-03}, pp. 192{203 Kauai, Haw aii, USA.
World Scientific.Koivisto, M., Kivio ja, T., Mannila, H., Rastas, P., & Ukkonen,
E. {2004}. Hidden Marko v Mo delling Techniques for Haplotyp e Analysis.
In Ben-David, S., Case, J., & Maruoka,
A. {Eds.}, Pro c e e dings of 15th International Conferenc e on Algorithmic
Le arning The-
ory {ALT-04}, Vol. 3244 of LNCS, pp. 37{52. Springer.Koivisto, M., Perola,
M., Varilo, T., Hennah, W., Ekelund, J., Lukk, M., Peltonen, L.,
Ukkonen, E., & Mannila, H. {2002}. An MDL metho d for finding haplotyp e
blo cks
and for estimating the strength of haplotyp e blo ck b oundaries. In Altman,
R., Dunker,
A., Hunter, L., Jung, T., & Klein, T. {Eds.}, Pro c e e dings of the Pacific
Symposium
on Bioc omputing {PSB-02}, pp. 502{513. World Scientific.Korv emak er, B.,
& Greiner, R. {2000}. Predicting UNIX command files: Adjusting to user patterns.
In Adaptive User Interfaces: Papers from the 2000 AAAI Spring Symposium,
pp. 59{64.Kulp, D., Haussler, D., Reese, M., & Eeckman, F. {1996}. A Generalized
Hidden Marko v
Mo del for the Recognition of Human Genes in DNA. In States, D., Agarwal,
P.,
Gaasterland, T., Hunter, L., & Smith, R. {Eds.}, Pro c e e dings of the
Fourth Interna-
tional Conferenc e on Intel ligent Systems for Molecular Biology,{ISMB-96},
pp. 134{
142 St. Louis, MO, USA. AAAI.Lane, T. {1999}. Hidden Marko v Mo dels for
Human/Computer Interface Mo deling. In
Rudstrªom, 027
A. {Ed.}, Pro c e e dings of the IJCAI-99 Workshop on Le arning about Users,
pp. 35{44 Sto ckholm, Sweden.Lari, K., & Young, S. {1990}. The estimation
of sto chastic context-free grammars using the inside-outside algorithm.
Computer Spe e ch and Language, 4, 35{56.Levy, L., & Joshi, A. {1978}. Skeletal
structural descriptions. Information and Control, 2 {2}, 192{211.McLachlan,
G., & Krishnan, T. {1997}. The EM Algorithm and Extensions. Wiley, New York.455Kersting,
De Raedt, & RaikoMitchell, T. M. {1997}. Machine Le arning. The McGraw-Hill
Companies, Inc.Muggleton, S. {1996}. Sto chastic logic programs. In De Raedt,
L. {Ed.}, Advanc es in
Inductive Lo gic Pro gr amming, pp. 254{264. IOS Press.Muggleton, S., &
De Raedt, L. {1994}. Inductive logic programming: Theory and metho ds.
Journal of Lo gic Pro gr amming, 19 {20}, 629{679.Ngo, L., & Haddawy, P.
{1997}. Answering queries from context-sensitiv e probabilistic
knowledge bases. Theor etic al Computer Science, 171, 147{177.Pollard, C.,
& Sag, I. {1994}. Head-driven Phrase Structure Grammar. The Universit y of
Chicago Press, Chicago.Rabiner, L., & Juang, B.{1986}. An Intro duction to
Hidden Marko v Mo dels. IEEE ASSP Magazine, 3 {1}, 4{16.Riezler, S. {1998}.
Statistical inference and probabilistic mo delling for constraint-based nlp.
In Schrder, B., Lenders, W., & und T. Portele, W. H. {Eds.}, Pro c e e dings
of
the 4th Conferenc e on Natural Language Pro c essing {KONVENS-98}. Also
as CoRR cs.CL/9905010.Sakakibara, Y. {1992}. Efficient learning of context-free
grammars from p ositive structural
examples. Information and Computation, 97 {1}, 23{60.Sakakibara, Y. {2003}.
Pair hidden marko v mo dels on tree structures. Bioinformatics,
19 {Suppl.1}, i232{i240.Sakakibara, Y., Brown, M., Hughey, R., Mian, I.,
Sjolander, K., & Underwo o d, R. {1994}. Sto chastic context-free grammars
for tRNA mo delling. Nucleic Acids Rese ar ch, 22 {23}, 5112{5120.Sanghai,
S., Domingos, P., & Weld, D. {2003}. Dynamic probabilistic relational mo
dels. In Gottlob, G., & Walsh, T. {Eds.}, Pro c e e dings of the Eighteenth
International Joint
Conferenc e on Artificial Intel ligence {IJCAI-03}, pp. 992{997 Acapulco,
Mexico. Mor- gan Kaufmann.Sato, T., & Kameya, Y. {2001}. Parameter learning
of logic programs for symb olic-statistical mo deling. Journal of Artificial
Intel ligence Rese ar ch {JAIR}, 15, 391{454.Schªolkopf, B., & Warmuth, M.
{Eds.}. {2003}. Le arning and Parsing Stochastic Unification-
Base d Grammars, Vol. 2777 of LNCS. Springer.Turcotte, M., Muggleton, S.,
& Sternb erg, M. {2001}. The effect of relational background kno wledge on
learning of protein three-dimensional fold signatures. Machine Le arning,
43 {1/2}, 81{95.Won, K., Pr ªugel-Bennett, A., & Krogh, A. {2004}. The Blo
ck Hidden Marko v Mo del for Bi-
ological Sequence Analysis. In Negoita, M., Howlett, R., & Jain, L. {Eds.},
Pro c e e dings
of the Eighth International Conferenc e on Know ledge-Base d Intel ligent
Information
and Engineering Systems {KES-04}, Vol. 3213 of LNCS, pp. 64{70. Springer.456