Effective Dimensions of Hierarchical Latent Class Models

N. L. Zhang and T. Kocka

Volume 21, 2004

Links to Full Text:

Abstract:
Hierarchical latent class (HLC) models are tree-structured Bayesian networks where leaf nodes are observed while internal nodes are latent. There are no theoretically well justified model selection criteria for HLC models in particular and Bayesian networks with latent nodes in general. Nonetheless, empirical studies suggest that the BIC score is a reasonable criterion to use in practice for learning HLC models. Empirical studies also suggest that sometimes model selection can be improved if standard model dimension is replaced with effective model dimension in the penalty term of the BIC score.

Extracted Text


Journal of Artificial Intelligence Research 21 (2004) 1-17 Submitted 7/03;
published 1/04

Effective Dimensions of Hierarchical Latent Class Models Nevin L. Zhang lzhang@cs.ust.hkDepartment
of Computer Science Hong Kong University of Science and Technology, China
Tom'a^s Ko^cka kocka@lisp.vse.cz Laboratory for Intelligent Systems Prague
Prague University of Economics, Czech Republic

Abstract Hierarchical latent class (HLC) models are tree-structured Bayesian
networks whereleaf nodes are observed while internal nodes are latent. There
are no theoretically well

justified model selection criteria for HLC models in particular and Bayesian
networks withlatent nodes in general. Nonetheless, empirical studies suggest
that the BIC score is a reasonable criterion to use in practice for learning
HLC models. Empirical studies alsosuggest that sometimes model selection
can be improved if standard model dimension is replaced with effective model
dimension in the penalty term of the BIC score.

Effective dimensions are difficult to compute. In this paper, we prove a
theorem thatrelates the effective dimension of an HLC model to the effective
dimensions of a number

of latent class models. The theorem makes it computationally feasible to
compute theeffective dimensions of large HLC models. The theorem can also
be used to compute the effective dimensions of general tree models.

1. Introduction Hierarchical latent class (HLC) models (Zhang, 2002) are
tree-structured Bayesian networks (BNs) where leaf nodes are observed while
internal nodes are latent. They generalize latent class models (Lazarsfeld
and Henry, 1968) and were first identified as a potentially useful class
of Bayesian networks by Pearl (1988). We are concerned with learning HLC
models from data. A fundamental question is how to select among competing
models.

The BIC score (Schwarz, 1978) is a popular metric that researchers use to
select among Bayesian network models. It consists of a loglikelihood term
that measures the fitness to data and a penalty term that depends linearly
upon standard model dimension, i.e. the number of linearly independent standard
model parameters. When all variables are observed, the BIC score is an asymptotic
approximation of (the logarithm) of the marginal likelihood (Schwarz, 1978).
It is also consistent in the sense that, given sufficient data, the BIC score
of the generative model -- the model from which data were sampled -- is larger
than those of any other models that are not equivalent to the generative
model.

When latent variables are present, the BIC score is no longer an asymptotic
approximation of the marginal likelihood (Geiger et al., 1996). This can
be remedied, to some extent, using the concept of effective model dimension.
In fact if we replace standard model dimension with effective model dimension
in the BIC score, the resulting scoring function, called the BICe score,
is an asymptotic approximation of the marginal likelihood almost everywhere
except for some singular points (Rusakov and Geiger, 2002).

cr^2004 AI Access Foundation. All rights reserved.

Zhang & Ko^cka Neither BIC nor BICe have been proved to be consistent for
latent variable models. As a matter of fact, it has not even been defined
what it means for a model selection criterion to be consistent for latent
variable models. Empirical studies suggest that the BIC score is well-behaved
in practice for the task of learning HLC models. There are three related
searchbased algorithms for learning HLC models, namely double hill-climbing
(DHC) (Zhang, 2002), single hill-climbing (SHC) (Zhang et al., 2003), and
heuristic SHC (HSHC) (Zhang, 2003). In the absence of a theoretically well
justified model selection criterion, Zhang (2002) tested DHC with four existing
scoring functions, namely the AIC score (Akaike, 1974), the BIC score, the
Cheeseman-Stutz (CS) score (Cheeseman and Stutz, 1995), and the holdout logarithmic
score (HLS)(Cowell et al., 1999). Both real-world and synthetic data were
used. On the real-world data, BIC and CS have enabled DHC to find models
that are regarded as the best by domain experts. On the synthetic data, BIC
and CS have enabled DHC to find models that either are identical to or resemble
closely the true generative models. When coupled with AIC and HLS, on the
other hand, DHC performed significantly worse. SHC and HSHC were tested on
synthetic data sampled from fairly large HLC models (as much as 28 nodes).
Only BIC was used in those tests. In all cases, BIC has enabled SHC and HSHC
to find models that either are identical to or resemble closely the true
generative models. Those empirical results not only indicate that the algorithms
perform well, but also suggest that the BIC is a reasonable scoring function
to use for learning HLC models.

The experiments also reveal that model selection can sometimes be improved
if the BICe score is used instead of the BIC score. We will explain this
in detail in Section 3

In order to use the BICe score in practice, we need a way to compute effective
dimensions. This is not a trivial task. The effective dimension of an HLC
model is the rank of the Jacobian matrix of the mapping from the parameters
of the model to the parameters of the joint distribution of the observed
variables. The number of rows in the Jacobian matrix increases exponentially
with the number of observed variables. The construction of the Jacobian matrix
and the calculation of its rank are both computationally demanding. Moreover
they have to be done algebraically or with very high numerical precision
to avoid degenerate cases. The necessary precision grows with the size of
the matrix.

Settimi and Smith (1998, 1999) studied effective dimensions for two classes
of models: trees with binary variables and latent class (LC) models with
two observed variables. They have obtained a complete characterization of
these two classes. Geiger et al. (1996) computed the effective dimensions
of a number of models. They conjectured that it is rare for the effective
and standard dimensions of an LC model to differ. As a matter of fact, they
found only one such model. Kocka and Zhang (2002) found quite a number of
LC models whose effective and standard dimensions differ. They also proposed
an easily computable formula for estimating effective dimensions of LC models.
The estimation formula has been empirically shown to be very accurate.

In this paper, we prove a theorem that relates the effective dimension of
an HLC model to the effective dimensions of two other HLC models that contain
fewer latent variables. Repeated application of the theorem allows one to
reduce the task of computing the effective dimension of an HLC model to subtasks
of computing effective dimensions of LC models. This makes it computationally
feasible to compute the effective dimensions of large HLC models.

2

Effective Dimensions of HLC Models We start in Section 2 with a formal definition
of effective dimensions for Bayesian networks with latent variables. In Section
3, we provide empirical evidence that suggest the use of BICe instead of
BIC sometimes improves model selection. Section 4 presents the main theorem
and Section 5 is devoted to the proof of the theorem. In Section 6, we prove
a theorem about effective dimensions of general tree models and explain how
this and our main theorem allows one to compute the effective dimension of
arbitrary tree models. Finally, concluding remarks are provided in Section
7.

2. Effective Dimensions of Bayesian Networks In this paper, we use capital
letters such as X and Y to denote variables and lower case letters such as
x and y to denote states of variables. The domain and cardinality of a variable
X will be denoted by Omega X and |X| respectively. Bold face capital letters
such as Y denote sets of variables. Omega Y denotes the Cartesian product
of the domains of all variables in the set Y. Elements of Omega Y will be
denoted by bold lower case letters such as y and will sometimes be referred
to as states of Y. We will consider only variables that have a finite number
of states.

Consider a Bayesian network model M that possibly contains latent variables.
The standard dimension ds(M ) of M is the number of linearly independent
parameters in the standard parameterization of M . The parameters denote,
for each variable and each parent configuration of the variable, the probability
that the variable is in some state (except one) given the parent configuration.
Suppose M consist of k variables x1, x2, . . . , xk. Let ri and qi be respectively
the number of states of xi and the number of all possible combinations of
the states of its parents. If xi has no parent, let qi be 1. Then ds(M )
is given by

ds(M ) =

kX

i=1

qi(ri - 1).

For notational simplicity, denote the standard dimension of M by n. Let ~t,=(t,1,
t,2, . . . , t,n) be a vector of n linearly independent model parameters
of M . Further let Y be the set of observed variables. Suppose Y has m+1
possible states. We enumerate the first m states as y1, y1, . . . , ym.

For any i (1<=i<=m), P (yi) is a function of the parameters ~t,. So we have
a mapping from the n dimensional parameter space (a subspace of Rn) to Rm,
namely T : (t,1, t,2, . . . , t,n) ` (P (y1), P (y2), . . . , P (ym)). The
Jacobian matrix of this mapping is the following m*n matrix:

JM (~t,) = [Jij] = [ @P (yi)@t,

j ]

For convenience, we will often write the matrix as JM = [ @P (Y)@t,j ], with
the understanding that elements of the j-th column are obtained by allowing
Y run over all its possible states except one.

For each i, P (yi) is a function of ~t,. For most commonly used parameterizations
of Bayesian networks, it is actually a polynomial function of ~t,. Hence
we make the following assumption:

3

Zhang & Ko^cka Assumption 1 The Bayesian network M is so parameterized that
the parameters for the joint distribution of the observed variables are polynomial
functions of the parameters for M .

An obvious consequence of the assumption is that elements of JM are also
polynomial functions of ~t,.

For a given value of ~t,, JM is a matrix of real numbers. Due to Assumption
1, the rank of this matrix is some constant d almost everywhere in the parameter
space (Geiger et al., 1996. Also see Section 5.1.). To be more specific,
the rank is d everywhere except in a set of measure zero where it is smaller
than d. The constant is called the regular rank of JM .

The regular rank of JM is also called the effective dimension of the Bayesian
network model M . Hence we denote it by de(M ). To understand the term "effective
dimension", consider the subspace of Rm spanned by the joint probability
P (Y) of observed variables, or equivalently the range of the mapping T .
The term reflects the fact that, for almost every value of ~t,, a small enough
open ball around T (~t,) resembles Euclidean space of dimension d (Geiger
et al., 1996).

There are multiple ways to parameterize a given Bayesian network model. However,
the choice of parameterization does not affect the space spanned by the joint
probability P (Y). Together with the interpretation of the previous paragraph,
this implies that the definition of effective dimension does not depend on
the particular parameterization that one uses.

3. Selecting among HLC Models A hierarchical latent class (HLC) model is
a Bayesian network where (1) the network structure is a rooted tree and (2)
the variables at the leaf nodes are observed and all the other variables
are not. The observed variables are sometimes referred to as manifest variables
and all the other variables as latent variables. Figure 1 shows the structures
of two HLC models. A latent class (LC) model is an HLC model where there
is only one latent variable.

The theme of this paper is the computation of effective dimensions of HLC
models. As mentioned in the introduction, this is interesting because effective
dimension, when used in the BIC score, gives us a better approximation of
the marginal likelihood. In this section, we give an example to illustrate
that the use of effective dimension sometimes also leads to better model
selection. We will also motivate and introduce the concept of regularity
that will be used in subsequent sections.

3.1 An Example of Model Selection Consider the two HLC models shown in Figure
1. In one experiment, we instantiated the parameters of M1 in a random fashion
and sampled a set D1 of 10,000 data records on the observed variables. Then
we ran SHC and HSHC on the data set D1 under the guidance of the BIC score.
Both algorithms produced model M2. In the following, we explain why, based
on D1, one would prefer M2 over M1 if BIC is used for model selection and
why M1 would be preferred if BICe is used instead. We argue that M1 should
be preferred based on D1 and hence BICe is a better scoring metric for this
case.

4

Effective Dimensions of HLC Models

X1 X2 Y1 Y2 Y3

X3 Y4 Y5 Y6

M1

X2 Y1 Y2 Y3 X3

Y4 Y5 Y6

M2 Figure 1: Two HLC models. The shaded variables are latent, while the other
variables are

observed. The cardinality of X1 is 2, while cardinalities of all other variables
are 3.

The BIC and BICe scores of a model M given a data set D are defined as follows:

BIC(M |D) = logP (D|M, ~t,*) - ds(M )2 logN, BICe(M |D) = logP (D|M, ~t,*)
- de(M )2 logN where ~t,* is the maximum likelihood estimate of the parameters
of M based on D and N is the sample size.

In our example, notice that M2 includes M1 in the sense that M2 can represent
any probability distributions of the observed variables that M1 can. In fact,
if we make the conditional probability distributions of the observed variables
in M2 the same as in M1 and set PM2(X2) and PM2(X3|X2) such that

PM2(X2)PM2(X3|X2) = X

X1

PM1(X1)PM1(X2|X1)PM1(X3|X1),

then the probability distribution of the observed variables in the two models
are identical.

Because M2 includes M1, we have logP (D1|M1, ~t,*1) <= logP (D1|M2, ~t,*2).
Together with the fact that D1 is sampled from M1, this implies that logP
(D1|M1, ~t,*1) ij logP (D1|M2, ~t,*2) for sufficiently large enough sample
size. The standard dimension of M1 is 45, while that of M2 is 44. Hence

BIC(M1|D1) < BIC(M2|D1).

On the other hand, the effective dimensions of M1 and M2 are 43 and 44 respectively.
Hence

BICe(M1|D1) > BICe(M2|D1). Model M2 includes M1. The opposite is clearly
not true because the effective dimension of M1 is smaller than that of M2.
So, M2 is in reality a more complex model than M1. Both model fit data D1
equally well. Hence the simpler one, i.e. M1, should be preferred over the
other. This agrees with the choice of the BICe score, while disagrees with
the choice of the BIC score. Hence, BICe is more appropriate than BIC in
this case.

5

Zhang & Ko^cka 3.2 Regularity Now consider another model M 01 that is the
same as M1 except that the cardinality of X1 is increased from 2 to 3. It
is easy to show that M2 includes M 01 and vice versa. So, the two models
are equivalent in terms of their capabilities of representing probability
distributions of the observed variables. They are hence said to be marginally
equivalent. However, M 01 has more standard parameters than M2 and hence
we would always prefer M2 over M 01. To formalize this consideration, we
introduce a concept of regularity.

For a latent variable Z in an HLC model, enumerate its neighbors (parent
and children) as X1, X2, . . . , Xk. An HLC model is regular if for any latent
variable Z,

|Z| <= Q

ki=1 |Xi|

maxki=1 |Xi| , (1) and the strict inequality holds when Z has two neighbors
and at least one of them is a latent node. Models M1 and M2 are regular,
while model M 01 is not.

For any irregular model M there always exists a regular model that is marginally
equivalent to M and has fewer standard parameters (Zhang, 2003b). The regular
model can be obtained from M as follows: For any latent node that has only
two neighbors and its cardinality is no smaller than that of one of the neighbors,
then remove the latent node and connect the two neighbors. For any latent
node that has more than two neighbors and that violates (1), reduce it's
cardinality to the quantity on the right hand side. Repeat both steps until
no more changes can be made.

It is also interesting to note that the collection of all regular HLC models
for a given set of observed variables is finite (Zhang, 2002). This provides
a finite search space for the task of learning regular HLC models.1 In the
rest of this paper, we will consider only regular HLC models.

Before ending this subsection, we point out a nice property of effective
model dimension in relation to model inclusion. If an HLC model includes
another model, then its effective dimension is no less than that of the latter.
As a consequence, two marginally equivalent models have the same effective
dimensions and hence the same BICe score. The same is not true for standard
model dimension and the BIC score.

3.3 The CS and CSe Scores We have argued on empirical grounds that the BIC
score is a reasonable scoring function to use for learning HLC models and
that the BICe score can sometimes improve model selection. But the two scores
are not free of problems. One problem is that their derivation as Laplace
approximations of the marginal likelihood are not valid at the boundary of
the parameter space. The CS score in a way alleviates this problem. It involves
the BIC score based on completed data and the BIC score based on original
data. In other words, it involves two Laplace approximations of the marginal
likelihood. It lets errors in the two approximation cancel each other.

Chickering and Heckerman (1997) empirically found the CS score to be a quite
accurate approximation of the marginal likelihood and robust at the boundary
of the parameter

1. The definition of regularity given in this paper is slightly different
from the one given in Zhang (2002).

Nonetheless, the two conclusions mentioned in this paragraph remain true.

6

Effective Dimensions of HLC Models

Y

21 MMM

X Z

YO ZXX Z O

Figure 2: Problem reduction. space. They realized the need for effective
model dimension in the CS score, although they did not actually use it. This
would not have made any differences to their experiments because, for the
models they used, the standard and effective dimensions agree.

We use CSe to refer to the scoring function one obtains by replacing standard
model dimension in the CS score with effective model dimensions. Just as
BICe is better than BIC as approximations of the marginal likelihood (Geiger
et al., 1996), CSe is better than CS. To compute CSe, we also need to calculate
effective dimensions.

4. Effective Dimensions of HLC Models As we have seen, effective model dimension
is interesting for a number of reasons. Our main result in this paper is
a theorem about the effective dimension de(M ) of a regular HLC model M that
contains more than one latent variable. Let X be the root of M , which is
a latent node. Because there are at least two latent nodes, there must exist
another latent node Z that is a child of X. In the following, we will use
the terms X-branch and Z-branch to respectively refer to the sets of nodes
that are separated from Z by X or from X by Z. Let Y be the set of observed
variables in the Z-branch and let O be the set of all other observed variables.
Note that the X-branch doesn't contain the node X. The relationship among
X, Z, Y, and O is depicted in the left-most picture of Figure 2.

The standard parameterization of M includes parameters for P (X) and parameters
for P (Z|X). For convenience, we replace those parameters with parameters
for P (X, Z). As mentioned at the end of Section 2, such reparameterization
does not affect the effective dimension de(M ). To reflect the reparameterization,
the edge between X and Z is not directed in Figure 2.

Suppose P (X, Z) has k0 parameters t,(0)1 , t,(0)2 , . . . , t,(0)k0 . Suppose
the conditional distributions of variables in the X-branch consists of k1
parameters t,(1)1 , t,(1)2 , . . . , t,(1)k1 and the conditional distributions
of variables in the Z-branch consists of k2 parameters t,(2)1 , t,(2)2 ,
. . . , t,(2)k2 . For convenience we will sometimes refer to those three
groups of parameters using three vectors ~t,(0), ~t,(1) and ~t,(2) respectively.

In the following, we will define two other HLC models M1 and M2 starting
from M and establish a relationship between their effective dimensions and
the effective dimension of M . In this context, M , M1, and M2 are regarded
purely as Mathematical objects. The semantics of their variables are of no
concern. In particular, a variable H that is latent

7

Zhang & Ko^cka X1(6) X2(3) X4(5) Y1(3) Y2(2)

Y3(3) X3(3)

X5(5) Y4(2) Y5(6)

X1 X2 Y3 X3 X2

X4 X1

X3 X1 X5 X4 Y1 Y2 X2

X5 X3 Y4 Y5

Figure 3: The picture on the left shows an HLC model with five observed and
five latent

variables, each variable is annotated by its name and its cardinality. The
picture on the right shows the components we can decompose the HLC model
into by applying Theorem 1. Latent variables are shaded, while observed variables
are not.

in M might be designated to be observed in M1 or M2 as part of the definition
of those Mathematical objects.

We obtain a Bayesian network model B1 from M by deleting the Z-branch. Strictly
speaking B1 is not Bayesian network due to the parameterization it inherits
from M : instead of probability tables P (X) and P (Z|X), we have table P
(X, Z). But P (X) and P (Z|X) can readily be obtained from P (X, Z). With
this in mind, we view B1 as a Bayesian network. This network is obviously
tree-structured. It's leaf variables include those in the set O and the variable
Z. We define M1 to be the HLC model that share the same structure as B1 and
where the variable Z and all the variables in O are observed. The parameters
of M1 are ~t,(0) and ~t,(1).

Similarly let B2 be the Bayesian network model obtained from M by deleting
the X- branch. It is a tree-structure and its leaf variables include those
in Y and the variable X. We define M2 to be the HLC model that share the
same structure as B2 and where the variable X and all the variables in Y
are observed. The parameters of M2 are ~t,(0) and ~t,(2).

Theorem 1 Suppose M is a regular HLC model that contains two or more latent
nodes. Then the two HLC models M1 and M2 defined in the text are also regular.
Moreover,

de(M ) = de(M1)+de(M2)-[ds(M1)+ds(M2)-ds(M )]. (2) In words, the effective
dimension of M equals the sum of the effective dimensions of M1 and M2 minus
the number of common parameters that M1 and M2 share.

To appreciate the significance of this theorem, consider the task of computing
the effective dimension of a regular HLC model that contains two or more
latent nodes. By

8

Effective Dimensions of HLC Models repeatedly applying the theorem, we can
reduce the task into subtasks of calculating effective dimensions of LC models.
As an example, consider the HLC model depicted by the picture on the left
in Figure 3. Theorem 1 allows us to, for the purpose of computing its effective
dimension, decompose the HLC model into five LC models, which are shown on
the right in Figure 3.

How might one compute the effective dimension of an LC model? One way is
to use the algorithm suggested by Geiger et al. (1996). The algorithm first
symbolically computes the Jacobian matrix, which is possible due to Assumption
1. Then it randomly assigns values to the parameters, resulting a numerical
matrix. The rank of the numerical matrix is computed by diagonalization.
Because the rank of Jacobian matrix equals the effective dimension of the
LC model almost everywhere, we get the regular rank with probability one.
This algorithm has recently been implemented by Rusakov and Geiger (2003).
Kocka and Zhang (2002) suggest an alternative algorithm that computes an
upper bound. The algorithm is fast and has been empirically shown to produce
extremely tight bounds.

Going back to our example, the effective dimension of the LC models for X1,
X2, X3, X4 and X5 are 26, 23, 23, 34 and 17 respectively. Thus the effective
dimension of the HLC model in Figure 3 is 26+23+34+23+17-(5*3-1)-(3*6-1)-(6*3-1)-(3*5-1)
= 61. In contrast, the standard dimension of the model is 5+6*2+6*2+6*2+3*4+5*5+5+3*4+5*2+5
= 110.

5. Proof of Main Result This section is devoted to the proof of Theorem 1.
We begin with some properties of Jacobian matrices of Bayesian network models.

5.1 Properties of Jacobian Matrices Consider the Jacobian matrix JM of a
Bayesian network model M . It is a matrix parameterized by the parameters
~t, of M . Let v1, v2, . . . , vm be column vectors of JM .

Lemma 1 A number of column vectors v1, v2, . . . , vm of the Jacobian matrix
JM are either linearly dependent everywhere or linearly independent almost
everywhere. They are linearly dependent everywhere if and only if there exists
at least one column vector vj that can be expressed as a linear combination
of other column vectors everywhere.

Proof: Consider diagonalizing the following transposed matrix:

[v1, v2, . . . , vm]T . According to Assumption 1, elements of the matrix
are polynomials (of ~t,). Hence we would multiply rows with polynomials or
fraction of polynomials. Of course, we need also to add one row to another
row. At the end of the process, we get a diagonal matrix whose nonzero elements
are polynomials or fractions of polynomials. Suppose there are k nonzero
rows and suppose they correspond to v1, v2, . . . , vk.

Because elements of the diagonalized matrix are polynomials or fractions
of polynomials, they are well-defined 2 and nonzero almost everywhere (i.e.
for almost all values of ~t,). If k=m, then the m vectors are linearly independent
of each other almost everywhere.

2. A fraction is not well defined if the denominator is zero.

9

Zhang & Ko^cka If k=k0. Suppose de(M1)=k0+r. Without loss of generality, suppose the
basis vectors are

@P (O, Z)

@t,(0)1 , . . . ,

@P (O, Z)

@t,(0)k0 ;

@P (O, Z)

@t,(1)1 , . . . ,

@P (O, Z)

@t,(1)r . (4)

By symmetry, we can assume that de(M2)=k0+s where s>=0 and that the following
column vectors form a basis for JM2 almost everywhere:

@P (X, Y)

@t,(0)1 , . . . ,

@P (X, Y)

@t,(0)k0 ;

@P (X, Y)

@t,(2)1 , . . . ,

@P (X, Y)

@t,(2)s . (5)

Now consider the following list of vectors in JM :

@P (O, Y)

@t,(0)1 , . . . ,

@P (O, Y)

@t,(0)k0 ;

@P (O, Y)

@t,(1)1 , . . . ,

@P (O, Y)

@t,(1)r ;

@P (O, Y)

@t,(2)1 , . . . ,

@P (O, Y)

@t,(2)s . (6)

We will show

Claim 2: All column vectors of JM linearly depend on the vectors listed in
(6) everywhere.

Claim 3: The vectors listed in (6) are linearly independent almost everywhere.
Those two claims imply that the vectors listed in (6) form a basis of the
column space of JM almost everywhere. Therefore

de(M ) = k0+r+s = de(M1)+de(M2)-k0. It is clear that k0=ds(M1)+ds(M2)-ds(M
). Therefore Theorem 1 is proved. 2

5.3 Proof of Claim 1 Lemma 3 Let Z be a latent node in an HLC model M and
Y be the set of the observed nodes in the subtree rooted at Z. If M is regular,
then we can set conditional distributions of nodes in the subtree in such
a way that they encode an injective mapping ! from Omega Z to Omega Y in
the sense that P (Y=!(z)|Z=z) = 1 for all z 2 Omega Z .

11

Zhang & Ko^cka Proof: We prove this lemma by induction on the number of latent
nodes in the subtree rooted at Z. First consider the case when there is only
one latent node, namely Z. In this case, Z is the parent of all nodes in
Y. Enumerate all these nodes as Y1, Y2, . . . , Yk. Because M is regular,
we have |Z| <= Qki=1 |Yi|. Hence we can define an injective mapping ! from
Omega Z to Omega Y = Qki=1 Omega Yi. For each state z of Z, !(z) can be
written as y = (y1, y2, . . . , yk), where yi is a state of Yi. Now if we
set

P (Yi=yi|Z=z) = 1, then P (Y=!(z)|Z=z) = 1.

Now consider the case when there are at least two hidden nodes in the subtree
rooted at Z. Let W be one such latent node that has no latent node descendants.
Let Y(1) be the set of observed nodes in the subtree rooted at W and Y(2)=YY(1).
By the induction hypothesis, we can parameterize the subtree rooted at W
in such a way that it encodes an injective mapping from Omega W to Omega
Y(1). Moreover, if all nodes below W are removed from M , M remains a regular
HLC model. In that model, we can parameterize the subtree rooted at Z in
such a way that it encodes an injective mapping from Omega Z to Omega (W,Y(2))
= Omega W * Omega Y(2). Together, those two facts prove the lemma. 2

Corollary 1 Let Z be a latent node in an HLC model M . Suppose Z have a latent
neighbor X. Let Y be the set of the observed nodes separated from X by Z.
If M is regular, then we can set probability distributions of nodes separated
from X by Z in such a way that they encode an injective mapping ! from Omega
Z to Omega Y in the sense that P (Y=!(z)|Z=z) = 1 for all z 2 Omega Z .

Proof: The corollary follows readily from Lemma 3 and the property of the
root-walking operation (Zhang, 2002). 2

Proof of Claim 1: Consider the following matrix

[ @P (X, Z)@t,(0)

1

. . . , @P (X, Z)@t,(0)

k0

] (7)

Because t,(0)1 , t,(0)2 , . . . , t,(0)k0 are the parameters for the joint
distribution P (X, Z), this matrix is the identity matrix if the rows are
properly arranged. So its column vectors are linearly independent almost
everywhere.

Now consider the first k0 column vectors of JM : @P (O, Y)/@t,(0)1 , . .
. , @P (O, Y)/@t,(0)k0 . They must be linearly independent almost everywhere.
If not, one of the vectors, say

@P (O, Y)/@t,(0)k0 , would linearly depend on the rest everywhere according
to Lemma 1. Observe that for any i (1<=i<=k0),

@P (O, Y)

@t,(0)i = XX,Z P (O|X)P (Y|Z)

@P (X, Z)

@t,(0)i .

Choose P (O|X) and P (Y|Z) as in Corollary 1. The vector @P (O, Y)/@t,(0)i
might contain zero elements. If we remove the zero elements, what remains
of the vector is identical to @P (X, Z)/@t,(0)i . So we can conclude that
@P (X, Z)/@t,(0)k0 linearly depends on

12

Effective Dimensions of HLC Models @P (X, Z)/@t,(0)1 . . . , @P (X, Z)/@t,(0)k0-1
everywhere, which contradicts the conclusion of the previous paragraph. Hence
the first k0 vectors of JM must be linearly independent almost everywhere.

It is evident that, using similar arguments, we can also show that the first
k0 vectors of JM1 (JM2) are linearly independent almost everywhere. Claim
1 is therefore proved. 2

5.4 Proof of Claim 2 Every column vector of JM1 linearly depends on vectors
listed in (4) everywhere. Observe that

@P (O, Y)

@t,(0)i = XZ P (Y|Z)

@P (O, Z)

@t,(0)i , i = 1, . . . , k

0

@P (O, Y)

@t,(1)i = XZ P (Y|Z)

@P (O, Z)

@t,(1)i , i = 1, . . . , k

1.

Therefore every column vector of JM that corresponds to vectors in JM1 linearly
depends on the first k0+r vectors listed in (6) everywhere.

By symmetry, every column vector of JM that corresponds to vectors in JM2
linearly depends on the first k0 and the last s vectors listed in (6) everywhere.
The claim is proved. 2

5.5 Proof of Claim 3 We prove this claim by contradiction. Assume the vectors
listed in (6) were not linearly independent almost everywhere. According
to Lemma 1, one of them, say v, must linearly depend on the rest everywhere.
Because of Claim 1 and Lemma 2, we can assume that v is

among the last r+s vectors. Without loss of generality, we assume that v
is @P (O, Y)/@t,(2)s . Then for any value of ~t,, there exist real numbers
ci (1<=i<=k0), c(1)i (1<=i<=r), and c(2)i (1<=i<=s-1) such that

@P (O, Y)

@t,(2)s =

k0X i=1

ci @P (O, Y)@t,(0)

i

+

rX

i=1

c(1)i @P (O, Y)@t,(1)

i

+

s-1X

i=1

c(2)i @P (O, Y)@t,(2)

i

.

Note that in the last term on the right hand side, i runs from 1 to s-1.

The parameter vector ~t, consists of three subvectors ~t,(0), ~t,(1) and
~t,(2). Set the parameters~ t,(1) (for the X-branch) as in Lemma 3. Then
there exists an injective mapping ! from Omega X to Omega O such that

P (O=!(x)|X=x) = 1 for all x 2 Omega X . (8) For each of the vectors in
(6), consider the subvector consisting only of elements for those states
of O that are the images of states of X under the mapping !. Such subvectors

will be denoted by @P (OX , Y)/@t,(0)i , @P (OX , Y)/@t,(1)i , and @P (OX
, Y)/@t,(2)i . For any values of ~t,(0) and ~t,(2), we still have

13

Zhang & Ko^cka @P (OX , Y)

@t,(2)s =

k0X i=1

ci @P (OX , Y)@t,(0)

i

+

rX

i=1

c(1)i @P (OX , Y)@t,(1)

i

+

s-1X

i=1

c(2)i @P (OX , Y)@t,(2)

i

. (9)

Consider the first two terms on the right hand side:

k0X

i=1

ci @P (OX , Y)@t,(0)

i

+

rX

i=1

c(1)i @P (OX , Y)@t,(1)

i

=

k0X

i=1

ci X

Z

P (Y|Z) @P (OX , Z)@t,(0)

i

+

rX

i=1

c(1)i X

Z

P (Y|Z) @P (OX , Z)@t,(1)

i

= X

Z

P (Y|Z){

k0X

i=1

ci @P (OX , Z)@t,(0)

i

+

rX

i=1

c(1)i @P (OX, Z)@t,(1)

i }

Because of (8) and the fact that P (O, Z) = PX P (X, Z)P (O|X), the column
vector @P (OX , Z)/@t,(0)i is identical to the vector @P (X, Z)/@t,(0)i .
As we have argued when proving Claim 1, the vectors {@P (X, Z)/@t,(0)i |i=1,
. . . , k0} constitute a basis for the k0-dimensional Euclidian space. This
implies that, each of the vectors @P (OX , Z)/@t,(1)i can be represented
as a linear combination of the vectors {@P (OX , Z)/@t,(0)i |i = 1, . . .
, k0}. Consequently, there exist c0i (1<=i<=k0) such that

k0X i=1

ci @P (OX , Z)@t,(0)

i

+

rX

i=1

c(1)i @P (OX , Z)@t,(1)

i

=

k0X

i=1

c0i @P (OX , Z)@t,(0)

i

Hence

k0X i=1

ci @P (OX , Y)@t,(0)

i

+

rX

i=1

c(1)i @P (OX, Y)@t,(1)

i

=

k0X

i=1

c0i @P (OX, Y)@t,(0)

i

Combining this equation with equation (9), we get

@P (OX , Y)

@t,(2)s =

k0X i=1

c0i @P (OX , Y)@t,(0)

i

+

s-1X

i=1

c(2)i @P (OX, Y)@t,(2)

i

.

Because of (8) and the fact that the fact that P (O, Y) = PX P (X, Y)P (O|X),
the column vector @P (OX , Y)/@t,(1)i is identical to the vector @P (X, Y)/@t,(1)i
and the column vector @P (OX , Y)/@t,(2)i is identical to the vector @P (X,
Y)/@t,(2)i . Hence

@P (X, Y)

@t,(2)s =

k0X i=1

c0i @P (X, Y)@t,(0)

i

+

s-1X

i=1

c(2)i @P (X, Y)@t,(2)

i

.

This contradicts the fact that the vectors in the equation form a basis for
the column space of JM2 almost everywhere (see (5) in Section 5.2) Therefore,
Claim 3 must be true. 2

14

Effective Dimensions of HLC Models 6. Effective Dimensions of Trees Let us
use the term tree model to refer to Markov random fields on undirected trees
over a finite number of random variables. If we root a tree model at any
of its nodes, we get a tree-structured Bayesian network model. In a tree
model, define leaf nodes be those that have only one neighbor. An HLC model
is a tree model where all leaf nodes are observed while all others are latent.

It turns out that Theorem 1 enables us to compute the effective dimension
of any tree model. Consider an arbitrary tree model. If some of its leaf
nodes are latent, we can remove such nodes without affecting its effective
dimension.

After removing latent leaf nodes, all the leaf nodes are observed. If some
non-leaf nodes are also observed, we can decompose the model into submodels
at any observed non-leaf node. The following theorem tells us how the model
and the submodels are related in terms of effective dimensions.

Theorem 2 Suppose Y is an observed non-leaf node in a tree model M . If M
decomposes at Y into k submodels M1, . . . , Mk, then

de(M ) =

kX

i=1

de(Mi) - (k - 1)(|Y | - 1).

After all possible decompositions, the final submodels either do not contain
latent nodes or are HLC models. Effective dimensions of submodels with no
latent variables are simply their standard dimensions. If an HLC submodel
is irregular, we make it regular by applying the transformation mentioned
at the end of Section 3.2. The transformation does not affect the effective
dimensions of the submodels. Finally, effective dimensions of regular HLC
submodels can be computed using Theorem 1.

Proof of Theorem 2: It is possible to prove this theorem starting from the
Jacobian matrix. Here we take a less formal but more revealing approach.

It suffices to consider case of k being 2. The two submodels M1 and M2 share
only one node, namely Y . Let O1 and O2 be respectively the sets of observed
nodes in those two submodels excluding Y . Root M at Y . Then we have

P (Y, O1, O2)P (Y ) = P (O1, Y )P (O2, Y ). Let ~t,0 be the set of parameters
in the distribution P (Y ), ~t,1 and ~t,2 be respectively the sets of parameters
in the conditional probability distributions of nodes in M1 and M2. Consider
fixing ~t,0 and letting ~t,1 and ~t,2 vary. In this case, the space spanned
by P (Y ) consists of only one vector, namely ~t,0 itself. Moreover, there
is a one-to-one correspondence between vectors in the space spanned by P
(Y, O1, O2) and vectors in the Cartesian product of the spaces spanned by
P (O1, Y ) and P (O2, Y ). Now let ~t,0 vary. This adds |Y |-1 dimensions
to each of the four spaces spanned by P (Y, O1, O2), P (Y ), P (O1, Y ),
and P (O2, Y ). Consequently, we have

de(M ) = de(M1) + de(M2) - (|Y | - 1).

The theorem is proved. 2

15

Zhang & Ko^cka 7. Concluding Remarks In this paper we study the effective
dimensions of HLC models. The work is motivated by empirical evidence that
the BIC behaves quite well when used with several hill-climbing algorithms
for learning HLC models and that the BICe score sometimes leads to better
model selection than the BIC score. We have proved a theorem that relates
the effective dimension of an HLC model to the effective dimensions of two
other HLC models that contain fewer latent variables. Repeated application
of the theorem allows one to reduce the task of computing the effective dimension
of an HLC model to subtasks of computing effective dimensions of LC models.
This makes it computationally feasible to compute the effective dimensions
of large HLC models. In addition, we have proved a theorem about effective
dimensions of general tree models. This and our main theorem allows one to
compute the effective dimension of arbitrary tree models.

Acknowledgements This work was initiated when the authors were visiting Department
of Computer Science, Aalborg University, Denmark. We thank Poul S. Eriksen,
Finn V. Jensen, Jiri Vomlel, Marta Vomlelova, Thomas D. Nielsen, Olav Bangso,
Jose Pena, Kristian G. Olesen. We are also grateful to the annonymous reviewers
whose comments have helped us greatly in improving this paper. Research on
this paper was partially supported by GA CR grant 201/02/1269 and by Hong
Kong Research Grant Council under grant HKUST6088/01E.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE
Trans. Autom.

Contr., 19, 716-723.

Bartholomew, D. J. and Knott, M. (1999). Latent variable models and factor
analysis, 2nd

edition. Kendall's Library of Statistics 7. London: Arnold.

Cheeseman, P. and Stutz, J. (1995). Bayesian classification (AutoClass):
Theory and

results. In Fayyad, U., Piatesky-Shaoiro, G., Smyth, P., and Uthurusamy,
R. (eds.), Advancesin Knowledge Discovery and Data Mining, AAAI Press, Menlo
Park, CA.

Chickering D. M. and Heckerman D. (1997). Efficient Approximations for the
Marginal

Likelihood of Bayesian Networks with Hidden variables. Machine Learning,
29, 181- 212.

Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999).
Probabilistic

networks and expert systems, Springer.

Kocka, T. and Zhang, N. L. (2002). Dimension correction for hierarchical
latent class

models. in Proc. of the 18th Conference on Uncertainty in Artificial Intelligence
(UAI-02).

Geiger D., Heckerman D. and Meek C. (1996). Asymptotic model selection for
directed

networks with hidden variables. In Proc. of the 12th Conference on Uncertainty
in Artificial Intelligence, 283-290.

16

Effective Dimensions of HLC Models Goodman, L. A. (1974). Exploratory latent
structure analysis using both identifiable and

unidentifiable models. Biometrika, 61, 215-231.

Lazarsfeld, P. F., and Henry, N.W. (1968). Latent structure analysis. Boston:
Houghton

Mifflin.

Rusakov, D. and Geiger, D. (2002). Asymptotic model selection for Naive Bayesian

networks. UAI-02.

Rusakov, D. and Geiger, D. (2003). Automated analytic asymptotic evaluation
of marginal

likelihood for latent models. UAI-03.

Schwarz G. (1978). Estimating the dimension of a model. In Annals of Statistics,
6, 461-464.

Settimi, R. and Smith, J.Q. (1998). On the geometry of Bayesian graphical
models with

hidden variables. In Proceedings of the Fourteenth Conference on Uncertainty
in Artificial Intelligence, Morgan Kaufmann Publishers, S. Francisco, CA,
472-479.

Settimi, R. and Smith, J.Q. (1999). Geometry, moments and Bayesian networks
with hidden

variables. In Proceedings of the Seventh International Workshop on Artificial
Intelligence and Statistics, Fort Lauderdale, Florida (3-6 January 1999),
Morgan Kaufmann Publishers, S. Francisco, CA.

Zhang N. L. (2002). Hierarchical latent class models for cluster analysis.
AAAI-02, 230-237.

Zhang, N. L., Kocka, T., Karciauskas, G., and Jensen, F. V. (2003). Learning
hierarchical

latent class models. Technical Report HKUST-CS03-01, Department of Computer
Science, Hong Kong University of Science and Technology.

Zhang, N. L. (2003). Structural EM for Hierarchical Latent Class Models.
Technical

Report HKUST-CS03-06, Department of Computer Science, Hong Kong University
of Science and Technology.

Zhang, N. L. (2003b). Hierarchical latent class models for cluster analysis.
Journal of

Machine Learning Research, to appear.

17