N. L. Zhang and T. Kocka
Volume 21, 2004
Links to Full Text:Journal of Artificial Intelligence Research 21 (2004) 1-17 Submitted 7/03; published 1/04 Effective Dimensions of Hierarchical Latent Class Models Nevin L. Zhang lzhang@cs.ust.hkDepartment of Computer Science Hong Kong University of Science and Technology, China Tom'a^s Ko^cka kocka@lisp.vse.cz Laboratory for Intelligent Systems Prague Prague University of Economics, Czech Republic Abstract Hierarchical latent class (HLC) models are tree-structured Bayesian networks whereleaf nodes are observed while internal nodes are latent. There are no theoretically well justified model selection criteria for HLC models in particular and Bayesian networks withlatent nodes in general. Nonetheless, empirical studies suggest that the BIC score is a reasonable criterion to use in practice for learning HLC models. Empirical studies alsosuggest that sometimes model selection can be improved if standard model dimension is replaced with effective model dimension in the penalty term of the BIC score. Effective dimensions are difficult to compute. In this paper, we prove a theorem thatrelates the effective dimension of an HLC model to the effective dimensions of a number of latent class models. The theorem makes it computationally feasible to compute theeffective dimensions of large HLC models. The theorem can also be used to compute the effective dimensions of general tree models. 1. Introduction Hierarchical latent class (HLC) models (Zhang, 2002) are tree-structured Bayesian networks (BNs) where leaf nodes are observed while internal nodes are latent. They generalize latent class models (Lazarsfeld and Henry, 1968) and were first identified as a potentially useful class of Bayesian networks by Pearl (1988). We are concerned with learning HLC models from data. A fundamental question is how to select among competing models. The BIC score (Schwarz, 1978) is a popular metric that researchers use to select among Bayesian network models. It consists of a loglikelihood term that measures the fitness to data and a penalty term that depends linearly upon standard model dimension, i.e. the number of linearly independent standard model parameters. When all variables are observed, the BIC score is an asymptotic approximation of (the logarithm) of the marginal likelihood (Schwarz, 1978). It is also consistent in the sense that, given sufficient data, the BIC score of the generative model -- the model from which data were sampled -- is larger than those of any other models that are not equivalent to the generative model. When latent variables are present, the BIC score is no longer an asymptotic approximation of the marginal likelihood (Geiger et al., 1996). This can be remedied, to some extent, using the concept of effective model dimension. In fact if we replace standard model dimension with effective model dimension in the BIC score, the resulting scoring function, called the BICe score, is an asymptotic approximation of the marginal likelihood almost everywhere except for some singular points (Rusakov and Geiger, 2002). cr^2004 AI Access Foundation. All rights reserved. Zhang & Ko^cka Neither BIC nor BICe have been proved to be consistent for latent variable models. As a matter of fact, it has not even been defined what it means for a model selection criterion to be consistent for latent variable models. Empirical studies suggest that the BIC score is well-behaved in practice for the task of learning HLC models. There are three related searchbased algorithms for learning HLC models, namely double hill-climbing (DHC) (Zhang, 2002), single hill-climbing (SHC) (Zhang et al., 2003), and heuristic SHC (HSHC) (Zhang, 2003). In the absence of a theoretically well justified model selection criterion, Zhang (2002) tested DHC with four existing scoring functions, namely the AIC score (Akaike, 1974), the BIC score, the Cheeseman-Stutz (CS) score (Cheeseman and Stutz, 1995), and the holdout logarithmic score (HLS)(Cowell et al., 1999). Both real-world and synthetic data were used. On the real-world data, BIC and CS have enabled DHC to find models that are regarded as the best by domain experts. On the synthetic data, BIC and CS have enabled DHC to find models that either are identical to or resemble closely the true generative models. When coupled with AIC and HLS, on the other hand, DHC performed significantly worse. SHC and HSHC were tested on synthetic data sampled from fairly large HLC models (as much as 28 nodes). Only BIC was used in those tests. In all cases, BIC has enabled SHC and HSHC to find models that either are identical to or resemble closely the true generative models. Those empirical results not only indicate that the algorithms perform well, but also suggest that the BIC is a reasonable scoring function to use for learning HLC models. The experiments also reveal that model selection can sometimes be improved if the BICe score is used instead of the BIC score. We will explain this in detail in Section 3 In order to use the BICe score in practice, we need a way to compute effective dimensions. This is not a trivial task. The effective dimension of an HLC model is the rank of the Jacobian matrix of the mapping from the parameters of the model to the parameters of the joint distribution of the observed variables. The number of rows in the Jacobian matrix increases exponentially with the number of observed variables. The construction of the Jacobian matrix and the calculation of its rank are both computationally demanding. Moreover they have to be done algebraically or with very high numerical precision to avoid degenerate cases. The necessary precision grows with the size of the matrix. Settimi and Smith (1998, 1999) studied effective dimensions for two classes of models: trees with binary variables and latent class (LC) models with two observed variables. They have obtained a complete characterization of these two classes. Geiger et al. (1996) computed the effective dimensions of a number of models. They conjectured that it is rare for the effective and standard dimensions of an LC model to differ. As a matter of fact, they found only one such model. Kocka and Zhang (2002) found quite a number of LC models whose effective and standard dimensions differ. They also proposed an easily computable formula for estimating effective dimensions of LC models. The estimation formula has been empirically shown to be very accurate. In this paper, we prove a theorem that relates the effective dimension of an HLC model to the effective dimensions of two other HLC models that contain fewer latent variables. Repeated application of the theorem allows one to reduce the task of computing the effective dimension of an HLC model to subtasks of computing effective dimensions of LC models. This makes it computationally feasible to compute the effective dimensions of large HLC models. 2 Effective Dimensions of HLC Models We start in Section 2 with a formal definition of effective dimensions for Bayesian networks with latent variables. In Section 3, we provide empirical evidence that suggest the use of BICe instead of BIC sometimes improves model selection. Section 4 presents the main theorem and Section 5 is devoted to the proof of the theorem. In Section 6, we prove a theorem about effective dimensions of general tree models and explain how this and our main theorem allows one to compute the effective dimension of arbitrary tree models. Finally, concluding remarks are provided in Section 7. 2. Effective Dimensions of Bayesian Networks In this paper, we use capital letters such as X and Y to denote variables and lower case letters such as x and y to denote states of variables. The domain and cardinality of a variable X will be denoted by Omega X and |X| respectively. Bold face capital letters such as Y denote sets of variables. Omega Y denotes the Cartesian product of the domains of all variables in the set Y. Elements of Omega Y will be denoted by bold lower case letters such as y and will sometimes be referred to as states of Y. We will consider only variables that have a finite number of states. Consider a Bayesian network model M that possibly contains latent variables. The standard dimension ds(M ) of M is the number of linearly independent parameters in the standard parameterization of M . The parameters denote, for each variable and each parent configuration of the variable, the probability that the variable is in some state (except one) given the parent configuration. Suppose M consist of k variables x1, x2, . . . , xk. Let ri and qi be respectively the number of states of xi and the number of all possible combinations of the states of its parents. If xi has no parent, let qi be 1. Then ds(M ) is given by ds(M ) = kX i=1 qi(ri - 1). For notational simplicity, denote the standard dimension of M by n. Let ~t,=(t,1, t,2, . . . , t,n) be a vector of n linearly independent model parameters of M . Further let Y be the set of observed variables. Suppose Y has m+1 possible states. We enumerate the first m states as y1, y1, . . . , ym. For any i (1<=i<=m), P (yi) is a function of the parameters ~t,. So we have a mapping from the n dimensional parameter space (a subspace of Rn) to Rm, namely T : (t,1, t,2, . . . , t,n) ` (P (y1), P (y2), . . . , P (ym)). The Jacobian matrix of this mapping is the following m*n matrix: JM (~t,) = [Jij] = [ @P (yi)@t, j ] For convenience, we will often write the matrix as JM = [ @P (Y)@t,j ], with the understanding that elements of the j-th column are obtained by allowing Y run over all its possible states except one. For each i, P (yi) is a function of ~t,. For most commonly used parameterizations of Bayesian networks, it is actually a polynomial function of ~t,. Hence we make the following assumption: 3 Zhang & Ko^cka Assumption 1 The Bayesian network M is so parameterized that the parameters for the joint distribution of the observed variables are polynomial functions of the parameters for M . An obvious consequence of the assumption is that elements of JM are also polynomial functions of ~t,. For a given value of ~t,, JM is a matrix of real numbers. Due to Assumption 1, the rank of this matrix is some constant d almost everywhere in the parameter space (Geiger et al., 1996. Also see Section 5.1.). To be more specific, the rank is d everywhere except in a set of measure zero where it is smaller than d. The constant is called the regular rank of JM . The regular rank of JM is also called the effective dimension of the Bayesian network model M . Hence we denote it by de(M ). To understand the term "effective dimension", consider the subspace of Rm spanned by the joint probability P (Y) of observed variables, or equivalently the range of the mapping T . The term reflects the fact that, for almost every value of ~t,, a small enough open ball around T (~t,) resembles Euclidean space of dimension d (Geiger et al., 1996). There are multiple ways to parameterize a given Bayesian network model. However, the choice of parameterization does not affect the space spanned by the joint probability P (Y). Together with the interpretation of the previous paragraph, this implies that the definition of effective dimension does not depend on the particular parameterization that one uses. 3. Selecting among HLC Models A hierarchical latent class (HLC) model is a Bayesian network where (1) the network structure is a rooted tree and (2) the variables at the leaf nodes are observed and all the other variables are not. The observed variables are sometimes referred to as manifest variables and all the other variables as latent variables. Figure 1 shows the structures of two HLC models. A latent class (LC) model is an HLC model where there is only one latent variable. The theme of this paper is the computation of effective dimensions of HLC models. As mentioned in the introduction, this is interesting because effective dimension, when used in the BIC score, gives us a better approximation of the marginal likelihood. In this section, we give an example to illustrate that the use of effective dimension sometimes also leads to better model selection. We will also motivate and introduce the concept of regularity that will be used in subsequent sections. 3.1 An Example of Model Selection Consider the two HLC models shown in Figure 1. In one experiment, we instantiated the parameters of M1 in a random fashion and sampled a set D1 of 10,000 data records on the observed variables. Then we ran SHC and HSHC on the data set D1 under the guidance of the BIC score. Both algorithms produced model M2. In the following, we explain why, based on D1, one would prefer M2 over M1 if BIC is used for model selection and why M1 would be preferred if BICe is used instead. We argue that M1 should be preferred based on D1 and hence BICe is a better scoring metric for this case. 4 Effective Dimensions of HLC Models X1 X2 Y1 Y2 Y3 X3 Y4 Y5 Y6 M1 X2 Y1 Y2 Y3 X3 Y4 Y5 Y6 M2 Figure 1: Two HLC models. The shaded variables are latent, while the other variables are observed. The cardinality of X1 is 2, while cardinalities of all other variables are 3. The BIC and BICe scores of a model M given a data set D are defined as follows: BIC(M |D) = logP (D|M, ~t,*) - ds(M )2 logN, BICe(M |D) = logP (D|M, ~t,*) - de(M )2 logN where ~t,* is the maximum likelihood estimate of the parameters of M based on D and N is the sample size. In our example, notice that M2 includes M1 in the sense that M2 can represent any probability distributions of the observed variables that M1 can. In fact, if we make the conditional probability distributions of the observed variables in M2 the same as in M1 and set PM2(X2) and PM2(X3|X2) such that PM2(X2)PM2(X3|X2) = X X1 PM1(X1)PM1(X2|X1)PM1(X3|X1), then the probability distribution of the observed variables in the two models are identical. Because M2 includes M1, we have logP (D1|M1, ~t,*1) <= logP (D1|M2, ~t,*2). Together with the fact that D1 is sampled from M1, this implies that logP (D1|M1, ~t,*1) ij logP (D1|M2, ~t,*2) for sufficiently large enough sample size. The standard dimension of M1 is 45, while that of M2 is 44. Hence BIC(M1|D1) < BIC(M2|D1). On the other hand, the effective dimensions of M1 and M2 are 43 and 44 respectively. Hence BICe(M1|D1) > BICe(M2|D1). Model M2 includes M1. The opposite is clearly not true because the effective dimension of M1 is smaller than that of M2. So, M2 is in reality a more complex model than M1. Both model fit data D1 equally well. Hence the simpler one, i.e. M1, should be preferred over the other. This agrees with the choice of the BICe score, while disagrees with the choice of the BIC score. Hence, BICe is more appropriate than BIC in this case. 5 Zhang & Ko^cka 3.2 Regularity Now consider another model M 01 that is the same as M1 except that the cardinality of X1 is increased from 2 to 3. It is easy to show that M2 includes M 01 and vice versa. So, the two models are equivalent in terms of their capabilities of representing probability distributions of the observed variables. They are hence said to be marginally equivalent. However, M 01 has more standard parameters than M2 and hence we would always prefer M2 over M 01. To formalize this consideration, we introduce a concept of regularity. For a latent variable Z in an HLC model, enumerate its neighbors (parent and children) as X1, X2, . . . , Xk. An HLC model is regular if for any latent variable Z, |Z| <= Q ki=1 |Xi| maxki=1 |Xi| , (1) and the strict inequality holds when Z has two neighbors and at least one of them is a latent node. Models M1 and M2 are regular, while model M 01 is not. For any irregular model M there always exists a regular model that is marginally equivalent to M and has fewer standard parameters (Zhang, 2003b). The regular model can be obtained from M as follows: For any latent node that has only two neighbors and its cardinality is no smaller than that of one of the neighbors, then remove the latent node and connect the two neighbors. For any latent node that has more than two neighbors and that violates (1), reduce it's cardinality to the quantity on the right hand side. Repeat both steps until no more changes can be made. It is also interesting to note that the collection of all regular HLC models for a given set of observed variables is finite (Zhang, 2002). This provides a finite search space for the task of learning regular HLC models.1 In the rest of this paper, we will consider only regular HLC models. Before ending this subsection, we point out a nice property of effective model dimension in relation to model inclusion. If an HLC model includes another model, then its effective dimension is no less than that of the latter. As a consequence, two marginally equivalent models have the same effective dimensions and hence the same BICe score. The same is not true for standard model dimension and the BIC score. 3.3 The CS and CSe Scores We have argued on empirical grounds that the BIC score is a reasonable scoring function to use for learning HLC models and that the BICe score can sometimes improve model selection. But the two scores are not free of problems. One problem is that their derivation as Laplace approximations of the marginal likelihood are not valid at the boundary of the parameter space. The CS score in a way alleviates this problem. It involves the BIC score based on completed data and the BIC score based on original data. In other words, it involves two Laplace approximations of the marginal likelihood. It lets errors in the two approximation cancel each other. Chickering and Heckerman (1997) empirically found the CS score to be a quite accurate approximation of the marginal likelihood and robust at the boundary of the parameter 1. The definition of regularity given in this paper is slightly different from the one given in Zhang (2002). Nonetheless, the two conclusions mentioned in this paragraph remain true. 6 Effective Dimensions of HLC Models Y 21 MMM X Z YO ZXX Z O Figure 2: Problem reduction. space. They realized the need for effective model dimension in the CS score, although they did not actually use it. This would not have made any differences to their experiments because, for the models they used, the standard and effective dimensions agree. We use CSe to refer to the scoring function one obtains by replacing standard model dimension in the CS score with effective model dimensions. Just as BICe is better than BIC as approximations of the marginal likelihood (Geiger et al., 1996), CSe is better than CS. To compute CSe, we also need to calculate effective dimensions. 4. Effective Dimensions of HLC Models As we have seen, effective model dimension is interesting for a number of reasons. Our main result in this paper is a theorem about the effective dimension de(M ) of a regular HLC model M that contains more than one latent variable. Let X be the root of M , which is a latent node. Because there are at least two latent nodes, there must exist another latent node Z that is a child of X. In the following, we will use the terms X-branch and Z-branch to respectively refer to the sets of nodes that are separated from Z by X or from X by Z. Let Y be the set of observed variables in the Z-branch and let O be the set of all other observed variables. Note that the X-branch doesn't contain the node X. The relationship among X, Z, Y, and O is depicted in the left-most picture of Figure 2. The standard parameterization of M includes parameters for P (X) and parameters for P (Z|X). For convenience, we replace those parameters with parameters for P (X, Z). As mentioned at the end of Section 2, such reparameterization does not affect the effective dimension de(M ). To reflect the reparameterization, the edge between X and Z is not directed in Figure 2. Suppose P (X, Z) has k0 parameters t,(0)1 , t,(0)2 , . . . , t,(0)k0 . Suppose the conditional distributions of variables in the X-branch consists of k1 parameters t,(1)1 , t,(1)2 , . . . , t,(1)k1 and the conditional distributions of variables in the Z-branch consists of k2 parameters t,(2)1 , t,(2)2 , . . . , t,(2)k2 . For convenience we will sometimes refer to those three groups of parameters using three vectors ~t,(0), ~t,(1) and ~t,(2) respectively. In the following, we will define two other HLC models M1 and M2 starting from M and establish a relationship between their effective dimensions and the effective dimension of M . In this context, M , M1, and M2 are regarded purely as Mathematical objects. The semantics of their variables are of no concern. In particular, a variable H that is latent 7 Zhang & Ko^cka X1(6) X2(3) X4(5) Y1(3) Y2(2) Y3(3) X3(3) X5(5) Y4(2) Y5(6) X1 X2 Y3 X3 X2 X4 X1 X3 X1 X5 X4 Y1 Y2 X2 X5 X3 Y4 Y5 Figure 3: The picture on the left shows an HLC model with five observed and five latent variables, each variable is annotated by its name and its cardinality. The picture on the right shows the components we can decompose the HLC model into by applying Theorem 1. Latent variables are shaded, while observed variables are not. in M might be designated to be observed in M1 or M2 as part of the definition of those Mathematical objects. We obtain a Bayesian network model B1 from M by deleting the Z-branch. Strictly speaking B1 is not Bayesian network due to the parameterization it inherits from M : instead of probability tables P (X) and P (Z|X), we have table P (X, Z). But P (X) and P (Z|X) can readily be obtained from P (X, Z). With this in mind, we view B1 as a Bayesian network. This network is obviously tree-structured. It's leaf variables include those in the set O and the variable Z. We define M1 to be the HLC model that share the same structure as B1 and where the variable Z and all the variables in O are observed. The parameters of M1 are ~t,(0) and ~t,(1). Similarly let B2 be the Bayesian network model obtained from M by deleting the X- branch. It is a tree-structure and its leaf variables include those in Y and the variable X. We define M2 to be the HLC model that share the same structure as B2 and where the variable X and all the variables in Y are observed. The parameters of M2 are ~t,(0) and ~t,(2). Theorem 1 Suppose M is a regular HLC model that contains two or more latent nodes. Then the two HLC models M1 and M2 defined in the text are also regular. Moreover, de(M ) = de(M1)+de(M2)-[ds(M1)+ds(M2)-ds(M )]. (2) In words, the effective dimension of M equals the sum of the effective dimensions of M1 and M2 minus the number of common parameters that M1 and M2 share. To appreciate the significance of this theorem, consider the task of computing the effective dimension of a regular HLC model that contains two or more latent nodes. By 8 Effective Dimensions of HLC Models repeatedly applying the theorem, we can reduce the task into subtasks of calculating effective dimensions of LC models. As an example, consider the HLC model depicted by the picture on the left in Figure 3. Theorem 1 allows us to, for the purpose of computing its effective dimension, decompose the HLC model into five LC models, which are shown on the right in Figure 3. How might one compute the effective dimension of an LC model? One way is to use the algorithm suggested by Geiger et al. (1996). The algorithm first symbolically computes the Jacobian matrix, which is possible due to Assumption 1. Then it randomly assigns values to the parameters, resulting a numerical matrix. The rank of the numerical matrix is computed by diagonalization. Because the rank of Jacobian matrix equals the effective dimension of the LC model almost everywhere, we get the regular rank with probability one. This algorithm has recently been implemented by Rusakov and Geiger (2003). Kocka and Zhang (2002) suggest an alternative algorithm that computes an upper bound. The algorithm is fast and has been empirically shown to produce extremely tight bounds. Going back to our example, the effective dimension of the LC models for X1, X2, X3, X4 and X5 are 26, 23, 23, 34 and 17 respectively. Thus the effective dimension of the HLC model in Figure 3 is 26+23+34+23+17-(5*3-1)-(3*6-1)-(6*3-1)-(3*5-1) = 61. In contrast, the standard dimension of the model is 5+6*2+6*2+6*2+3*4+5*5+5+3*4+5*2+5 = 110. 5. Proof of Main Result This section is devoted to the proof of Theorem 1. We begin with some properties of Jacobian matrices of Bayesian network models. 5.1 Properties of Jacobian Matrices Consider the Jacobian matrix JM of a Bayesian network model M . It is a matrix parameterized by the parameters ~t, of M . Let v1, v2, . . . , vm be column vectors of JM . Lemma 1 A number of column vectors v1, v2, . . . , vm of the Jacobian matrix JM are either linearly dependent everywhere or linearly independent almost everywhere. They are linearly dependent everywhere if and only if there exists at least one column vector vj that can be expressed as a linear combination of other column vectors everywhere. Proof: Consider diagonalizing the following transposed matrix: [v1, v2, . . . , vm]T . According to Assumption 1, elements of the matrix are polynomials (of ~t,). Hence we would multiply rows with polynomials or fraction of polynomials. Of course, we need also to add one row to another row. At the end of the process, we get a diagonal matrix whose nonzero elements are polynomials or fractions of polynomials. Suppose there are k nonzero rows and suppose they correspond to v1, v2, . . . , vk. Because elements of the diagonalized matrix are polynomials or fractions of polynomials, they are well-defined 2 and nonzero almost everywhere (i.e. for almost all values of ~t,). If k=m, then the m vectors are linearly independent of each other almost everywhere. 2. A fraction is not well defined if the denominator is zero. 9 Zhang & Ko^cka If k=k0. Suppose de(M1)=k0+r. Without loss of generality, suppose the basis vectors are @P (O, Z) @t,(0)1 , . . . , @P (O, Z) @t,(0)k0 ; @P (O, Z) @t,(1)1 , . . . , @P (O, Z) @t,(1)r . (4) By symmetry, we can assume that de(M2)=k0+s where s>=0 and that the following column vectors form a basis for JM2 almost everywhere: @P (X, Y) @t,(0)1 , . . . , @P (X, Y) @t,(0)k0 ; @P (X, Y) @t,(2)1 , . . . , @P (X, Y) @t,(2)s . (5) Now consider the following list of vectors in JM : @P (O, Y) @t,(0)1 , . . . , @P (O, Y) @t,(0)k0 ; @P (O, Y) @t,(1)1 , . . . , @P (O, Y) @t,(1)r ; @P (O, Y) @t,(2)1 , . . . , @P (O, Y) @t,(2)s . (6) We will show Claim 2: All column vectors of JM linearly depend on the vectors listed in (6) everywhere. Claim 3: The vectors listed in (6) are linearly independent almost everywhere. Those two claims imply that the vectors listed in (6) form a basis of the column space of JM almost everywhere. Therefore de(M ) = k0+r+s = de(M1)+de(M2)-k0. It is clear that k0=ds(M1)+ds(M2)-ds(M ). Therefore Theorem 1 is proved. 2 5.3 Proof of Claim 1 Lemma 3 Let Z be a latent node in an HLC model M and Y be the set of the observed nodes in the subtree rooted at Z. If M is regular, then we can set conditional distributions of nodes in the subtree in such a way that they encode an injective mapping ! from Omega Z to Omega Y in the sense that P (Y=!(z)|Z=z) = 1 for all z 2 Omega Z . 11 Zhang & Ko^cka Proof: We prove this lemma by induction on the number of latent nodes in the subtree rooted at Z. First consider the case when there is only one latent node, namely Z. In this case, Z is the parent of all nodes in Y. Enumerate all these nodes as Y1, Y2, . . . , Yk. Because M is regular, we have |Z| <= Qki=1 |Yi|. Hence we can define an injective mapping ! from Omega Z to Omega Y = Qki=1 Omega Yi. For each state z of Z, !(z) can be written as y = (y1, y2, . . . , yk), where yi is a state of Yi. Now if we set P (Yi=yi|Z=z) = 1, then P (Y=!(z)|Z=z) = 1. Now consider the case when there are at least two hidden nodes in the subtree rooted at Z. Let W be one such latent node that has no latent node descendants. Let Y(1) be the set of observed nodes in the subtree rooted at W and Y(2)=YY(1). By the induction hypothesis, we can parameterize the subtree rooted at W in such a way that it encodes an injective mapping from Omega W to Omega Y(1). Moreover, if all nodes below W are removed from M , M remains a regular HLC model. In that model, we can parameterize the subtree rooted at Z in such a way that it encodes an injective mapping from Omega Z to Omega (W,Y(2)) = Omega W * Omega Y(2). Together, those two facts prove the lemma. 2 Corollary 1 Let Z be a latent node in an HLC model M . Suppose Z have a latent neighbor X. Let Y be the set of the observed nodes separated from X by Z. If M is regular, then we can set probability distributions of nodes separated from X by Z in such a way that they encode an injective mapping ! from Omega Z to Omega Y in the sense that P (Y=!(z)|Z=z) = 1 for all z 2 Omega Z . Proof: The corollary follows readily from Lemma 3 and the property of the root-walking operation (Zhang, 2002). 2 Proof of Claim 1: Consider the following matrix [ @P (X, Z)@t,(0) 1 . . . , @P (X, Z)@t,(0) k0 ] (7) Because t,(0)1 , t,(0)2 , . . . , t,(0)k0 are the parameters for the joint distribution P (X, Z), this matrix is the identity matrix if the rows are properly arranged. So its column vectors are linearly independent almost everywhere. Now consider the first k0 column vectors of JM : @P (O, Y)/@t,(0)1 , . . . , @P (O, Y)/@t,(0)k0 . They must be linearly independent almost everywhere. If not, one of the vectors, say @P (O, Y)/@t,(0)k0 , would linearly depend on the rest everywhere according to Lemma 1. Observe that for any i (1<=i<=k0), @P (O, Y) @t,(0)i = XX,Z P (O|X)P (Y|Z) @P (X, Z) @t,(0)i . Choose P (O|X) and P (Y|Z) as in Corollary 1. The vector @P (O, Y)/@t,(0)i might contain zero elements. If we remove the zero elements, what remains of the vector is identical to @P (X, Z)/@t,(0)i . So we can conclude that @P (X, Z)/@t,(0)k0 linearly depends on 12 Effective Dimensions of HLC Models @P (X, Z)/@t,(0)1 . . . , @P (X, Z)/@t,(0)k0-1 everywhere, which contradicts the conclusion of the previous paragraph. Hence the first k0 vectors of JM must be linearly independent almost everywhere. It is evident that, using similar arguments, we can also show that the first k0 vectors of JM1 (JM2) are linearly independent almost everywhere. Claim 1 is therefore proved. 2 5.4 Proof of Claim 2 Every column vector of JM1 linearly depends on vectors listed in (4) everywhere. Observe that @P (O, Y) @t,(0)i = XZ P (Y|Z) @P (O, Z) @t,(0)i , i = 1, . . . , k 0 @P (O, Y) @t,(1)i = XZ P (Y|Z) @P (O, Z) @t,(1)i , i = 1, . . . , k 1. Therefore every column vector of JM that corresponds to vectors in JM1 linearly depends on the first k0+r vectors listed in (6) everywhere. By symmetry, every column vector of JM that corresponds to vectors in JM2 linearly depends on the first k0 and the last s vectors listed in (6) everywhere. The claim is proved. 2 5.5 Proof of Claim 3 We prove this claim by contradiction. Assume the vectors listed in (6) were not linearly independent almost everywhere. According to Lemma 1, one of them, say v, must linearly depend on the rest everywhere. Because of Claim 1 and Lemma 2, we can assume that v is among the last r+s vectors. Without loss of generality, we assume that v is @P (O, Y)/@t,(2)s . Then for any value of ~t,, there exist real numbers ci (1<=i<=k0), c(1)i (1<=i<=r), and c(2)i (1<=i<=s-1) such that @P (O, Y) @t,(2)s = k0X i=1 ci @P (O, Y)@t,(0) i + rX i=1 c(1)i @P (O, Y)@t,(1) i + s-1X i=1 c(2)i @P (O, Y)@t,(2) i . Note that in the last term on the right hand side, i runs from 1 to s-1. The parameter vector ~t, consists of three subvectors ~t,(0), ~t,(1) and ~t,(2). Set the parameters~ t,(1) (for the X-branch) as in Lemma 3. Then there exists an injective mapping ! from Omega X to Omega O such that P (O=!(x)|X=x) = 1 for all x 2 Omega X . (8) For each of the vectors in (6), consider the subvector consisting only of elements for those states of O that are the images of states of X under the mapping !. Such subvectors will be denoted by @P (OX , Y)/@t,(0)i , @P (OX , Y)/@t,(1)i , and @P (OX , Y)/@t,(2)i . For any values of ~t,(0) and ~t,(2), we still have 13 Zhang & Ko^cka @P (OX , Y) @t,(2)s = k0X i=1 ci @P (OX , Y)@t,(0) i + rX i=1 c(1)i @P (OX , Y)@t,(1) i + s-1X i=1 c(2)i @P (OX , Y)@t,(2) i . (9) Consider the first two terms on the right hand side: k0X i=1 ci @P (OX , Y)@t,(0) i + rX i=1 c(1)i @P (OX , Y)@t,(1) i = k0X i=1 ci X Z P (Y|Z) @P (OX , Z)@t,(0) i + rX i=1 c(1)i X Z P (Y|Z) @P (OX , Z)@t,(1) i = X Z P (Y|Z){ k0X i=1 ci @P (OX , Z)@t,(0) i + rX i=1 c(1)i @P (OX, Z)@t,(1) i } Because of (8) and the fact that P (O, Z) = PX P (X, Z)P (O|X), the column vector @P (OX , Z)/@t,(0)i is identical to the vector @P (X, Z)/@t,(0)i . As we have argued when proving Claim 1, the vectors {@P (X, Z)/@t,(0)i |i=1, . . . , k0} constitute a basis for the k0-dimensional Euclidian space. This implies that, each of the vectors @P (OX , Z)/@t,(1)i can be represented as a linear combination of the vectors {@P (OX , Z)/@t,(0)i |i = 1, . . . , k0}. Consequently, there exist c0i (1<=i<=k0) such that k0X i=1 ci @P (OX , Z)@t,(0) i + rX i=1 c(1)i @P (OX , Z)@t,(1) i = k0X i=1 c0i @P (OX , Z)@t,(0) i Hence k0X i=1 ci @P (OX , Y)@t,(0) i + rX i=1 c(1)i @P (OX, Y)@t,(1) i = k0X i=1 c0i @P (OX, Y)@t,(0) i Combining this equation with equation (9), we get @P (OX , Y) @t,(2)s = k0X i=1 c0i @P (OX , Y)@t,(0) i + s-1X i=1 c(2)i @P (OX, Y)@t,(2) i . Because of (8) and the fact that the fact that P (O, Y) = PX P (X, Y)P (O|X), the column vector @P (OX , Y)/@t,(1)i is identical to the vector @P (X, Y)/@t,(1)i and the column vector @P (OX , Y)/@t,(2)i is identical to the vector @P (X, Y)/@t,(2)i . Hence @P (X, Y) @t,(2)s = k0X i=1 c0i @P (X, Y)@t,(0) i + s-1X i=1 c(2)i @P (X, Y)@t,(2) i . This contradicts the fact that the vectors in the equation form a basis for the column space of JM2 almost everywhere (see (5) in Section 5.2) Therefore, Claim 3 must be true. 2 14 Effective Dimensions of HLC Models 6. Effective Dimensions of Trees Let us use the term tree model to refer to Markov random fields on undirected trees over a finite number of random variables. If we root a tree model at any of its nodes, we get a tree-structured Bayesian network model. In a tree model, define leaf nodes be those that have only one neighbor. An HLC model is a tree model where all leaf nodes are observed while all others are latent. It turns out that Theorem 1 enables us to compute the effective dimension of any tree model. Consider an arbitrary tree model. If some of its leaf nodes are latent, we can remove such nodes without affecting its effective dimension. After removing latent leaf nodes, all the leaf nodes are observed. If some non-leaf nodes are also observed, we can decompose the model into submodels at any observed non-leaf node. The following theorem tells us how the model and the submodels are related in terms of effective dimensions. Theorem 2 Suppose Y is an observed non-leaf node in a tree model M . If M decomposes at Y into k submodels M1, . . . , Mk, then de(M ) = kX i=1 de(Mi) - (k - 1)(|Y | - 1). After all possible decompositions, the final submodels either do not contain latent nodes or are HLC models. Effective dimensions of submodels with no latent variables are simply their standard dimensions. If an HLC submodel is irregular, we make it regular by applying the transformation mentioned at the end of Section 3.2. The transformation does not affect the effective dimensions of the submodels. Finally, effective dimensions of regular HLC submodels can be computed using Theorem 1. Proof of Theorem 2: It is possible to prove this theorem starting from the Jacobian matrix. Here we take a less formal but more revealing approach. It suffices to consider case of k being 2. The two submodels M1 and M2 share only one node, namely Y . Let O1 and O2 be respectively the sets of observed nodes in those two submodels excluding Y . Root M at Y . Then we have P (Y, O1, O2)P (Y ) = P (O1, Y )P (O2, Y ). Let ~t,0 be the set of parameters in the distribution P (Y ), ~t,1 and ~t,2 be respectively the sets of parameters in the conditional probability distributions of nodes in M1 and M2. Consider fixing ~t,0 and letting ~t,1 and ~t,2 vary. In this case, the space spanned by P (Y ) consists of only one vector, namely ~t,0 itself. Moreover, there is a one-to-one correspondence between vectors in the space spanned by P (Y, O1, O2) and vectors in the Cartesian product of the spaces spanned by P (O1, Y ) and P (O2, Y ). Now let ~t,0 vary. This adds |Y |-1 dimensions to each of the four spaces spanned by P (Y, O1, O2), P (Y ), P (O1, Y ), and P (O2, Y ). Consequently, we have de(M ) = de(M1) + de(M2) - (|Y | - 1). The theorem is proved. 2 15 Zhang & Ko^cka 7. Concluding Remarks In this paper we study the effective dimensions of HLC models. The work is motivated by empirical evidence that the BIC behaves quite well when used with several hill-climbing algorithms for learning HLC models and that the BICe score sometimes leads to better model selection than the BIC score. We have proved a theorem that relates the effective dimension of an HLC model to the effective dimensions of two other HLC models that contain fewer latent variables. Repeated application of the theorem allows one to reduce the task of computing the effective dimension of an HLC model to subtasks of computing effective dimensions of LC models. This makes it computationally feasible to compute the effective dimensions of large HLC models. In addition, we have proved a theorem about effective dimensions of general tree models. This and our main theorem allows one to compute the effective dimension of arbitrary tree models. Acknowledgements This work was initiated when the authors were visiting Department of Computer Science, Aalborg University, Denmark. We thank Poul S. Eriksen, Finn V. Jensen, Jiri Vomlel, Marta Vomlelova, Thomas D. Nielsen, Olav Bangso, Jose Pena, Kristian G. Olesen. We are also grateful to the annonymous reviewers whose comments have helped us greatly in improving this paper. Research on this paper was partially supported by GA CR grant 201/02/1269 and by Hong Kong Research Grant Council under grant HKUST6088/01E. References Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Autom. Contr., 19, 716-723. Bartholomew, D. J. and Knott, M. (1999). Latent variable models and factor analysis, 2nd edition. Kendall's Library of Statistics 7. London: Arnold. Cheeseman, P. and Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatesky-Shaoiro, G., Smyth, P., and Uthurusamy, R. (eds.), Advancesin Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA. Chickering D. M. and Heckerman D. (1997). Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden variables. Machine Learning, 29, 181- 212. Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Probabilistic networks and expert systems, Springer. Kocka, T. and Zhang, N. L. (2002). Dimension correction for hierarchical latent class models. in Proc. of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-02). Geiger D., Heckerman D. and Meek C. (1996). Asymptotic model selection for directed networks with hidden variables. In Proc. of the 12th Conference on Uncertainty in Artificial Intelligence, 283-290. 16 Effective Dimensions of HLC Models Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231. Lazarsfeld, P. F., and Henry, N.W. (1968). Latent structure analysis. Boston: Houghton Mifflin. Rusakov, D. and Geiger, D. (2002). Asymptotic model selection for Naive Bayesian networks. UAI-02. Rusakov, D. and Geiger, D. (2003). Automated analytic asymptotic evaluation of marginal likelihood for latent models. UAI-03. Schwarz G. (1978). Estimating the dimension of a model. In Annals of Statistics, 6, 461-464. Settimi, R. and Smith, J.Q. (1998). On the geometry of Bayesian graphical models with hidden variables. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, S. Francisco, CA, 472-479. Settimi, R. and Smith, J.Q. (1999). Geometry, moments and Bayesian networks with hidden variables. In Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, Florida (3-6 January 1999), Morgan Kaufmann Publishers, S. Francisco, CA. Zhang N. L. (2002). Hierarchical latent class models for cluster analysis. AAAI-02, 230-237. Zhang, N. L., Kocka, T., Karciauskas, G., and Jensen, F. V. (2003). Learning hierarchical latent class models. Technical Report HKUST-CS03-01, Department of Computer Science, Hong Kong University of Science and Technology. Zhang, N. L. (2003). Structural EM for Hierarchical Latent Class Models. Technical Report HKUST-CS03-06, Department of Computer Science, Hong Kong University of Science and Technology. Zhang, N. L. (2003b). Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, to appear. 17