
Journal of Artificial Intelligence Research 2 (1994) 33–88. Submitted 4/94; published 8/94.

Random Worlds and Maximum Entropy

Adam J. Grove (grove@research.nj.nec.com), NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
Joseph Y. Halpern (halpern@almaden.ibm.com), IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120
Daphne Koller (daphne@cs.berkeley.edu), Computer Science Division, University of California, Berkeley, CA 94720

Abstract

Given a knowledge base KB containing first-order and statistical facts, we consider a principled method, called the random-worlds method, for computing a degree of belief that some formula φ holds given KB. If we are reasoning about a world or system consisting of N individuals, then we can consider all possible worlds, or first-order models, with domain {1, …, N} that satisfy KB, and compute the fraction of them in which φ is true. We define the degree of belief to be the asymptotic value of this fraction as N grows large. We show that when the vocabulary underlying φ and KB uses constants and unary predicates only, we can naturally associate an entropy with each world. As N grows larger, there are many more worlds with higher entropy. Therefore, we can use a maximum-entropy computation to compute the degree of belief. This result is in a similar spirit to previous work in physics and artificial intelligence, but is far more general. Of equal interest to the result itself are the limitations on its scope. Most importantly, the restriction to unary predicates seems necessary. Although the random-worlds method makes sense in general, the connection to maximum entropy seems to disappear in the non-unary case. These observations suggest unexpected limitations to the applicability of maximum-entropy methods.

1. Introduction

Consider an agent (or expert system) with some information about a particular subject, such as internal medicine.
Some facts, such as "all patients with hepatitis exhibit jaundice", can be naturally expressed in a standard first-order logic, while others, such as "80% of patients that exhibit jaundice have hepatitis", are statistical. Suppose the agent wants to use this information to make decisions. For example, a doctor might need to decide whether to administer antibiotics to a particular patient Eric. To apply standard tools of decision theory (see (Luce & Raiffa, 1957) for an introduction), the agent must assign probabilities, or degrees of belief, to various events. For example, the doctor may need to assign a degree of belief to an event such as "Eric has hepatitis". We would therefore like techniques for computing degrees of belief in a principled manner, using all the data at hand. In this paper we investigate the properties of one particular formalism for doing this.

The method we consider, which we call the random-worlds method, has origins that go back to Bernoulli and Laplace (1820). It is essentially an application of what has been called the principle of indifference (Keynes, 1921). The basic idea is quite straightforward. Suppose we are interested in attaching a degree of belief to a formula φ given a knowledge base KB. One useful way of assigning semantics to degrees of belief is to use a probability distribution over a set of possible worlds (Halpern, 1990). More concretely, suppose for now that we are reasoning about N individuals, 1, …, N. A world is a complete description of which individuals have each of the properties of interest. Formally, a world is just a model, or interpretation, over our first-order language.

© 1994 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
For example, if our language consists of the unary predicates Hepatitis, Jaundice, Child, and BlueEyed, the binary predicate Infected-By, and the constant Eric, then a world describes which subset of the N individuals satisfies each of the unary predicates, which set of pairs is in the Infected-By relation, and which of the N individuals is Eric. Given a prior probability distribution over the set of possible worlds, the agent can obtain a degree of belief in φ given KB by conditioning on KB to obtain a posterior distribution, and then computing the probability of φ according to this new distribution. The random-worlds method uses the principle of indifference to choose a particular prior distribution over the set of worlds: all the worlds are taken to be equally likely. It is easy to see that the degree of belief in φ given KB is then precisely the fraction of worlds satisfying KB that also satisfy φ.

The approach so far described applies whenever we actually know the precise domain size N; unfortunately this is fairly uncommon. In many cases, however, it is reasonable to believe that N is "large". We are thus particularly interested in the asymptotic behavior of this fraction; that is, we take our degree of belief to be the asymptotic value of this fraction as N grows large. For example, suppose we want to reason about a domain of hospital patients, and KB is the conjunction of the following four formulas:

* ∀x (Hepatitis(x) ⇒ Jaundice(x)) ("all patients with hepatitis exhibit jaundice"),
* ‖Hepatitis(x) | Jaundice(x)‖_x ≈ 0.8 ("approximately 80% of patients that exhibit jaundice have hepatitis"; we explain this formalism and the reason we say "approximately 80%" rather than "exactly 80%" in Section 2),
* ‖BlueEyed(x)‖_x ≈ 0.25 ("approximately 25% of patients have blue eyes"),
* Jaundice(Eric) ∧ Child(Eric) ("Eric is a child who exhibits jaundice").
Let φ be Hepatitis(Eric); that is, we want to ascribe a degree of belief to the statement "Eric has hepatitis". Suppose the domain has size N. Then we want to consider all worlds with domain {1, …, N} such that the set of individuals satisfying Hepatitis is a subset of those satisfying Jaundice, approximately 80% of the individuals satisfying Jaundice also satisfy Hepatitis, approximately 25% of the individuals satisfy BlueEyed, and (the interpretation of) Eric is an individual satisfying Jaundice and Child. It is straightforward to show that, as expected, Hepatitis(Eric) holds in approximately 80% of these structures. Moreover, as N gets large, the fraction of structures in which Hepatitis(Eric) holds converges to exactly 0.8. Since 80% of the patients that exhibit jaundice have hepatitis and Eric exhibits jaundice, a degree of belief of 0.8 that Eric has hepatitis seems justifiable. Note that, in this example, the information that Eric is a child is essentially treated as irrelevant. We would get the same answer if we did not have the information Child(Eric). It can also be shown that the degree of belief in BlueEyed(Eric) converges to 0.25 as N gets large. Furthermore, the degree of belief of BlueEyed(Eric) ∧ Hepatitis(Eric) converges to 0.2, the product of 0.8 and 0.25. As we shall see, this is because the random-worlds method treats BlueEyed and Hepatitis as being independent, which is reasonable because there is no evidence to the contrary. (It would surely be strange to postulate that two properties were correlated unless there were reason to believe they were connected in some way.) Thus, at least in this example, the random-worlds method gives answers that follow from the heuristic assumptions made in many standard AI systems (Pearl, 1989; Pollock, 1984; Spiegelhalter, 1986). Are such intuitive results typical? When do we get convergence? And when we do, is there a practical way to compute degrees of belief?
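The example above can be checked directly for small N by brute-force enumeration. The following sketch is our own toy illustration, not the authors' algorithm: it fixes Eric as individual 0, omits the irrelevant Child and BlueEyed predicates, and uses an arbitrary fixed tolerance of 0.05 for "approximately 80%".

```python
from itertools import product

def degree_of_belief(N, tol=0.05):
    """Fraction of KB-worlds of size N in which Hepatitis(Eric) holds.

    Each individual gets a pair of bits (hepatitis, jaundice); Eric is
    individual 0.  Simplified KB:
      - every hepatitis patient has jaundice,
      - the proportion of jaundiced patients with hepatitis is within
        `tol` of 0.8,
      - Jaundice(Eric).
    """
    satisfying = consistent = 0
    # (1, 0) is excluded from the options: Hepatitis(x) implies Jaundice(x)
    for world in product(((0, 0), (0, 1), (1, 1)), repeat=N):
        jaundiced = sum(j for _, j in world)
        hepatitis = sum(h for h, _ in world)
        # ||Hepatitis(x) | Jaundice(x)||_x must be within tol of 0.8
        if jaundiced == 0 or abs(hepatitis / jaundiced - 0.8) > tol:
            continue
        if not world[0][1]:          # Jaundice(Eric)
            continue
        consistent += 1
        satisfying += world[0][0]    # does Hepatitis(Eric) hold?
    return satisfying / consistent
```

For N = 8 this already gives a value close to 0.8; as N grows (with the tolerance shrinking suitably), the fraction converges to 0.8, matching the discussion above.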
The answer to the first question is yes, as we discuss in detail in (Bacchus, Grove, Halpern, & Koller, 1994). In that paper, we show that the random-worlds method is remarkably successful at satisfying the desiderata of both nonmonotonic (default) reasoning (Ginsberg, 1987) and reference class reasoning (Kyburg, 1983). The results of (Bacchus et al., 1994) show that the behavior we saw in the example above holds quite generally, as do many other properties we would hope to have satisfied. Thus, in this paper we do not spend time justifying the random-worlds approach, nor do we discuss its strengths and weaknesses; the reader is referred to (Bacchus et al., 1994) for such discussion and for an examination of previous work in the spirit of random worlds (most notably (Carnap, 1950, 1952) and subsequent work). Rather, we focus on the latter two questions asked above.

These questions may seem quite familiar to readers aware of the work on asymptotic probabilities for various logics. For example, in the context of first-order formulas, it is well known that a formula with no constant or function symbols has an asymptotic probability of either 0 or 1 (Fagin, 1976; Glebskiĭ, Kogan, Liogon'kiĭ, & Talanov, 1969). Furthermore, we can decide which is the case (Grandjean, 1983). However, the 0-1 law fails if the language includes constants or if we look at conditional probabilities (Fagin, 1976), and we need both these features in order to reason about degrees of belief for formulas involving particular individuals, conditioned on what is known. In two companion papers (Grove, Halpern, & Koller, 1993a, 1993b), we consider the question of what happens in the pure first-order case (where there is no statistical information) in greater detail.
We show that as long as there is at least one binary predicate symbol in the language, then not only do we not get asymptotic conditional probabilities in general (as was already shown by Fagin (1976)), but almost all the questions one might want to ask (such as deciding whether the limiting probability exists) are highly undecidable. However, if we restrict to a vocabulary with only unary predicate symbols and constants, then as long as the formula on which we are conditioning is satisfiable in arbitrarily large models (a question which is decidable in the unary case), the asymptotic conditional probability exists and can be computed effectively.

In this paper, we consider the much more useful case where the knowledge base has statistical as well as first-order information. In light of the results of (Grove et al., 1993a, 1993b), for most of the paper we restrict attention to the case when the knowledge base is expressed in a unary language. Our major result involves showing that asymptotic conditional probabilities can often be computed using the principle of maximum entropy (Jaynes, 1957; Shannon & Weaver, 1949).

To understand the use of maximum entropy, suppose the vocabulary consists of the unary predicate symbols P₁, …, Pₖ. We can consider the 2^k atoms that can be formed from these predicate symbols, namely, the formulas of the form P₁′ ∧ … ∧ Pₖ′, where each Pᵢ′ is either Pᵢ or ¬Pᵢ. We can view the knowledge base as placing constraints on the proportion of domain elements satisfying each atom. For example, the constraint ‖P₁(x) | P₂(x)‖_x = 1/2 says that the proportion of the domain satisfying some atom that contains P₂ as a conjunct is twice the proportion satisfying atoms that contain both P₁ and P₂ as conjuncts. Given a model of KB, we can define the entropy of this model as the entropy of the vector denoting the proportions of the different atoms.
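To see why high-entropy proportion vectors come to dominate, note that the number of worlds realizing given atom counts (n₁, …, nₘ), with Σnᵢ = N, is the multinomial coefficient N!/(n₁!⋯nₘ!), whose logarithm is roughly N · H(n₁/N, …, nₘ/N). The sketch below is our own illustration of this standard fact; the choice of 4 atoms and N = 100 is arbitrary.

```python
import math

def worlds_with_counts(counts):
    """Number of ways to assign the N individuals to atoms so that
    atom i gets counts[i] individuals: N! / (n1! * ... * nm!)."""
    total = math.factorial(sum(counts))
    for n in counts:
        total //= math.factorial(n)
    return total

def entropy(proportions):
    """Entropy (in nats) of a proportion vector."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

N = 100
uniform = [25, 25, 25, 25]   # the maximum-entropy vector over 4 atoms
skewed = [70, 10, 10, 10]    # a lower-entropy vector

# The high-entropy vector is realized by vastly more worlds ...
ratio = worlds_with_counts(uniform) / worlds_with_counts(skewed)
# ... and log(#worlds) is already close to N * H(p) at N = 100:
approx = math.log(worlds_with_counts(uniform)) / (N * entropy([n / N for n in uniform]))
```

Already at N = 100 the uniform counts are realized by more than 10¹⁰ times as many worlds as the skewed counts, which is the concentration phenomenon exploited below.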
We show that, as N grows large, there are many more models with high entropy than with lower entropy. Therefore, models with high entropy dominate. We use this concentration phenomenon to show that our degree of belief in φ given KB according to the random-worlds method is closely related to the assignment of proportions to atoms that has maximum entropy among all assignments consistent with the constraints imposed by KB.

The concentration phenomenon relating entropy to the random-worlds method is well known (Jaynes, 1982, 1983). In physics, the "worlds" are the possible configurations of a system typically consisting of many particles or molecules, and the mutually exclusive properties (our atoms) can be, for example, quantum states. The corresponding entropy measure is at the heart of statistical mechanics and thermodynamics. There are subtle but important differences between our viewpoint and that of the physicists. The main one lies in our choice of language. We want to express some intelligent agent's knowledge, which is why we take first-order logic as our starting point. The most specific difference concerns constant symbols. We need these because the most interesting questions for us arise when we have some knowledge about a particular individual, and wish to assign degrees of belief to statements concerning that individual. The parallel in physics would address properties of a single particle, which is generally considered to be well outside the scope of statistical mechanics.

Another work that examines the connection between random worlds and entropy from our point of view (computing degrees of belief for formulas in a particular logic) is that of Paris and Vencovska (1989). They restrict the knowledge base to consist of a conjunction of constraints that (in our notation) have the form ‖α(x) | β(x)‖_x ≈ r and ‖α(x)‖_x ≈ r, where α and β are quantifier-free formulas involving unary predicates only, with no constant symbols.
Not only is most of the expressive power of first-order logic not available in their approach, but the statistical information that can be expressed is quite limited. For example, it is not possible to make general assertions about statistical independence. Paris and Vencovska show that the degree of belief can be computed using maximum entropy for their language. Shastri (1989) has also shown such a result, of nearly equivalent scope. But, as we have already suggested, we believe that it is important to look at a far richer language. Our language allows arbitrary first-order assertions, full Boolean logic, arbitrary polynomial combinations of statistical expressions, and more; these are all features that are actually useful to knowledge-representation practitioners. Furthermore, the random-worlds method makes perfect sense in this rich setting. The goal of this paper is to discover whether the connection to maximum entropy also holds. We show that maximum entropy continues to be widely useful, covering many problems that are far outside the scope of (Paris & Vencovska, 1989; Shastri, 1989).

On the other hand, it turns out that we cannot make this connection for our entire language. For one thing, as we hinted earlier, there are problems if we try to condition on a knowledge base that includes non-unary predicates; we suspect that maximum entropy has no role whatsoever in this case. In addition, we show that there are subtleties that arise involving the interaction between statistical information and first-order quantification. We feel that an important contribution of this paper lies in pointing out some limitations of maximum-entropy methods.

The rest of this paper is organized as follows. In the next section, we discuss our formal framework (essentially, that of (Bacchus, 1990; Halpern, 1990)).
We discuss the syntax and semantics of statistical assertions, issues involving "approximately equals", and define the random-worlds method formally. In Section 3 we state the basic results that connect maximum entropy to random worlds, and in Section 4 we discuss how to use these results as effective computational procedures. In Section 5 we return to the issue of unary versus non-unary predicates, and the question of how widely applicable the principle of maximum entropy is. We conclude in Section 6 with some discussion.

2. Technical preliminaries

In this section, we give the formal definition of our language and the random-worlds method. The material is largely taken from (Bacchus et al., 1994).

2.1 The language

We are interested in a formal logical language that allows us to express both statistical information and first-order information. We therefore define a statistical language L≈, which is a variant of a language designed by Bacchus (1990). For the remainder of the paper, let Φ be a finite first-order vocabulary, consisting of predicate and constant symbols, and let 𝒳 be a set of variables.¹ Our statistical language augments standard first-order logic with a form of statistical quantifier. For a formula ψ(x), the term ‖ψ(x)‖_x is a proportion expression. It will be interpreted as a rational number between 0 and 1 that represents the proportion of domain elements satisfying ψ(x). We actually allow an arbitrary set of variables in the subscript and in the formula ψ. Thus, for example, ‖Child(x, y)‖_x describes, for a fixed y, the proportion of domain elements that are children of y; ‖Child(x, y)‖_y describes, for a fixed x, the proportion of domain elements whose child is x; and ‖Child(x, y)‖_{x,y} describes the proportion of pairs of domain elements that are in the child relation.² We also allow proportion expressions of the form ‖ψ(x) | θ(x)‖_x, which we call conditional proportion expressions.
Such an expression is intended to denote the proportion of domain elements satisfying ψ from among those elements satisfying θ. Finally, any rational number is also considered to be a proportion expression, and the set of proportion expressions is closed under addition and multiplication.

¹ For simplicity, we assume that Φ does not contain function symbols, since these can be defined in terms of predicates.
² Strictly speaking, these proportion expressions should be written with sets of variables in the subscript, as in ‖Child(x, y)‖_{x,y}. However, when the interpretation is clear, we often abuse notation and drop the set delimiters.

One important difference between our syntax and that of (Bacchus, 1990) is the use of approximate equality to compare proportion expressions. There are both philosophical and practical reasons why exact comparisons can be inappropriate. Consider a statement such as "80% of patients with jaundice have hepatitis". If this statement appears in a knowledge base, it is almost certainly there as a summary of a large pool of data. So it would be wrong to interpret the value too literally, to mean that exactly 80% of all patients with jaundice have hepatitis. Furthermore, this interpretation would imply (among other things) that the number of jaundiced patients is a multiple of five! This is unlikely to be something we intend. We therefore use the approach described in (Bacchus et al., 1994; Koller & Halpern, 1992), and compare proportion expressions using (instead of = and ≤) one of an infinite family of connectives ≈ᵢ and ⪅ᵢ, for i = 1, 2, 3, … ("i-approximately equal" or "i-approximately less than or equal"). For example, we can express the statement "80% of jaundiced patients have hepatitis" by the proportion formula ‖Hep(x) | Jaun(x)‖_x ≈₁ 0.8.
The intuition behind the semantics of approximate equality is that each comparison should be interpreted using some small tolerance factor to account for measurement error, sample variations, and so on. The appropriate tolerance will differ for various pieces of information, so our logic allows different subscripts on the "approximately equals" connectives. A formula such as ‖Fly(x) | Bird(x)‖_x ≈₁ 1 ∧ ‖Fly(x) | Bat(x)‖_x ≈₂ 1 says that both ‖Fly(x) | Bird(x)‖_x and ‖Fly(x) | Bat(x)‖_x are approximately 1, but the notion of "approximately" may be different in each case. The actual choice of subscript for ≈ is unimportant. However, it is important to use different subscripts for different approximate comparisons unless the tolerances for the different measurements are known to be the same.

We can now give a recursive definition of the language L≈.

Definition 2.1: The set of terms in L≈ is 𝒳 ∪ C, where C is the set of constant symbols in Φ. The set of proportion expressions is the least set that (a) contains the rational numbers, (b) contains proportion terms of the form ‖ψ‖_X and ‖ψ | θ‖_X for formulas ψ, θ ∈ L≈ and a finite set of variables X ⊆ 𝒳, and (c) is closed under addition and multiplication. The set of formulas in L≈ is the least set that (a) contains atomic formulas of the form R(t₁, …, tᵣ), where R is a predicate symbol in Φ ∪ {=} of arity r and t₁, …, tᵣ are terms, (b) contains proportion formulas of the form ζ ≈ᵢ ζ′ and ζ ⪅ᵢ ζ′, where ζ and ζ′ are proportion expressions and i is a natural number, and (c) is closed under conjunction, negation, and first-order quantification.

Note that L≈ allows the use of equality when comparing terms, but not when comparing proportion expressions. This definition allows arbitrary nesting of quantifiers and proportion expressions.
As observed in (Bacchus, 1990), the subscript x in a proportion expression binds the variable x in the expression; indeed, we can view ‖·‖_x as a new type of quantification.

We now need to define the semantics of the logic. As we shall see below, most of the definitions are fairly straightforward. The two features that cause problems are approximate comparisons and conditional proportion expressions. We interpret the approximate connective ζ ≈ᵢ ζ′ to mean that ζ is very close to ζ′. More precisely, it is within some very small tolerance factor. We formalize this using a tolerance vector τ⃗ = ⟨τ₁, τ₂, …⟩, τᵢ > 0. Intuitively, ζ ≈ᵢ ζ′ if the values of ζ and ζ′ are within τᵢ of each other. Of course, one problem with this is that we generally will not know the value of τᵢ. We postpone discussion of this issue until the next section.

Another difficulty arises when interpreting conditional proportion expressions. The problem is that ‖ψ | θ‖_X cannot be defined as a conditional probability when there are no assignments to the variables in X that would satisfy θ, because we cannot divide by zero. When standard equality is used rather than approximate equality this problem is easily overcome, simply by avoiding conditional probabilities in the semantics altogether. Following (Halpern, 1990), we can eliminate conditional proportion expressions altogether by viewing a statement such as ‖ψ | θ‖_X = α as an abbreviation for ‖ψ ∧ θ‖_X = α‖θ‖_X. Thus, we never actually form quotients of probabilities. This approach agrees completely with the standard interpretation of conditionals so long as ‖θ‖_X ≠ 0. If ‖θ‖_X = 0, it enforces the convention that formulas such as ‖ψ | θ‖_X = α or ‖ψ | θ‖_X ≤ α are true for any α. (Note that we do not really care much what happens in such cases, so long as it is consistent and well-defined.
This convention represents one reasonable choice.) We used the same approach in an earlier version of this paper (Grove, Halpern, & Koller, 1992) in the context of a language that uses approximate equality. Unfortunately, as the following example shows, this has problems. Unlike the case for true equality, if we multiply by ‖θ‖_X to clear all quotients, we do not obtain an equivalent formula even if ‖θ‖_X is nonzero.

Example 2.2: First consider the knowledge base KB = {‖Fly(x) | Penguin(x)‖_x ≈₁ 0}. This says that the number of flying penguins forms a tiny proportion of all penguins. However, if we interpret conditional proportions as above and multiply out, we obtain the knowledge base KB′ = ‖Fly(x) ∧ Penguin(x)‖_x ≈₁ 0 · ‖Penguin(x)‖_x, which is equivalent to ‖Fly(x) ∧ Penguin(x)‖_x ≈₁ 0. KB′ just says that the number of flying penguins is small, and has lost the (possibly important) information that the number of flying penguins is small relative to the number of penguins. It is quite consistent with KB′ that all penguins fly (provided the total number of penguins is small); this is not consistent with KB. Clearly, the process of multiplying out across an approximate connective does not preserve the intended interpretation of the formulas.

This example demonstrates an undesirable interaction between the semantics we have chosen for approximate equality and the process of multiplying out to eliminate conditional proportions. We expect ‖ψ | θ‖_X ≈₁ α to mean that ‖ψ | θ‖_X is within some tolerance τ₁ of α. Assuming ‖θ‖_X > 0, this is the same as saying that ‖ψ ∧ θ‖_X is within τ₁‖θ‖_X of α‖θ‖_X. On the other hand, the expression that results by multiplying out is ‖ψ ∧ θ‖_X ≈₁ α‖θ‖_X. This says that ‖ψ ∧ θ‖_X is within τ₁ (not τ₁‖θ‖_X!) of α‖θ‖_X. As we saw above, the difference between the two interpretations can be significant.
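The gap between KB and KB′ in Example 2.2 is easy to see numerically. The following sketch uses our own toy numbers (a world of 1000 individuals, 10 penguins, all of them flying, and tolerance τ₁ = 0.05): the world satisfies the multiplied-out KB′ but not the original KB.

```python
N = 1000
penguin = 10 / N           # ||Penguin(x)||_x: penguins are rare
fly_and_penguin = 10 / N   # ||Fly(x) AND Penguin(x)||_x: all 10 penguins fly
tau1 = 0.05                # tolerance for the connective ~=_1

# KB: ||Fly(x) | Penguin(x)||_x is within tau1 of 0.
# Here the conditional proportion is 1.0, so KB fails:
satisfies_kb = abs(fly_and_penguin / penguin - 0.0) <= tau1

# KB': ||Fly(x) AND Penguin(x)||_x is within tau1 of 0 * ||Penguin(x)||_x.
# Here the unconditional proportion is only 0.01, so KB' holds:
satisfies_kb_prime = abs(fly_and_penguin - 0.0 * penguin) <= tau1
```

This is exactly the "all penguins fly, but penguins are rare" world that distinguishes the two readings.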
Because of this problem, we cannot treat conditional proportions as abbreviations and instead have added them as primitive expressions in the language. Of course, we now have to give them a semantics that avoids the problem illustrated by Example 2.2. We would like to maintain the conventions used when we had equality in the language. Namely, in worlds where ‖θ(x)‖_x ≠ 0, we want ‖ψ(x) | θ(x)‖_x to denote the fraction of elements satisfying θ(x) that also satisfy ψ(x). In worlds where ‖θ(x)‖_x = 0, we want formulas of the form ‖ψ(x) | θ(x)‖_x ≈ᵢ α or ‖ψ(x) | θ(x)‖_x ⪅ᵢ α to be true. There are a number of ways of accomplishing this. The way we take is perhaps not the simplest, but it introduces machinery that will be helpful later. The basic idea is to make the interpretation of ≈ more explicit, so that we can eliminate conditional proportions by multiplication and keep track of all the consequences of doing so.

We give semantics to the language L≈ by providing a translation from formulas in L≈ to formulas in a language L₌ whose semantics is more easily described. The language L₌ is essentially the language of (Halpern, 1990), which uses true equality rather than approximate equality when comparing proportion expressions. More precisely, the definition of L₌ is identical to the definition of L≈ given in Definition 2.1, except that:

* we use = and ≤ instead of ≈ᵢ and ⪅ᵢ,
* we allow the set of proportion expressions to include arbitrary real numbers (not just rational numbers),
* we do not allow conditional proportion expressions,
* we assume that L₌ has a special family of variables εᵢ, for i = 1, 2, …, interpreted over the reals.

The variable εᵢ is used to explicitly interpret the approximate equality connectives ≈ᵢ and ⪅ᵢ. Once this is done, we can safely multiply out the conditionals, as described above.
More precisely, every formula ξ ∈ L≈ can be associated with a formula ξ* ∈ L₌ as follows:

* every proportion formula ζ ⪅ᵢ ζ′ in ξ is (recursively) replaced by ζ − ζ′ ≤ εᵢ,
* every proportion formula ζ ≈ᵢ ζ′ in ξ is (recursively) replaced by the conjunction (ζ − ζ′ ≤ εᵢ) ∧ (ζ′ − ζ ≤ εᵢ),
* finally, conditional proportion expressions are eliminated by multiplying out.

This translation allows us to embed L≈ into L₌. Thus, for the remainder of the paper, we regard L≈ as a sublanguage of L₌. This embedding avoids the problem encountered in Example 2.2, because when we multiply to clear conditional proportions the tolerances are explicit, and so are also multiplied as appropriate.

The semantics for L₌ is quite straightforward, and is similar to that in (Halpern, 1990). We give semantics to L₌ in terms of worlds, or finite first-order models. For any natural number N, let W_N consist of all worlds with domain {1, …, N}. Thus, in W_N, we have one world for each possible interpretation of the symbols in Φ over the domain {1, …, N}. Let W* denote ⋃_N W_N. Now, consider some world W ∈ W* over the domain D = {1, …, N}, some valuation V : 𝒳 → D for the variables in 𝒳, and some tolerance vector τ⃗. We simultaneously assign to each proportion expression ζ a real number [ζ]_(W,V,τ⃗) and to each formula a truth value with respect to (W, V, τ⃗). Most of the clauses of the definition are completely standard, so we omit them here. In particular, variables are interpreted using V, the tolerance variables εᵢ are interpreted using the tolerances τᵢ, the predicates and constants are interpreted using W, the Boolean connectives and the first-order quantifiers are defined in the standard fashion, and when interpreting proportion expressions, the real numbers, addition, multiplication, and ≤ are given their standard meaning.
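The translation clauses above can be sketched on a tiny expression representation of our own devising (it is a toy, not the paper's formal machinery): an approximate comparison becomes one or two inequalities against an explicit tolerance variable εᵢ, which is then interpreted as τᵢ.

```python
def translate(formula):
    """Translate a toy approximate comparison into its L= form.

    A comparison is ('approx', i, z1, z2) for z1 ~=_i z2, or
    ('aleq', i, z1, z2) for z1 approximately-<=_i z2, where z1 and z2
    are already-evaluated proportion expressions (numbers).  The result
    is a list of inequalities (lhs, '<=', ('eps', i)) meaning lhs <= eps_i.
    """
    kind, i, z1, z2 = formula
    if kind == 'aleq':       # zeta <~_i zeta'  ->  zeta - zeta' <= eps_i
        return [(z1 - z2, '<=', ('eps', i))]
    if kind == 'approx':     # zeta ~=_i zeta'  ->  both differences <= eps_i
        return [(z1 - z2, '<=', ('eps', i)), (z2 - z1, '<=', ('eps', i))]
    raise ValueError(kind)

def holds(inequalities, tau):
    """Interpret each eps_i as the tolerance tau_i (tau is 1-indexed)."""
    return all(lhs <= tau[i - 1] for lhs, _, (_, i) in inequalities)

# ||Hep(x) | Jaun(x)||_x evaluated to 0.83, compared ~=_1 to 0.8, tau_1 = 0.05:
ok = holds(translate(('approx', 1, 0.83, 0.8)), [0.05])
```

Under this reading, 0.83 ≈₁ 0.8 holds at tolerance 0.05 but fails at tolerance 0.01, which is the behavior the εᵢ variables make explicit.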
It remains to interpret proportion terms. Recall that we eliminate conditional proportion terms by multiplying out, so that we need to deal only with unconditional proportion terms. If ζ is the proportion expression ‖ψ‖_{x_{i_1},…,x_{i_k}} (for i_1 < i_2 < … < i_k), then

  [ζ]_(W,V,τ⃗) = (1/|D|^k) · |{(d₁, …, d_k) ∈ D^k : (W, V[x_{i_1}/d₁, …, x_{i_k}/d_k], τ⃗) ⊨ ψ}|.

Thus, if |D| = N, the proportion expression ‖ψ‖_{x_{i_1},…,x_{i_k}} denotes the fraction of the N^k k-tuples in D^k that satisfy ψ. For example, [‖Child(x, y)‖_x]_(W,V,τ⃗) is the fraction of domain elements d that are children of V(y).

Using our embedding of L≈ into L₌, we now have semantics for L≈. For ξ ∈ L≈, we say that (W, V, τ⃗) ⊨ ξ iff (W, V, τ⃗) ⊨ ξ*. It is sometimes useful in our future results to incorporate particular values for the tolerances into the formula ξ*. Thus, let ξ[τ⃗] represent the formula that results from ξ* if each variable εᵢ is replaced with its value according to τ⃗, that is, τᵢ.³ Typically we are interested in closed sentences, that is, formulas with no free variables. In that case, it is not hard to show that the valuation plays no role. Thus, if ξ is closed, we write (W, τ⃗) ⊨ ξ rather than (W, V, τ⃗) ⊨ ξ. Finally, if KB and ξ are closed formulas, we write KB ⊨ ξ if (W, τ⃗) ⊨ KB implies (W, τ⃗) ⊨ ξ.

2.2 Degrees of belief

As we explained in the introduction, we give semantics to degrees of belief by considering all worlds of size N to be equally likely, conditioning on KB, and then checking the probability of φ over the resulting probability distribution. In the previous section, we defined what it means for a sentence ξ to be satisfied in a world of size N using a tolerance vector τ⃗. Given N and τ⃗, we define #worlds^τ⃗_N(ξ) to be the number of worlds W in W_N such that (W, τ⃗) ⊨ ξ.
Since we are taking all worlds to be equally likely, the degree of belief in φ given KB with respect to W_N and τ⃗ is

  Pr^τ⃗_N(φ | KB) = #worlds^τ⃗_N(φ ∧ KB) / #worlds^τ⃗_N(KB).

If #worlds^τ⃗_N(KB) = 0, this degree of belief is not well-defined.

The careful reader may have noticed a potential problem with this definition. Strictly speaking, we should write W_N(Φ) rather than W_N, since the set of worlds under consideration clearly depends on the vocabulary. Hence, the number of worlds in W_N also depends on the vocabulary. Thus, both #worlds^τ⃗_N(φ) and #worlds^τ⃗_N(φ ∧ KB) depend on the choice of Φ. Fortunately, this dependence "cancels out": if Φ′ ⊇ Φ, then there is a constant c such that for all formulas ξ over the vocabulary Φ, #worlds^τ⃗_N[Φ′](ξ) = c · #worlds^τ⃗_N[Φ](ξ). This result, from which it follows that the degree of belief Pr^τ⃗_N(φ | KB) is independent of our choice of vocabulary, is proved in (Grove et al., 1993b).

Typically, we know neither N nor τ⃗ exactly. All we know is that N is "large" and that τ⃗ is "small". Thus, we would like to take our degree of belief in φ given KB to be lim_{τ⃗→0⃗} lim_{N→∞} Pr^τ⃗_N(φ | KB). Notice that the order of the two limits over τ⃗ and N is important. If the limit lim_{τ⃗→0⃗} appeared last, then we would gain nothing by using approximate equality, since the result would be equivalent to treating approximate equality as exact equality. This definition, however, is not sufficient; the limit may not exist. We observed above that Pr^τ⃗_N(φ | KB) is not always well-defined. In particular, it may be the case that for certain values of τ⃗, Pr^τ⃗_N(φ | KB) is not well-defined for arbitrarily large N.

³ Note that some of the tolerances τᵢ may be irrational; it is for this reason that we allowed irrational numbers in proportion expressions in L₌.
In order to deal with this problem of well-definedness, we define $KB$ to be eventually consistent if, for all sufficiently small $\vec\tau$ and sufficiently large $N$, $\#worlds^{\vec\tau}_N(KB) > 0$. Among other things, eventual consistency implies that $KB$ is satisfiable in finite domains of arbitrarily large size. For example, a $KB$ stating that "there are exactly 7 domain elements" is not eventually consistent. For the remainder of the paper, we assume that all knowledge bases are eventually consistent. In practice, we expect eventual consistency to be no harder to check than consistency: we do not expect a knowledge base to place bounds on the domain size, except when the bound is readily apparent. For those unsatisfied with this intuition, it is also possible to give formal conditions ensuring eventual consistency. For instance, one can show that the following two conditions together suffice to guarantee that $KB$ is eventually consistent: (a) $KB$ does not use any non-unary predicates, including equality between terms, and (b) $KB$ is consistent for some domain size when all approximate comparisons are replaced by exact comparisons. Since we concentrate on unary languages in this paper, this result covers most cases of interest.

Even if $KB$ is eventually consistent, the limit may not exist. For example, it may be that $\Pr^{\vec\tau}_N(\varphi \mid KB)$ oscillates between $\alpha + \tau_i$ and $\alpha - \tau_i$ for some $i$ as $N$ gets large. In this case, for any particular $\vec\tau$, the limit as $N$ grows will not exist. However, it seems as if the limit as $\vec\tau$ grows small should, in this case, be $\alpha$, since the magnitude of the oscillations about $\alpha$ goes to 0. We avoid such problems by considering the lim sup and lim inf rather than the limit. For any set $S \subseteq \mathbb{R}$, the infimum of $S$, $\inf S$, is the greatest lower bound of $S$. The lim inf of a sequence is the limit of the infimums of its tails; that is,
$$\liminf_{N\to\infty} a_N = \lim_{N\to\infty} \inf\{a_i : i > N\}.$$
The lim inf exists for any sequence bounded from below, even if the limit does not.
The lim sup is defined analogously, where $\sup S$ denotes the least upper bound of $S$. If $\lim_{N\to\infty} a_N$ exists, then $\lim_{N\to\infty} a_N = \liminf_{N\to\infty} a_N = \limsup_{N\to\infty} a_N$. Since, for any $\vec\tau$, the sequence $\Pr^{\vec\tau}_N(\varphi \mid KB)$ is bounded from above and below, its lim sup and lim inf always exist. Thus, we do not have to worry about nonexistence for particular values of $\vec\tau$. We can now present the final form of our definition.

Definition 2.3: If
$$\lim_{\vec\tau\to\vec 0}\, \liminf_{N\to\infty} \Pr^{\vec\tau}_N(\varphi \mid KB) \quad\text{and}\quad \lim_{\vec\tau\to\vec 0}\, \limsup_{N\to\infty} \Pr^{\vec\tau}_N(\varphi \mid KB)$$
both exist and are equal, then the degree of belief in $\varphi$ given $KB$, written $\Pr_\infty(\varphi \mid KB)$, is defined as the common limit; otherwise $\Pr_\infty(\varphi \mid KB)$ does not exist.

We close this section with a few remarks on our definition. First note that, even with this definition, there are many cases in which the degree of belief does not exist. However, as some of our later examples show, in many situations the nonexistence of a degree of belief can be understood intuitively (for instance, see Example 4.3 and the subsequent discussion). We could, alternatively, have taken the degree of belief to be the interval whose endpoints are $\lim_{\vec\tau\to\vec 0} \liminf_{N\to\infty} \Pr^{\vec\tau}_N(\varphi \mid KB)$ and $\lim_{\vec\tau\to\vec 0} \limsup_{N\to\infty} \Pr^{\vec\tau}_N(\varphi \mid KB)$, provided each of them exists. This would have been a perfectly reasonable choice; most of the results we state would go through with very little change had we adopted this definition. Our definition simplifies the exposition slightly. Finally, we remark that it may seem unreasonable to take limits if we know the domain size, or a bound on it. Clearly, if we know $N$ and $\vec\tau$, then it seems more reasonable to use $\Pr^{\vec\tau}_N$ rather than $\Pr_\infty$ as our degree of belief. Indeed, as shown in (Bacchus et al., 1994), many of the important properties that hold for the degree of belief defined by $\Pr_\infty$ also hold for $\Pr^{\vec\tau}_N$, for all choices of $N$ and $\vec\tau$.
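The effect of taking tail infima can be seen numerically on an oscillating sequence of the kind just described. In the small illustration below (our own choice of sequence; the truncation horizon is an artifact of the sketch), the sequence oscillates with amplitude approaching $\tau$ around $\alpha$, so the limit does not exist but the lim inf and lim sup do:

```python
def tail_inf(a, n, horizon=10_000):
    """inf{a_i : i > n}, truncated at a finite horizon for illustration."""
    return min(a(i) for i in range(n + 1, horizon))

def tail_sup(a, n, horizon=10_000):
    """sup{a_i : i > n}, truncated at a finite horizon."""
    return max(a(i) for i in range(n + 1, horizon))

alpha, tau = 0.5, 0.1
# Oscillates between roughly alpha - tau and alpha + tau as i grows:
a = lambda i: alpha + ((-1) ** i) * tau * i / (i + 1)

print(round(tail_inf(a, 100), 3), round(tail_sup(a, 100), 3))   # 0.4 0.6
```

The lim inf and lim sup are $\alpha - \tau$ and $\alpha + \tau$; as $\tau \to 0$ both tend to $\alpha$, matching the intuition in the text.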
The connection to maximum entropy that we make in this paper holds only in the limit; but because (as our proofs show) the convergence is rapid, the degree of belief $\Pr_\infty(\varphi \mid KB)$ is typically a very good approximation to $\Pr^{\vec\tau}_N(\varphi \mid KB)$, even for moderately large $N$ and moderately small $\vec\tau$.

3. Degrees of belief and entropy

3.1 Introduction to maximum entropy

The idea of maximizing entropy has played an important role in many fields, including the study of probabilistic models for inferring degrees of belief (Jaynes, 1957; Shannon & Weaver, 1949). In the simplest setting, we can view entropy as a real-valued function on finite probability spaces. If $\Omega$ is a finite set and $\mu$ is a probability measure on $\Omega$, the entropy $H(\mu)$ is defined to be $-\sum_{\omega\in\Omega} \mu(\omega) \ln \mu(\omega)$ (where we take $0 \ln 0 = 0$). One standard application of entropy is the following. Suppose we know the space $\Omega$, but have only partial information about $\mu$, expressed in the form of constraints; for example, we might have a constraint such as $\mu(\omega_1) + \mu(\omega_2) \ge 1/3$. Although there may be many measures $\mu$ consistent with what we know, the principle of maximum entropy suggests that we adopt the measure $\mu^*$ that has the largest entropy among all the consistent possibilities. Using the appropriate definitions, it can be shown that there is a sense in which this $\mu^*$ incorporates the "least" additional information (Shannon & Weaver, 1949). For example, if we have no constraints on $\mu$, then $\mu^*$ is the measure that assigns equal probability to all elements of $\Omega$. Roughly speaking, $\mu^*$ assigns probabilities as equally as possible given the constraints.

3.2 From formulas to constraints

Like maximum entropy, the random-worlds method is used to determine degrees of belief (i.e., probabilities) relative to a knowledge base. Aside from this, is there any connection between the two ideas?
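The entropy function and the effect of constraints can be illustrated with a crude grid search. This is only a numerical sketch of the principle (the function names are ours, and grid search is not a method used in the paper): with no constraints the near-uniform measure wins, and with a binding constraint the mass is spread as equally as possible subject to it.

```python
from itertools import product
from math import log

def entropy(mu):
    """H(mu) = -sum_w mu(w) ln mu(w), with 0 ln 0 = 0."""
    return -sum(p * log(p) for p in mu if p > 0)

def max_entropy(k, constraint, step=0.01):
    """Crude grid search for a maximum-entropy measure on a k-point
    space subject to `constraint` (numerical sketch only)."""
    n = round(1 / step)
    best, best_h = None, -1.0
    for c in product(range(n + 1), repeat=k - 1):
        if sum(c) > n:
            continue
        mu = [x / n for x in c] + [(n - sum(c)) / n]
        h = entropy(mu)
        if constraint(mu) and h > best_h:
            best, best_h = mu, h
    return best

# Unconstrained: the (near-)uniform measure on 3 points maximizes entropy.
print(max_entropy(3, lambda mu: True))
# Binding constraint mu(w1) <= 0.1: mass is spread as evenly as possible
# over the remaining two elements.
print(max_entropy(3, lambda mu: mu[0] <= 0.1))
```

The second call returns a measure with first coordinate at the constraint boundary $0.1$ and the rest split evenly, which is exactly the "as equal as possible given the constraints" behavior described above.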
Of course, there is the rather trivial observation that random worlds considers a uniform probability distribution (over the set of worlds satisfying $KB$), and it is well known that the uniform distribution over any set has the highest possible entropy. But in this section we show another, entirely different and much deeper, connection between random worlds and the principle of maximum entropy. This connection holds provided that we restrict the knowledge base so that it uses only unary predicates and constants. In this case we can consider probability distributions, and in particular the maximum-entropy distribution, over the set of atoms. Atoms are of course very different from possible worlds; for instance, there are only finitely many of them (independent of the domain size $N$). Furthermore, the maximum-entropy distributions we consider will typically not be uniform. Nevertheless, maximum entropy in this new space can tell us a lot about the degrees of belief defined by random worlds. In particular, this connection will allow us to use maximum entropy as a tool for computing degrees of belief. We believe that the restriction to unary predicates is necessary for the connection we are about to make. Indeed, as soon as the knowledge base makes use of a binary predicate symbol (or a unary function symbol), we suspect that there is no useful connection between the two approaches at all; see Section 5 for some discussion.

Let $\mathcal{L}^\approx_1$ be the sublanguage of $\mathcal{L}^\approx$ in which only unary predicate symbols and constant symbols appear in formulas; in particular, we assume that equality between terms does not occur in formulas in $\mathcal{L}^\approx_1$ (footnote 4). (Recall that in $\mathcal{L}^\approx$ we allow equality between terms, but disallow equality between proportion expressions.) Let $\mathcal{L}^=_1$ be the corresponding sublanguage of $\mathcal{L}^=$. In this subsection, we show that the expressive power of a knowledge base $KB$ in the language $\mathcal{L}^\approx_1$ is quite limited.
In fact, such a $KB$ can essentially only place constraints on the proportions of the atoms. If we then think of these as constraints on the "probabilities of the atoms", we have the ingredients necessary to apply maximum entropy. In Section 3.3 we show that there is a strong connection between the maximum-entropy distribution found this way and the degree of belief generated by the random-worlds method.

To see what constraints a formula places on the probabilities of atoms, it is useful to convert the formula to a certain canonical form. As a first step, we formalize the definition of atom given in the introduction. Let $\mathcal{P} = \{P_1, \ldots, P_k\}$ consist of the unary predicate symbols in the vocabulary $\Phi$.

Definition 3.1: An atom (over $\mathcal{P}$) is a conjunction of the form $P'_1(x) \wedge \cdots \wedge P'_k(x)$, where each $P'_i$ is either $P_i$ or $\neg P_i$. Since the variable $x$ is irrelevant to our concerns, we typically suppress it and describe an atom as a conjunction of the form $P'_1 \wedge \cdots \wedge P'_k$. Note that there are $2^{|\mathcal{P}|} = 2^k$ atoms over $\mathcal{P}$ and that they are mutually exclusive and exhaustive. Throughout this paper, we use $K$ to denote $2^k$ and $A_1, \ldots, A_K$ to denote the atoms over $\mathcal{P}$, listed in some fixed order.

Example 3.2: There are $K = 4$ atoms over $\mathcal{P} = \{P_1, P_2\}$: $A_1 = P_1 \wedge P_2$, $A_2 = P_1 \wedge \neg P_2$, $A_3 = \neg P_1 \wedge P_2$, and $A_4 = \neg P_1 \wedge \neg P_2$.

The atomic proportion terms $\|A_1(x)\|_x, \ldots, \|A_K(x)\|_x$ will play a significant role in our technical development. It turns out that $\mathcal{L}^\approx_1$ is a rather weak language: a formula $KB \in \mathcal{L}^\approx_1$ does little more than constrain the proportions of the atoms. In other words, for any such $KB$ we can find an equivalent formula in which the only proportion expressions are these unconditional proportions of atoms.

Footnote 4: We remark that many of our results can be extended to the case where the $KB$ mentions equality, but the extra complexity obscures many of the essential ideas.
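For concreteness, the atoms of Definition 3.1 can be enumerated mechanically. The small sketch below (the string encoding with `&` and `~` is ours) reproduces the ordering of Example 3.2:

```python
from itertools import product

def atoms(predicates):
    """All 2^k atoms over a list of unary predicate names, each atom a
    conjunction containing every predicate either positively or negated."""
    return [" & ".join(("" if sign else "~") + p
                       for p, sign in zip(predicates, signs))
            for signs in product([True, False], repeat=len(predicates))]

A = atoms(["P1", "P2"])
print(A)        # ['P1 & P2', 'P1 & ~P2', '~P1 & P2', '~P1 & ~P2']
print(len(A))   # 4
```

The atoms are mutually exclusive and exhaustive by construction: each assignment of signs picks out exactly one cell of the partition.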
The more complex syntactic machinery in $\mathcal{L}^\approx_1$ (proportions over tuples, first-order quantification, nested proportions, and conditional proportions) does not add expressive power. (It does add convenience, however; knowledge can often be expressed far more succinctly if the full power of the language is used.)

Given any $KB$, the first step towards applying maximum entropy is to use $\mathcal{L}^\approx_1$'s lack of expressivity and replace all proportion terms by atomic proportion terms. It is also useful to make various other simplifications to $KB$ that will help us in Section 4. We combine these steps and require that $KB$ be transformed into a special canonical form, which we now describe.

Definition 3.3: An atomic term $t$ over $\mathcal{P}$ is a polynomial over terms of the form $\|A(x)\|_x$, where $A$ is an atom over $\mathcal{P}$. Such an atomic term $t$ is positive if every coefficient of the polynomial $t$ is positive.

Definition 3.4: A (closed) sentence $\chi \in \mathcal{L}^=_1$ is in canonical form if it is a disjunction of conjunctions, where each conjunct has one of the following forms:
* $t' = 0$, $(t' > 0 \wedge t \le t' \varepsilon_i)$, or $(t' > 0 \wedge \neg(t \le t' \varepsilon_i))$, where $t$ and $t'$ are atomic terms and $t'$ is positive;
* $\exists x\, A_i(x)$ or $\neg\exists x\, A_i(x)$ for some atom $A_i$; or
* $A_i(c)$ for some atom $A_i$ and some constant $c$.
Furthermore, a disjunct cannot contain both $A_i(c)$ and $A_j(c)$ for $i \ne j$ as conjuncts, nor can it contain both $A_i(c)$ and $\neg\exists x\, A_i(x)$. (These last conditions are simply minimal consistency requirements.)

Theorem 3.5: Every formula in $\mathcal{L}^=_1$ is equivalent to a formula in canonical form. Moreover, there is an effective procedure that, given a formula $\xi \in \mathcal{L}^=_1$, constructs an equivalent formula $\hat\xi$ in canonical form.

The proof of this theorem, and of all theorems in this paper, can be found in the appendix. We remark that the length of the formula $\hat\xi$ is typically exponential in the length of $\xi$. Such a blowup seems inherent in any scheme defined in terms of atoms.
Theorem 3.5 is a generalization of Claim 5.7.1 in (Halpern, 1990), which is itself a generalization of the well-known result that any first-order formula with only unary predicates is equivalent to one with only depth-one quantifier nesting. Roughly speaking, this is because for a quantified formula such as $\exists x\, \xi'$, subformulas talking about a variable $y$ other than $x$ can be moved outside the scope of the quantifier; this is possible because no literal subformula can talk about $x$ and $y$ together. Our proof uses the same idea and extends it to proportion statements. In particular, it shows that for any $\xi \in \mathcal{L}^\approx_1$ there is an equivalent $\hat\xi$ that has no nested quantifiers and no nested proportions. Notice, however, that such a result does not hold once we allow even a single binary predicate in the language. For example, the formula $\forall y\, \exists x\, R(x,y)$ clearly needs nested quantification, because $R(x,y)$ talks about both $x$ and $y$ and so must remain within the scope of both quantifiers. With binary predicates, each additional depth of nesting really does add expressive power. This shows that there can be no "canonical form" theorem quite like Theorem 3.5 for richer languages. This issue is one of the main reasons why we restrict the $KB$ to a unary language in this paper. (See Section 5 for further discussion.)

Given any formula in canonical form, we can immediately derive from it, in a syntactic manner, a set of constraints on the possible proportions of atoms.

Definition 3.6: Let $KB$ be in canonical form.
We construct a formula $\Gamma(KB)$ in the language of real closed fields (i.e., over the vocabulary $\{0, 1, +, \times\}$) as follows, where $u_1, \ldots, u_K$ are fresh variables (distinct from the tolerance variables $\varepsilon_j$):
* we replace each occurrence of the formula $A_i(c)$ by $u_i > 0$,
* we replace each occurrence of $\exists x\, A_i(x)$ by $u_i > 0$, and each occurrence of $\neg\exists x\, A_i(x)$ by $u_i = 0$,
* we replace each occurrence of $\|A_i(x)\|_x$ by $u_i$.

Notice that $\Gamma(KB)$ has two types of variables: the new variables $u_i$ that we just introduced, and the tolerance variables $\varepsilon_i$. In order to eliminate the dependence on the latter, we often consider the formula $\Gamma(KB[\vec\tau])$ for some tolerance vector $\vec\tau$.

Definition 3.7: Given a formula $\gamma$ over the variables $u_1, \ldots, u_K$, let $Sol[\gamma]$ be the set of vectors in $\Delta_K = \{\vec u \in [0,1]^K : \sum_{i=1}^K u_i = 1\}$ satisfying $\gamma$. Formally, $(a_1, \ldots, a_K) \in Sol[\gamma]$ iff $(\mathbb{R}, V) \models \gamma$, where $V$ is a valuation such that $V(u_i) = a_i$.

Definition 3.8: The solution space of $KB$ given $\vec\tau$, denoted $S^{\vec\tau}[KB]$, is defined to be the closure of $Sol[\Gamma(KB[\vec\tau])]$ (footnote 5). If $KB$ is not in canonical form, we define $\Gamma(KB)$ and $S^{\vec\tau}[KB]$ to be $\Gamma(\widehat{KB})$ and $S^{\vec\tau}[\widehat{KB}]$, respectively, where $\widehat{KB}$ is the equivalent canonical-form formula obtained by the procedure in the proof of Theorem 3.5.

Example 3.9: Let $\mathcal{P}$ be $\{P_1, P_2\}$, with the atoms ordered as in Example 3.2. Consider
$$KB = \forall x\, P_1(x) \,\wedge\, 3\|P_1(x) \wedge P_2(x)\|_x \precsim_i 1.$$
The canonical formula $\widehat{KB}$ equivalent to $KB$ is (footnote 6):
$$\neg\exists x\, A_3(x) \,\wedge\, \neg\exists x\, A_4(x) \,\wedge\, 3\|A_1(x)\|_x - 1 \le \varepsilon_i.$$
As expected, $\widehat{KB}$ constrains both $\|A_3(x)\|_x$ and $\|A_4(x)\|_x$ (i.e., $u_3$ and $u_4$) to be 0. We also see that $\|A_1(x)\|_x$ (i.e., $u_1$) is (approximately) at most $1/3$. Therefore:
$$S^{\vec\tau}[KB] = \bigl\{ (u_1, \ldots, u_4) \in \Delta_4 : u_1 \le 1/3 + \tau_i/3,\; u_3 = u_4 = 0 \bigr\}.$$
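The solution space of Example 3.9 can be transcribed directly as a membership test. This is a sketch of ours (we fix a single tolerance $\tau_i = 0.05$ for illustration, and ignore topological closure, which does not matter for the points tested):

```python
def in_solution_space(u, tau):
    """Membership in S^tau[KB] for Example 3.9: u must lie on the
    simplex Delta_4, with u1 <= 1/3 + tau/3 and u3 = u4 = 0."""
    u1, u2, u3, u4 = u
    on_simplex = all(0 <= x <= 1 for x in u) and abs(sum(u) - 1) < 1e-9
    return on_simplex and u1 <= 1/3 + tau/3 and u3 == 0 and u4 == 0

print(in_solution_space((0.3, 0.7, 0.0, 0.0), tau=0.05))  # True
print(in_solution_space((0.5, 0.5, 0.0, 0.0), tau=0.05))  # False: u1 > 1/3 + tau/3
print(in_solution_space((0.3, 0.6, 0.1, 0.0), tau=0.05))  # False: u3 != 0
```

Note how the two universally quantified conjuncts of $\widehat{KB}$ become the hard equalities $u_3 = u_4 = 0$, while the approximate comparison relaxes to an inequality with slack $\tau_i/3$.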
Footnote 5: Recall that the closure of a set $X \subseteq \mathbb{R}^K$ consists of all $K$-tuples that are the limit of a sequence of $K$-tuples in $X$.

Footnote 6: Note that here we are viewing $KB$ as a formula in $\mathcal{L}^=$, under the translation defined earlier; we do this throughout the paper without further comment.

3.3 The concentration phenomenon

With every world $W \in \mathcal{W}^*$ we can associate a particular tuple $(u_1, \ldots, u_K)$, where $u_i$ is the fraction of the domain satisfying atom $A_i$ in $W$:

Definition 3.10: Given a world $W \in \mathcal{W}^*$, we define $\pi(W) \in \Delta_K$ to be $(\|A_1(x)\|_x, \|A_2(x)\|_x, \ldots, \|A_K(x)\|_x)$, where the values of the proportions are interpreted over $W$. We say that the vector $\pi(W)$ is the point associated with $W$. We define the entropy of a model $W$ to be the entropy of $\pi(W)$; that is, if $\pi(W) = (u_1, \ldots, u_K)$, then the entropy of $W$ is $H(u_1, \ldots, u_K)$.

As we are about to show, the entropy of $\vec u$ turns out to be a very good asymptotic indicator of how many worlds $W$ there are such that $\pi(W) = \vec u$. In fact, there are so many more worlds near points of high entropy that we can ignore all the other points when computing degrees of belief. This concentration phenomenon, as Jaynes (1982) has called it, is essentially the content of the next lemma, and it justifies our interest in the maximum-entropy point(s) of $S^{\vec\tau}[KB]$. For any $S \subseteq \Delta_K$, let $\#worlds^{\vec\tau}_N[S](KB)$ denote the number of worlds $W$ of size $N$ such that $(W,\vec\tau) \models KB$ and $\pi(W) \in S$; for any $\vec u \in \Delta_K$, let $\#worlds^{\vec\tau}_N[\vec u](KB)$ abbreviate $\#worlds^{\vec\tau}_N[\{\vec u\}](KB)$. Of course, $\#worlds^{\vec\tau}_N[\vec u](KB)$ is necessarily zero unless all components of $\vec u$ are multiples of $1/N$. However, if there are any models associated with $\vec u$ at all, we can estimate their number quite accurately using the entropy function:

Lemma 3.11: There exist a function $h: \mathbb{N} \to \mathbb{N}$ and two strictly positive polynomial functions $f, g: \mathbb{N} \to \mathbb{R}$ such that, for $KB \in \mathcal{L}^\approx_1$ and $\vec u \in \Delta_K$, if $\#worlds^{\vec\tau}_N[\vec u](KB) \ne 0$, then
$$\bigl(h(N)/f(N)\bigr)\, e^{N H(\vec u)} \;\le\; \#worlds^{\vec\tau}_N[\vec u](KB) \;\le\; h(N)\, g(N)\, e^{N H(\vec u)}.$$

It follows from the lemma that tuples whose entropy is near maximum have overwhelmingly more worlds associated with them than tuples whose entropy is further from maximum; this is essentially the concentration phenomenon. Lemma 3.11 is actually fairly easy to prove. The following simple example illustrates the main idea.

Example 3.12: Suppose $\Phi = \{P\}$ and $KB = \textit{true}$. We have $\Delta_K = \Delta_2 = \{(u_1, 1 - u_1) : 0 \le u_1 \le 1\}$, where the atoms are $A_1 = P$ and $A_2 = \neg P$. For any $N$, partition the worlds in $\mathcal{W}_N$ according to the point to which they correspond. For example, the graph in Figure 1 shows the partition of $\mathcal{W}_4$. In general, consider some point $\vec u = (r/N, (N-r)/N)$. The number of worlds corresponding to $\vec u$ is simply the number of ways of choosing the denotation of $P$. We need to choose which $r$ elements satisfy $P$; hence, the number of such worlds is $\binom{N}{r} = \frac{N!}{r!\,(N-r)!}$. Figure 2 shows the qualitative behavior of this function for large values of $N$. It is easy to see the asymptotic concentration around $\vec u = (0.5, 0.5)$.

[Figure 1: Partition of $\mathcal{W}_4$ according to $\pi(W)$.]
[Figure 2: Concentration phenomenon for worlds of size $N$.]

We can estimate the factorials appearing in this expression using Stirling's approximation, which asserts that the factorial $m!$ is approximately $m^m = e^{m \ln m}$. So, after substituting for the three factorials, we can estimate $\binom{N}{r}$ as $e^{N \ln N - (r \ln r + (N-r)\ln(N-r))}$, which reduces to $e^{N H(\vec u)}$. The entropy term in the general case arises from the use of Stirling's approximation in an analogous way.
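The counting in Example 3.12, and the flavor of the bounds in Lemma 3.11, can be checked numerically. In this sketch of ours we do not compute $h$, $f$, $g$ explicitly; we only observe that the ratio of $\binom{N}{r}$ to $e^{N H(\vec u)}$ stays moderate, while points of lower entropy are already utterly swamped at $N = 200$:

```python
from math import comb, exp, log

def H(u):
    """Entropy of a tuple of proportions, with 0 ln 0 = 0."""
    return -sum(p * log(p) for p in u if p > 0)

N = 200
peak = comb(N, N // 2)    # worlds with u = (1/2, 1/2): maximum entropy ln 2
off  = comb(N, N // 10)   # worlds with u = (1/10, 9/10): lower entropy
print(off / peak)         # already a vanishingly small fraction

# comb(N, r) tracks e^{N H(u)} up to factors polynomial in N (cf. Lemma 3.11):
u = (0.1, 0.9)
ratio = comb(N, N // 10) / exp(N * H(u))
print(ratio)
```

The first ratio shrinks exponentially in $N$, which is exactly why worlds away from the maximum-entropy point can be ignored; the second stays within polynomially bounded factors of 1, as the lemma requires.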
(A more careful estimate is given in the proof of Lemma 3.11 in the appendix.)

Because of the exponential dependence on $N$ times the entropy, the number of worlds associated with points of high entropy swamps all other worlds as $N$ grows large. This concentration phenomenon, well known in the field of statistical physics, forms the basis for our main result in this section, which asserts that it is possible to compute degrees of belief according to random worlds while ignoring all but those worlds whose entropy is near maximum. The next theorem essentially formalizes this phenomenon.

Theorem 3.13: For all sufficiently small $\vec\tau$, the following is true. Let $Q$ be the set of points with greatest entropy in $S^{\vec\tau}[KB]$ and let $O \subseteq \mathbb{R}^K$ be any open set containing $Q$. Then for all $\theta \in \mathcal{L}^\approx$ and for $\lim^* \in \{\limsup, \liminf\}$, we have
$$\lim^*_{N\to\infty} \Pr^{\vec\tau}_N(\theta \mid KB) = \lim^*_{N\to\infty} \frac{\#worlds^{\vec\tau}_N[O](\theta \wedge KB)}{\#worlds^{\vec\tau}_N[O](KB)}.$$

We remark that this is quite a difficult theorem. We have discussed why Lemma 3.11 lets us look at models of $KB$ whose entropy is (near) maximum. But the theorem tells us to look at the maximum-entropy points of $S^{\vec\tau}[KB]$, which we defined using a (so far unmotivated) syntactic procedure applied to $KB$. It seems reasonable to expect that $S^{\vec\tau}[KB]$ should tell us something about models of $KB$. But making this connection precise, and in particular showing how the maximum-entropy points of $S^{\vec\tau}[KB]$ relate to models of $KB$ with near-maximum entropy, is difficult. We defer all details of the proof to the appendix.

In general, Theorem 3.13 may seem to be of limited usefulness: knowing that we only have to look at worlds near the maximum-entropy point does not substantially reduce the number of worlds we need to consider.
(Indeed, the whole point of the concentration phenomenon is that almost all worlds have high entropy.) Nevertheless, as the rest of this paper shows, this result can be quite useful when combined with the following two results. The first of these says that if all the worlds near the maximum-entropy points have a certain property, then we should have degree of belief 1 that this property holds.

Corollary 3.14: For all sufficiently small $\vec\tau$, the following is true. Let $Q$ be the set of points with greatest entropy in $S^{\vec\tau}[KB]$, let $O \subseteq \mathbb{R}^K$ be an open set containing $Q$, and let $\theta[O] \in \mathcal{L}^=$ be an assertion that holds for every world $W$ such that $\pi(W) \in O$. Then
$$\Pr^{\vec\tau}_\infty(\theta[O] \mid KB) = 1.$$

Example 3.15: For the knowledge base $\textit{true}$ in Example 3.12, it is easy to see that the maximum-entropy point is $(0.5, 0.5)$. Fix some arbitrary $\epsilon > 0$. Clearly, there is some open set $O$ around this point such that the assertion $\theta = \|P(x)\|_x \in [0.5 - \epsilon, 0.5 + \epsilon]$ holds for every world $W$ with $\pi(W) \in O$. Therefore, we can conclude that
$$\Pr^{\vec\tau}_\infty\bigl(\|P(x)\|_x \in [0.5 - \epsilon, 0.5 + \epsilon] \,\bigm|\, \textit{true}\bigr) = 1.$$

As we show in (Bacchus et al., 1994), formulas $\theta$ with degree of belief 1 can essentially be treated just like other knowledge in $KB$. That is, the degrees of belief relative to $KB$ and $KB \wedge \theta$ will be identical (even if $KB$ and $KB \wedge \theta$ are not logically equivalent). More formally:

Theorem 3.16: (Bacchus et al., 1994) If $\Pr^{\vec\tau}_\infty(\theta \mid KB) = 1$ and $\lim^* \in \{\limsup, \liminf\}$, then for any formula $\varphi$:
$$\lim^*_{N\to\infty} \Pr^{\vec\tau}_N(\varphi \mid KB) = \lim^*_{N\to\infty} \Pr^{\vec\tau}_N(\varphi \mid KB \wedge \theta).$$

Proof: For completeness, we repeat the proof from (Bacchus et al., 1994) here. Basic probabilistic reasoning shows that, for any $N$ and $\vec\tau$:
$$\Pr^{\vec\tau}_N(\varphi \mid KB) = \Pr^{\vec\tau}_N(\varphi \mid KB \wedge \theta)\, \Pr^{\vec\tau}_N(\theta \mid KB) + \Pr^{\vec\tau}_N(\varphi \mid KB \wedge \neg\theta)\, \Pr^{\vec\tau}_N(\neg\theta \mid KB).$$
By assumption, $\Pr^{\vec\tau}_N(\theta \mid KB)$ tends to 1 when we take limits, so the first term tends to the corresponding limit of $\Pr^{\vec\tau}_N(\varphi \mid KB \wedge \theta)$. On the other hand, $\Pr^{\vec\tau}_N(\neg\theta \mid KB)$ has limit 0.
Because $\Pr^{\vec\tau}_N(\varphi \mid KB \wedge \neg\theta)$ is bounded, we conclude that the second product also tends to 0. The result follows.

As we shall see in the next section, the combination of Corollary 3.14 and Theorem 3.16 is quite powerful.

4. Computing degrees of belief

Although the concentration phenomenon is interesting, its application to actually computing degrees of belief may not be obvious. Since we know that almost all worlds have high entropy, a direct application of Theorem 3.13 does not substantially reduce the number of worlds we must consider. Yet, as we show in this section, the concentration theorem can form the basis of a practical technique for computing degrees of belief in many cases. We begin in Section 4.1 by presenting the intuitions underlying this technique. In Section 4.2 we build on these intuitions by presenting results for a restricted class of formulas: queries that are quantifier-free formulas over a unary language with a single constant symbol. In spite of this restriction, many of the issues arising in the general case can already be seen here. Moreover, as we show in Section 4.3, this restricted sublanguage is rich enough to allow us to embed two well-known propositional approaches that make use of maximum entropy: Nilsson's probabilistic logic (Nilsson, 1986) and the maximum-entropy extension of $\epsilon$-semantics (Geffner & Pearl, 1990) due to Goldszmidt, Morris, and Pearl (1990) (see also (Goldszmidt, Morris, & Pearl, 1993)). In Section 4.4, we consider whether the results for the restricted language can be extended. We show that they can, but several difficult and subtle issues arise.

4.1 The general strategy

Although the random-worlds method is defined by counting worlds, we can sometimes find more direct ways to calculate the degrees of belief it yields. In (Bacchus et al., 1994) we present a number of such techniques, most of which apply only in very special cases.
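The decomposition used in the proof of Theorem 3.16 is ordinary total probability, and can be verified exactly on arbitrary world counts. The numbers below are hypothetical (any nonnegative counts would do); exact rational arithmetic makes the identity hold on the nose:

```python
from fractions import Fraction as F

# Hypothetical world counts within KB, split by whether phi and theta hold:
pt, pn, nt, nn = 30, 2, 60, 8   # (phi,theta), (phi,~theta), (~phi,theta), (~phi,~theta)
total = pt + pn + nt + nn

pr_phi          = F(pt + pn, total)   # Pr(phi | KB)
pr_theta        = F(pt + nt, total)   # Pr(theta | KB)
pr_phi_g_theta  = F(pt, pt + nt)      # Pr(phi | KB & theta)
pr_phi_g_ntheta = F(pn, pn + nn)      # Pr(phi | KB & ~theta)

# The total-probability decomposition used in the proof:
lhs = pr_phi
rhs = pr_phi_g_theta * pr_theta + pr_phi_g_ntheta * (1 - pr_theta)
print(lhs == rhs)   # True
```

When $\Pr(\theta \mid KB) \to 1$, the first summand tends to $\Pr(\varphi \mid KB \wedge \theta)$ and the second, being a bounded quantity times a vanishing one, tends to 0, which is precisely the argument in the proof.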
One of the simplest and most intuitive is the following version of what philosophers have termed direct inference (Reichenbach, 1949). Suppose that all we know about an individual $c$ is some assertion $\psi(c)$; in other words, $KB$ has the form $\psi(c) \wedge KB'$, and the constant $c$ does not appear in $KB'$. Also suppose that $KB$, together with a particular tolerance $\vec\tau$, implies that $\|\varphi(x) \mid \psi(x)\|_x$ is in some interval $[\alpha, \beta]$. It seems reasonable to argue that $c$ should be treated as a "typical" element satisfying $\psi(x)$, because by assumption $KB$ contains no information suggesting otherwise. Therefore, we might hope to use the statistics directly and conclude that $\Pr^{\vec\tau}_\infty(\varphi(c) \mid KB) \in [\alpha, \beta]$. This is indeed the case, as the following theorem shows.

Theorem 4.1: (Bacchus et al., 1994) Let $KB$ be a knowledge base of the form $\psi(\vec c\,) \wedge KB'$, and assume that for all sufficiently small tolerance vectors $\vec\tau$,
$$KB[\vec\tau] \models \|\varphi(\vec x) \mid \psi(\vec x)\|_{\vec x} \in [\alpha, \beta].$$
If no constant in $\vec c$ appears in $KB'$, in $\varphi(\vec x)$, or in $\psi(\vec x)$, then $\Pr_\infty(\varphi(\vec c\,) \mid KB) \in [\alpha, \beta]$ (if the degree of belief exists at all).

This result, in combination with the results of the previous section, provides us with a very powerful tool. Roughly speaking, we propose to use the following strategy. The basic concentration phenomenon says that most worlds are very similar in a certain sense. As shown in Corollary 3.14, we can use this to find some assertions that are "almost certainly" true (i.e., that have degree of belief 1) even if they are not logically implied by $KB$. Theorem 3.16 then tells us that we can treat these new assertions as if they were in fact known with certainty. When these new assertions state statistical "knowledge", they can vastly increase our opportunities to apply direct inference. The following example illustrates this idea.
Example 4.2: Consider a very simple knowledge base over a vocabulary containing the single unary predicate $P$:
$$KB = \|P(x)\|_x \precsim_1 0.3.$$
There are two atoms over $\mathcal{P}$: $A_1 = P$ and $A_2 = \neg P$. The solution space of this $KB$ given $\vec\tau$ is clearly
$$S^{\vec\tau}[KB] = \{(u_1, u_2) \in \Delta_2 : u_1 \le 0.3 + \tau_1\}.$$
A straightforward computation shows that, for $\tau_1 < 0.2$, this space has a unique maximum-entropy point $\vec v = (0.3 + \tau_1,\, 0.7 - \tau_1)$.

Now consider the query $P(c)$. For all $\epsilon > 0$, let $\theta[\epsilon]$ be the formula $\|P(x)\|_x \in [(0.3 + \tau_1) - \epsilon,\, (0.3 + \tau_1) + \epsilon]$. This satisfies the condition of Corollary 3.14, so it follows that $\Pr^{\vec\tau}_\infty(\theta[\epsilon] \mid KB) = 1$. Using Theorem 3.16, we know that for $\lim^* \in \{\liminf, \limsup\}$,
$$\lim^*_{N\to\infty} \Pr^{\vec\tau}_N(P(c) \mid KB) = \lim^*_{N\to\infty} \Pr^{\vec\tau}_N(P(c) \mid KB \wedge \theta[\epsilon]).$$
But now we can use direct inference. (Note that here our "knowledge" about $c$ is vacuous, i.e., $\textit{true}(c)$.) We conclude that, if there is any limit at all, then necessarily $\Pr^{\vec\tau}_\infty(P(c) \mid KB \wedge \theta[\epsilon]) \in [(0.3 + \tau_1) - \epsilon,\, (0.3 + \tau_1) + \epsilon]$. So, for all $\epsilon > 0$, $\Pr^{\vec\tau}_\infty(P(c) \mid KB) \in [(0.3 + \tau_1) - \epsilon,\, (0.3 + \tau_1) + \epsilon]$. Since this holds for all $\epsilon$, the only possible value for $\Pr^{\vec\tau}_\infty(P(c) \mid KB)$ is $0.3 + \tau_1$, which is the value of $u_1$ (i.e., of $\|P(x)\|_x$) at the maximum-entropy point. It is also clear what happens as $\vec\tau$ tends to $\vec 0$: $\Pr_\infty(P(c) \mid KB)$ is $0.3$.

This example demonstrates the main steps of one possible strategy for computing degrees of belief. First, the maximum-entropy points of the space $S^{\vec\tau}[KB]$ are computed as a function of $\vec\tau$. Then, these are used to compute $\Pr^{\vec\tau}_\infty(\varphi \mid KB)$, assuming the limit exists (if not, the lim sup and lim inf of $\Pr^{\vec\tau}_N(\varphi \mid KB)$ are computed instead). Finally, we compute the limit of this probability as $\vec\tau$ goes to zero. Unfortunately, this strategy has a serious potential problem.
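The maximum-entropy computation in Example 4.2 can be reproduced by a grid search over $\Delta_2$. The sketch below (ours; it uses exact rationals for the grid so the boundary point is represented exactly) confirms that for $\tau_1 < 1/5$ the constraint binds and the maximizer is $u_1 = 0.3 + \tau_1$:

```python
from fractions import Fraction as F
from math import log

def H2(u1):
    """Entropy of the point (u1, 1 - u1) in Delta_2."""
    return -sum(float(p) * log(p) for p in (u1, 1 - u1) if p > 0)

def max_entropy_point(tau, step=F(1, 10_000)):
    """Grid search for the max-entropy u1 over
    S^tau[KB] = {(u1, 1-u1) : u1 <= 3/10 + tau} (Example 4.2)."""
    bound = F(3, 10) + tau
    grid = [i * step for i in range(int(bound / step) + 1)]
    return max(grid, key=H2)

# tau_1 = 1/20 < 1/5: the constraint binds, so u1 = 3/10 + 1/20 = 7/20.
print(max_entropy_point(F(1, 20)))   # 7/20
```

Since $H(u_1, 1-u_1)$ is strictly increasing for $u_1 < 1/2$, the maximizer sits at the constraint boundary whenever the boundary is below $1/2$; for $\tau_1 \ge 1/5$ the unconstrained maximum $u_1 = 1/2$ is feasible and wins instead.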
We clearly cannot compute $\Pr^{\vec\tau}_\infty(\varphi \mid KB)$ separately for each of the infinitely many tolerance vectors $\vec\tau$ and then take the limit as $\vec\tau$ goes to $\vec 0$. We might hope to compute this probability as an explicit function of $\vec\tau$, and then compute the limit. For instance, in Example 4.2, $\Pr^{\vec\tau}_\infty(P(c) \mid KB)$ was found to be $0.3 + \tau_1$, so it is easy to see what happens as $\tau_1 \to 0$. But there is no reason to believe that $\Pr^{\vec\tau}_\infty(\varphi \mid KB)$ is, in general, an easily characterizable function of $\vec\tau$. If it is not, then computing the limit as $\vec\tau$ goes to $\vec 0$ can be difficult or impossible. We would like to find a way to avoid this explicit limiting process altogether. It turns out that this is indeed possible in some circumstances. The main requirement is that the maximum-entropy points of $S^{\vec\tau}[KB]$ converge to the maximum-entropy points of $S^{\vec 0}[KB]$. (For future reference, notice that $S^{\vec 0}[KB]$ is the closure of the solution space of the constraints obtained from $KB$ by replacing all occurrences of $\approx_i$ by $=$ and all occurrences of $\precsim_i$ by $\le$.) In many such cases, we can compute $\Pr_\infty(\varphi \mid KB)$ directly in terms of the maximum-entropy points of $S^{\vec 0}[KB]$, without taking limits at all. As the following example shows, however, this type of continuity does not hold in general: the maximum-entropy points of $S^{\vec\tau}[KB]$ do not necessarily converge to those of $S^{\vec 0}[KB]$.

Example 4.3: Consider the knowledge base
$$KB = \bigl(\|P(x)\|_x \approx_1 0.3 \,\vee\, \|P(x)\|_x \approx_2 0.4\bigr) \,\wedge\, \|P(x)\|_x \not\approx_3 0.4.$$
It is easy to see that $S^{\vec 0}[KB]$ is just $\{(0.3, 0.7)\}$: the point $(0.4, 0.6)$ is disallowed by the second conjunct. Now consider $S^{\vec\tau}[KB]$ for $\vec\tau > \vec 0$. If $\tau_2 \le \tau_3$, then $S^{\vec\tau}[KB]$ indeed does not contain points where $u_1$ is near $0.4$; the maximum-entropy point of this space is easily seen to be the point with $u_1 = 0.3 + \tau_1$. However, if $\tau_2 > \tau_3$, then there will be points in $S^{\vec\tau}[KB]$ where $u_1$ is around $0.4$; for instance, those where $0.4 + \tau_3 < u_1 \le 0.4 + \tau_2$.
Since these points have higher entropy than the points in the vicinity of $0.3$, the former dominate. Thus, the set of maximum-entropy points of $S^{\vec\tau}[KB]$ does not converge to a single well-defined set: what it converges to (if anything) depends on how $\vec\tau$ goes to $\vec 0$. This nonconvergence has consequences for degrees of belief. It is not hard to show that $\Pr^{\vec\tau}_\infty(P(c) \mid KB)$ can be either $0.3 + \tau_1$ or $0.4 + \tau_2$, depending on the precise relationship between $\tau_1$, $\tau_2$, and $\tau_3$. It follows that $\Pr_\infty(P(c) \mid KB)$ does not exist.

We say that a degree of belief $\Pr_\infty(\varphi \mid KB)$ is not robust if the behavior of $\Pr^{\vec\tau}_\infty(\varphi \mid KB)$ (or of $\liminf_{N\to\infty} \Pr^{\vec\tau}_N(\varphi \mid KB)$ and $\limsup_{N\to\infty} \Pr^{\vec\tau}_N(\varphi \mid KB)$) as $\vec\tau$ goes to $\vec 0$ depends on how $\vec\tau$ goes to $\vec 0$. In other words, nonrobustness describes situations in which $\Pr_\infty(\varphi \mid KB)$ does not exist because of sensitivity to the exact choice of tolerances. We shall see a number of other examples of nonrobustness in later sections.

It might seem that the notion of robustness is an artifact of our approach. In particular, it seems to depend on the fact that our language has the expressive power to say that two tolerances represent different degrees of approximation, simply by using different subscripts ($\approx_2$ vs. $\approx_3$ in the example). In an approach to representing approximate equality that does not make these distinctions, we are bound to get the answer $0.3$ in the example above, since then $\|P(x)\|_x \not\approx_3 0.4$ really would be the negation of $\|P(x)\|_x \approx_2 0.4$. We would argue that the answer $0.3$ is not as reasonable as it might at first seem. Suppose one of the two different instances of $0.4$ in the previous example had been slightly different; for example, suppose we had used $0.399$ rather than $0.4$ in the first of them. In this case, the second conjunct is essentially vacuous and can be ignored.
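The sensitivity to how $\vec\tau \to \vec 0$ in Example 4.3 shows up directly in a numerical search. This is our own sketch over an inner grid approximation of $S^{\vec\tau}[KB]$ (boundary points may be dropped by floating-point comparisons, which does not affect the qualitative outcome):

```python
from math import log

def H2(u1):
    """Entropy of (u1, 1 - u1)."""
    return -sum(p * log(p) for p in (u1, 1 - u1) if p > 0)

def max_entropy_u1(t1, t2, t3, step=1e-4):
    """Max-entropy u1 over a grid approximation of S^tau[KB] from
    Example 4.3: (u1 within t1 of 0.3  or  u1 within t2 of 0.4)
    and u1 NOT within t3 of 0.4."""
    feasible = [u for u in (i * step for i in range(int(1 / step) + 1))
                if (abs(u - 0.3) <= t1 or abs(u - 0.4) <= t2)
                and abs(u - 0.4) > t3]
    return max(feasible, key=H2)

# tau_2 <= tau_3: everything near 0.4 is excluded; maximizer ~ 0.3 + tau_1.
print(round(max_entropy_u1(0.01, 0.01, 0.01), 3))   # 0.31
# tau_2 > tau_3: points with 0.4 + tau_3 < u1 <= 0.4 + tau_2 survive and,
# having higher entropy, win; maximizer jumps to ~ 0.4 + tau_2.
print(round(max_entropy_u1(0.01, 0.02, 0.01), 3))   # 0.42
```

The maximizer jumps discontinuously between the two regimes, which is exactly the nonrobustness described in the text: the answer depends on the path along which $\vec\tau$ approaches $\vec 0$.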
The maximum-entropy point in $S_{\vec 0}[KB]$ now has $u_1 = 0.399$, and we indeed derive a degree of belief of $0.399$ in $P(c)$. Thus, arbitrarily small changes to the numbers in the original knowledge base can cause large changes in our degrees of belief. But these numbers are almost always the result of approximate observations; this is reflected by our decision to use approximate equality rather than equality when referring to them. It does not seem reasonable to base actions on a degree of belief that can change so drastically in the face of small changes in the measurement of data. Note that if we know that the two instances of $0.4$ do, in fact, denote exactly the same number, we can represent this by using the same approximate-equality connective in both disjuncts. In this case, it is easy to see that we do get the answer $0.3$.

A close look at the example shows that the nonrobustness arises because of the negated proportion expression $\|P(x)\|_x \not\approx_3 0.4$. Indeed, we can show that if we start with a KB in canonical form that does not contain negated proportion expressions then, in a precise sense, the set of maximum-entropy points of $S_{\vec\tau}[KB]$ necessarily converges to the set of maximum-entropy points of $S_{\vec 0}[KB]$. An argument can be made that we should eliminate negated proportion expressions from the language altogether. It is one thing to argue that sometimes we have statistical values whose accuracy we are unsure about, so that we want to make logical assertions less stringent than exact numerical equality. It is harder to think of cases in which the opposite is true, and all we know is that some statistic is "not even approximately" equal to some value. However, we do not eliminate negated proportion expressions from the language, since without them we would not be able to prove an analogue to Theorem 3.5.
(They arise when we try to flatten nested proportion expressions, for example.) Instead, we have identified a weaker condition that suffices to prevent problems such as the one seen in Example 4.3. Essential positivity simply tests that negations are not interacting with the maximum-entropy computation in a harmful way.

Definition 4.4: Let $\Gamma^{\le}(KB[\vec 0])$ be the result of replacing each strict inequality in $\Gamma(KB[\vec 0])$ with its weakened version. More formally, we replace each subformula of the form $t < 0$ with $t \le 0$, and each subformula of the form $t > 0$ with $t \ge 0$. (Recall that these are the only constraints possible in $\Gamma(KB[\vec 0])$, since all tolerance variables $\varepsilon_i$ are assigned $0$.) Let $S^{\le}_{\vec 0}[KB]$ be $\overline{Sol[\Gamma^{\le}(KB[\vec 0])]}$, where we use $\overline{X}$ to denote the closure of $X$. We say that KB is essentially positive if the sets $S^{\le}_{\vec 0}[KB]$ and $S_{\vec 0}[KB]$ have the same maximum-entropy points.

Example 4.5: Consider again the knowledge base KB from Example 4.3. The constraint formula $\Gamma(KB[\vec 0])$ is (after simplification):

$(u_1 = 0.3 \lor u_1 = 0.4) \land (u_1 < 0.4 \lor u_1 > 0.4)$.

Its "weakened" version $\Gamma^{\le}(KB[\vec 0])$ is:

$(u_1 = 0.3 \lor u_1 = 0.4) \land (u_1 \le 0.4 \lor u_1 \ge 0.4)$,

which is clearly equivalent to $u_1 = 0.3 \lor u_1 = 0.4$. Thus, $S_{\vec 0}[KB] = \{(u_1, u_2) \in \Delta_2 : u_1 = 0.3\}$, whereas $S^{\le}_{\vec 0}[KB] = S_{\vec 0}[KB] \cup \{(0.4, 0.6)\}$. Since the two spaces have different maximum-entropy points, the knowledge base KB is not essentially positive.

As the following result shows, essential positivity suffices to guarantee that the maximum-entropy points of $S_{\vec\tau}[KB]$ converge to those of $S_{\vec 0}[KB]$.

Proposition 4.6: Assume that KB is essentially positive and let $Q$ be the set of maximum-entropy points of $S_{\vec 0}[KB]$ (and thus also of $S^{\le}_{\vec 0}[KB]$).
Then for all $\epsilon > 0$ and all sufficiently small tolerance vectors $\vec\tau$ (where "sufficiently small" may depend on $\epsilon$), every maximum-entropy point of $S_{\vec\tau}[KB]$ is within $\epsilon$ of some maximum-entropy point in $Q$.

4.2 Queries for a single individual

We now show how to compute $\Pr_\infty(\varphi \mid KB)$ for a certain restricted class of first-order formulas $\varphi$ and knowledge bases KB. The most significant restriction is that the query $\varphi$ should be a quantifier-free (first-order) sentence over the vocabulary $\mathcal{P} \cup \{c\}$; thus, it is a query about a single individual, $c$. While this class is rather restrictive, it suffices to express many real-life examples. Moreover, it is significantly richer than the language considered by Paris and Vencovska (1989). The following definition helps define the class of interest.

Definition 4.7: A formula is essentially propositional if it is a quantifier-free and proportion-free formula in the language $\mathcal{L}^{\approx}(\{P_1, \ldots, P_k\})$ (so that, in particular, it has no constant symbols) and has only one free variable $x$. We say that $\varphi(c)$ is a simple query for KB if:

* $\varphi(x)$ is essentially propositional,
* KB is of the form $\psi(c) \land KB'$, where $\psi(x)$ is essentially propositional and $KB'$ does not mention $c$.

Thus, just as in Theorem 4.1, we suppose that $\psi(c)$ summarizes all that is known about $c$. In this section, we focus on computing the degree of belief $\Pr_\infty(\varphi(c) \mid KB)$ for a formula $\varphi(c)$ which is a simple query for KB.

Note that an essentially propositional formula $\xi(x)$ is equivalent to a disjunction of atoms. For example, over the vocabulary $\{P_1, P_2\}$, the formula $P_1(x) \lor P_2(x)$ is equivalent to $A_1(x) \lor A_2(x) \lor A_3(x)$ (where the atoms are ordered as in Example 3.2). For an essentially propositional formula $\xi$, we take $A(\xi)$ to be the (unique) set of atoms such that $\xi$ is equivalent to $\bigvee_{A_j \in A(\xi)} A_j(x)$.
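The identification of an essentially propositional formula with its atom set is purely mechanical. A minimal sketch (function names are ours; we assume the atom ordering of Example 3.2, positive literals first) that computes $A(\xi)$ by enumerating truth assignments:

```python
from itertools import product

# Assumed ordering for k = 2: A1 = P1 AND P2, A2 = P1 AND NOT P2,
# A3 = NOT P1 AND P2, A4 = NOT P1 AND NOT P2.
def atom_set(xi, k):
    """A(xi): the 1-based indices of the atoms whose truth assignment satisfies xi."""
    atoms = list(product([True, False], repeat=k))
    return {j + 1 for j, a in enumerate(atoms) if xi(*a)}

# xi(x) = P1(x) OR P2(x) is equivalent to A1(x) OR A2(x) OR A3(x):
print(atom_set(lambda p1, p2: p1 or p2, 2))  # {1, 2, 3}
```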
If we view a tuple $\vec u \in \Delta_K$ as a probability assignment to the atoms, we can extend $\vec u$ to a probability assignment on all essentially propositional formulas using this identification of an essentially propositional formula with a set of atoms:

Definition 4.8: Let $\xi$ be an essentially propositional formula. We define a function $F^{[\xi]} : \Delta_K \to \mathbb{R}$ as follows:

$F^{[\xi]}(\vec u) = \sum_{A_j \in A(\xi)} u_j$.

For essentially propositional formulas $\varphi(x)$ and $\psi(x)$, we define the (partial) function $F^{[\varphi \mid \psi]} : \Delta_K \to \mathbb{R}$ to be:

$F^{[\varphi \mid \psi]}(\vec u) = \frac{F^{[\varphi \land \psi]}(\vec u)}{F^{[\psi]}(\vec u)}$.

Note that this function is undefined when $F^{[\psi]}(\vec u) = 0$.

As the following result shows, if $\varphi$ is a simple query for KB (of the form $\psi(c) \land KB'$), then all that matters in computing $\Pr_\infty(\varphi \mid KB)$ is $F^{[\varphi \mid \psi]}(\vec u)$ for tuples $\vec u$ of maximum entropy. Thus, in a sense, we are only using $KB'$ to determine the space over which we maximize entropy. Having defined this space, we can focus on $\psi$ and $\varphi$ in determining the degree of belief.

Theorem 4.9: Suppose $\varphi(c)$ is a simple query for KB. For all $\vec\tau$ sufficiently small, if $Q$ is the set of maximum-entropy points in $S_{\vec\tau}[KB]$ and $F^{[\psi]}(\vec v) > 0$ for all $\vec v \in Q$, then for $\lim^* \in \{\limsup, \liminf\}$ we have

$\lim^*_{N\to\infty} \Pr^{\vec\tau}_N(\varphi(c) \mid KB) \in \left[\inf_{\vec v \in Q} F^{[\varphi \mid \psi]}(\vec v),\; \sup_{\vec v \in Q} F^{[\varphi \mid \psi]}(\vec v)\right]$.

The following is an immediate but important corollary of this theorem. It asserts that if the space $S_{\vec\tau}[KB]$ has a unique maximum-entropy point, then its value uniquely determines the probability $\Pr^{\vec\tau}_\infty(\varphi(c) \mid KB)$.

Corollary 4.10: Suppose $\varphi(c)$ is a simple query for KB. For all $\vec\tau$ sufficiently small, if $\vec v$ is the unique maximum-entropy point in $S_{\vec\tau}[KB]$ and $F^{[\psi]}(\vec v) > 0$, then

$\Pr^{\vec\tau}_\infty(\varphi(c) \mid KB) = F^{[\varphi \mid \psi]}(\vec v)$.

We are interested in $\Pr_\infty(\varphi(c) \mid KB)$, which means that we are interested in the limit of $\Pr^{\vec\tau}_\infty(\varphi(c) \mid KB)$ as $\vec\tau \to \vec 0$. Suppose KB is essentially positive.
Then, by the results of the previous section and the continuity of $F^{[\varphi \mid \psi]}$, it is enough to look directly at the maximum-entropy points of $S_{\vec 0}[KB]$. More formally, by combining Theorem 4.9 with Proposition 4.6, we can show:

Theorem 4.11: Suppose $\varphi(c)$ is a simple query for KB. If the space $S_{\vec 0}[KB]$ has a unique maximum-entropy point $\vec v$, KB is essentially positive, and $F^{[\psi]}(\vec v) > 0$, then

$\Pr_\infty(\varphi(c) \mid KB) = F^{[\varphi \mid \psi]}(\vec v)$.

We believe that this theorem will turn out to cover a lot of cases that occur in practice. As our examples and the discussion in the next section show, we often do get simple queries and knowledge bases that are essentially positive. Concerning the assumption of a unique maximum-entropy point, note that the entropy function is strictly concave, and so this assumption is automatically satisfied if $S_{\vec 0}[KB]$ is a convex space. Recall that a space $S$ is convex if for all $\vec u, \vec u' \in S$ and all $\alpha \in [0, 1]$, it is also the case that $\alpha\vec u + (1 - \alpha)\vec u' \in S$. The space $S_{\vec 0}[KB]$ is surely convex if it is defined using a conjunction of linear constraints. While it is clearly possible to create knowledge bases where $S_{\vec 0}[KB]$ has multiple maximum-entropy points (for example, using disjunctions), we expect that such knowledge bases arise rarely in practical applications.

Perhaps the most restrictive assumption made by this theorem is the seemingly innocuous requirement that $F^{[\psi]}(\vec v) > 0$. This assumption is obviously necessary for the theorem to hold; without it, the function $F^{[\varphi \mid \psi]}$ is simply not defined. Unfortunately, we show in Section 4.4 that this requirement is, in fact, a severe one; in particular, it prevents the theorem from being applied to most examples derived from default reasoning, using our statistical interpretation of defaults (Bacchus et al., 1994). We close this subsection with an example of the theorem in action.

Example 4.12: Let the language consist of $\mathcal{P} = \{\mathit{Hepatitis}, \mathit{Jaundice}, \mathit{BlueEyed}\}$ and the constant $\mathit{Eric}$.
There are eight atoms in this language. We use $A_{P'_1 P'_2 P'_3}$ to denote the atom $P'_1(x) \land P'_2(x) \land P'_3(x)$, where $P'_1$ is either $H$ (denoting $\mathit{Hepatitis}$) or $\bar H$ (denoting $\lnot\mathit{Hepatitis}$), $P'_2$ is $J$ or $\bar J$ (for $\mathit{Jaundice}$ and $\lnot\mathit{Jaundice}$, respectively), and $P'_3$ is $B$ or $\bar B$ (for $\mathit{BlueEyed}$ and $\lnot\mathit{BlueEyed}$, respectively). Consider the knowledge base $KB_{\mathrm{hep}}$:

$\forall x\,(\mathit{Hepatitis}(x) \Rightarrow \mathit{Jaundice}(x)) \land \|\mathit{Hepatitis}(x) \mid \mathit{Jaundice}(x)\|_x \approx_1 0.8 \land \|\mathit{BlueEyed}(x)\|_x \approx_2 0.25 \land \mathit{Jaundice}(\mathit{Eric})$.

If we order the atoms as $A_{HJB}$, $A_{HJ\bar B}$, $A_{H\bar JB}$, $A_{H\bar J\bar B}$, $A_{\bar HJB}$, $A_{\bar HJ\bar B}$, $A_{\bar H\bar JB}$, $A_{\bar H\bar J\bar B}$, then it is not hard to show that $\Gamma(KB_{\mathrm{hep}})$ is:

$u_3 = 0 \;\land\; u_4 = 0$
$\land\; (u_1 + u_2) \le (0.8 + \varepsilon_1)(u_1 + u_2 + u_5 + u_6)$
$\land\; (u_1 + u_2) \ge (0.8 - \varepsilon_1)(u_1 + u_2 + u_5 + u_6)$
$\land\; (u_1 + u_3 + u_5 + u_7) \le (0.25 + \varepsilon_2)$
$\land\; (u_1 + u_3 + u_5 + u_7) \ge (0.25 - \varepsilon_2)$
$\land\; (u_1 + u_2 + u_5 + u_6) > 0$.

To find the space $S_{\vec 0}[KB_{\mathrm{hep}}]$ we simply set $\varepsilon_1 = \varepsilon_2 = 0$. It is then quite straightforward to find the maximum-entropy point in this space, which, taking $\gamma = 2^{1.6}$, is:

$(v_1, \ldots, v_8) = \left(\frac{1}{5+\gamma},\; \frac{3}{5+\gamma},\; 0,\; 0,\; \frac{1}{4(5+\gamma)},\; \frac{3}{4(5+\gamma)},\; \frac{\gamma}{4(5+\gamma)},\; \frac{3\gamma}{4(5+\gamma)}\right)$.

Using $\vec v$, we can compute various asymptotic probabilities very easily. For example,

$\Pr_\infty(\mathit{Hepatitis}(\mathit{Eric}) \mid KB_{\mathrm{hep}}) = F^{[\mathit{Hepatitis} \mid \mathit{Jaundice}]}(\vec v) = \frac{v_1 + v_2}{v_1 + v_2 + v_5 + v_6} = \frac{\frac{4}{5+\gamma}}{\frac{4}{5+\gamma} + \frac{1}{5+\gamma}} = 0.8$,

as expected. Similarly, we can show that $\Pr_\infty(\mathit{BlueEyed}(\mathit{Eric}) \mid KB_{\mathrm{hep}}) = 0.25$ and that $\Pr_\infty(\mathit{BlueEyed}(\mathit{Eric}) \land \mathit{Hepatitis}(\mathit{Eric}) \mid KB_{\mathrm{hep}}) = 0.2$. Note that the first two answers also follow from the direct inference principle (Theorem 4.1), which happens to be applicable in this case. The third answer shows that $\mathit{BlueEyed}$ and $\mathit{Hepatitis}$ are being treated as independent.
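The arithmetic in Example 4.12 can be sanity-checked numerically. The sketch below (ours, not part of the paper) verifies that the stated point satisfies the constraints of $\Gamma(KB_{\mathrm{hep}})$ at $\varepsilon_1 = \varepsilon_2 = 0$ and reproduces the three degrees of belief:

```python
g = 2 ** 1.6                       # gamma in the example
d = 5 + g
# atom order: A_HJB, A_HJB', A_HJ'B, A_HJ'B', A_H'JB, A_H'JB', A_H'J'B, A_H'J'B'
v = [1/d, 3/d, 0, 0, 1/(4*d), 3/(4*d), g/(4*d), 3*g/(4*d)]

assert abs(sum(v) - 1) < 1e-9                          # a point of the simplex
jaundice = v[0] + v[1] + v[4] + v[5]
assert abs((v[0] + v[1]) - 0.8 * jaundice) < 1e-9      # proportion of Hepatitis among Jaundice is 0.8
assert abs((v[0] + v[2] + v[4] + v[6]) - 0.25) < 1e-9  # proportion of BlueEyed is 0.25

# degrees of belief via F[phi|psi], with psi = Jaundice:
assert abs((v[0] + v[1]) / jaundice - 0.8) < 1e-9      # Hepatitis(Eric)
assert abs((v[0] + v[4]) / jaundice - 0.25) < 1e-9     # BlueEyed(Eric)
assert abs(v[0] / jaundice - 0.2) < 1e-9               # BlueEyed AND Hepatitis: 0.25 * 0.8
```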
It is a special case of a more general independence phenomenon that applies to random worlds; see (Bacchus et al., 1994, Theorem 5.27).

4.3 Probabilistic propositional logic

In this section we consider two variants of probabilistic propositional logic. As the following discussion shows, both can easily be captured by our framework. The embedding we discuss uses simple queries throughout, allowing us to appeal to the results of the previous section.

Nilsson (1986) considered the problem of what can be inferred about the probability of certain propositions given some constraints. For example, we might know that $\Pr(\mathit{fly} \mid \mathit{bird}) \ge 0.7$ and that $\Pr(\mathit{yellow}) \le 0.2$, and be interested in $\Pr(\mathit{fly} \mid \mathit{bird} \land \mathit{yellow})$. Roughly speaking, Nilsson suggests computing this by considering all probability distributions consistent with the constraints, and then computing the range of values given to $\Pr(\mathit{fly} \mid \mathit{bird} \land \mathit{yellow})$ by these distributions. Formally, suppose our language consists of $k$ primitive propositions, $p_1, \ldots, p_k$. Consider the set $\Omega$ of $K = 2^k$ truth assignments to these propositions. We give semantics to probabilistic statements over this language in terms of a probability distribution $\mu$ over the set $\Omega$ (see (Fagin, Halpern, & Megiddo, 1990) for details). Since each truth assignment $\omega \in \Omega$ determines the truth value of every propositional formula $\beta$, we can determine the probability of every such formula:

$\Pr_\mu(\beta) = \sum_{\omega \models \beta} \mu(\omega)$.

Clearly, we can determine whether a probability distribution $\mu$ satisfies a set $\Sigma$ of probabilistic constraints. The standard notion of probabilistic propositional inference would say that $\Sigma \models \Pr(\beta) \in [\rho_1, \rho_2]$ if $\Pr_\mu(\beta)$ is within the range $[\rho_1, \rho_2]$ for every distribution $\mu$ that satisfies the constraints in $\Sigma$. Unfortunately, while this is a very natural definition, the conclusions that one can derive from it are typically quite weak.
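Just how weak is easy to see concretely. The two distributions below (our own toy construction, with worlds written as (bird, fly, yellow) triples) both satisfy $\Pr(\mathit{fly} \mid \mathit{bird}) \ge 0.7$ and $\Pr(\mathit{yellow}) \le 0.2$, yet give opposite answers for $\Pr(\mathit{fly} \mid \mathit{bird} \land \mathit{yellow})$, so the entailed range is the vacuous $[0, 1]$:

```python
def pr(mu, event):
    """Probability of an event under a distribution over truth assignments."""
    return sum(p for world, p in mu.items() if event(*world))

# Omitted worlds have probability 0.
mu_a = {(1, 1, 0): 0.8, (1, 1, 1): 0.1, (1, 0, 0): 0.1}
mu_b = {(1, 1, 0): 0.8, (1, 0, 1): 0.1, (0, 0, 0): 0.1}

for mu in (mu_a, mu_b):  # both satisfy the two constraints
    assert pr(mu, lambda b, f, y: b and f) / pr(mu, lambda b, f, y: b) >= 0.7
    assert pr(mu, lambda b, f, y: y) <= 0.2

def cond(mu):
    return pr(mu, lambda b, f, y: b and f and y) / pr(mu, lambda b, f, y: b and y)

print(cond(mu_a), cond(mu_b))  # 1.0 0.0
```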
For that reason, Nilsson suggested strengthening this notion of inference by applying the principle of maximum entropy: rather than considering all distributions $\mu$ satisfying $\Sigma$, we consider only the distribution(s) $\mu_\Sigma$ that have the greatest entropy among those satisfying the constraints. As we now show, one implication of our results is that the random-worlds method provides a principled motivation for this introduction of maximum entropy to probabilistic propositional reasoning. In fact, the connection between probabilistic propositional reasoning and random worlds should now be fairly clear:

* The primitive propositions $p_1, \ldots, p_k$ correspond to the unary predicates $P_1, \ldots, P_k$.

* A propositional formula $\beta$ over $p_1, \ldots, p_k$ corresponds uniquely to an essentially propositional formula $\xi_\beta$: we replace each occurrence of the propositional symbol $p_i$ with $P_i(x)$.

* The set $\Sigma$ of probabilistic constraints corresponds to a knowledge base $KB'[\Sigma]$, a constant-free knowledge base containing only proportion expressions. The correspondence is as follows: a probability expression of the form $\Pr(\beta)$ appearing in $\Sigma$ is replaced by the proportion expression $\|\xi_\beta(x)\|_x$; similarly, a conditional probability expression $\Pr(\beta \mid \beta')$ is replaced by $\|\xi_\beta(x) \mid \xi_{\beta'}(x)\|_x$. Each comparison connective $=$ is replaced by $\approx_i$ for some $i$, and each $\le$ by $\precsim_i$. (The particular choices of approximate-equality connectives do not matter in this context.) The other elements that can appear in a proportion formula (such as rational numbers and arithmetical connectives) remain unchanged. For example, the formula $\Pr(\mathit{fly} \mid \mathit{bird}) \ge 0.7$ would correspond to the proportion formula $\|\mathit{Fly}(x) \mid \mathit{Bird}(x)\|_x \succsim_i 0.7$.

* There is a one-to-one correspondence between truth assignments and atoms: the truth assignment $\omega$
corresponds to the atom $A = P'_1 \land \cdots \land P'_k$, where $P'_i$ is $P_i$ if $\omega(p_i) = \mathbf{true}$ and $\lnot P_i$ otherwise. Let $\omega_1, \ldots, \omega_K$ be the truth assignments corresponding to the atoms $A_1, \ldots, A_K$, respectively.

* There is a one-to-one correspondence between probability distributions over the set $\Omega$ of truth assignments and points in $\Delta_K$. For each point $\vec u \in \Delta_K$, let $\mu_{\vec u}$ denote the corresponding probability distribution over $\Omega$, where $\mu_{\vec u}(\omega_i) = u_i$.

Remark 4.13: Clearly, $\omega_j \models \beta$ iff $A_j \in A(\xi_\beta)$. Therefore, for all $\vec u$, we have $F^{[\xi_\beta]}(\vec u) = \Pr_{\mu_{\vec u}}(\beta)$.

The following result demonstrates the tight connection between probabilistic propositional reasoning using maximum entropy and random worlds.

Theorem 4.14: Let $\Sigma$ be a conjunction of constraints of the form $\Pr(\beta \mid \beta') = \rho$ or $\Pr(\beta \mid \beta') \in [\rho_1, \rho_2]$. There is a unique probability distribution $\mu_\Sigma$ of maximum entropy satisfying $\Sigma$. Moreover, for all $\beta$ and $\beta'$, if $\Pr_{\mu_\Sigma}(\beta') > 0$, then

$\Pr_\infty(\xi_\beta(c) \mid \xi_{\beta'}(c) \land KB'[\Sigma]) = \Pr_{\mu_\Sigma}(\beta \mid \beta')$.

Theorem 4.14 is an easy corollary of Theorem 4.11. To check that the preconditions of the latter theorem apply, note that the constraints in $\Sigma$ are linear, and so the space $S_{\vec 0}[KB'[\Sigma]]$ has a unique maximum-entropy point $\vec v$. In fact, it is easy to show that $\mu_{\vec v}$ is the (unique) maximum-entropy probability distribution over $\Omega$ satisfying the constraints $\Sigma$. In addition, because there are no negated proportion expressions in $\Sigma$, the formula $KB = \xi_{\beta'}(c) \land KB'[\Sigma]$ is certainly essentially positive. Most applications of probabilistic propositional reasoning consider simple constraints of the form used in the theorem, and so such applications can be viewed as very special cases of the random-worlds approach. In fact, this theorem is essentially a very old one.
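A hand-worked special case of Theorem 4.14 (our own example, with $k = 2$ and the single constraint $\Pr(p_1) = 0.3$): maximum entropy splits each constrained block of mass uniformly, and the resulting degree of belief in $p_2$ given $p_1$ is $0.5$.

```python
import math, random

def H(u):
    """H(u) = -sum u_i ln u_i, with 0*ln 0 = 0."""
    return -sum(x * math.log(x) for x in u if x > 0)

# Atoms ordered p1&p2, p1&~p2, ~p1&p2, ~p1&~p2; constraint: u1 + u2 = 0.3.
# Claimed maximum-entropy point: uniform within each block.
v = [0.15, 0.15, 0.35, 0.35]

random.seed(0)
for _ in range(1000):          # v beats a sample of feasible points
    a = random.uniform(0, 0.3)
    b = random.uniform(0, 0.7)
    assert H([a, 0.3 - a, b, 0.7 - b]) <= H(v) + 1e-12

# Theorem 4.14: the degree of belief equals Pr under the max-entropy distribution.
print(v[0] / (v[0] + v[1]))  # 0.5
```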
The connection between counting "worlds" and the entropy maximum in a space defined by a conjunction of linear constraints is very well known. It has been extensively studied in the field of thermodynamics, starting with the nineteenth-century work of Maxwell and Gibbs. Recently, this type of reasoning has been applied to problems in an AI context by Paris and Vencovska (1989) and Shastri (1989). The work of Paris and Vencovska is particularly relevant because they also realize the necessity of adopting a formal notion of "approximation", although the precise details of their approach differ from ours.

To the best of our knowledge, most of the work on probabilistic propositional reasoning, and all formal presentations of the entropy/worlds connection (in particular, those of (Paris & Vencovska, 1989; Shastri, 1989)), have limited themselves to conjunctions of linear constraints. Our more general language gives us a great deal of additional expressive power. For example, it is quite reasonable to want the ability to express that properties are (approximately) statistically independent. We may wish to assert that $\mathit{Bird}(x)$ and $\mathit{Yellow}(x)$ are independent properties by saying $\|\mathit{Bird}(x) \land \mathit{Yellow}(x)\|_x \approx \|\mathit{Bird}(x)\|_x \cdot \|\mathit{Yellow}(x)\|_x$. Clearly, such constraints are not linear. Nevertheless, our Theorem 4.11 covers such cases and much more.

A version of probabilistic propositional reasoning has also been used to provide probabilistic semantics for default reasoning (Pearl, 1989). Here also, the connection to random worlds is of interest. In particular, it follows from Corollary 4.10 that the recent work of Goldszmidt, Morris, and Pearl (1990) can be embedded in the random-worlds framework. In the rest of this subsection, we explain their approach and the embedding.

Consider a language consisting of propositional formulas over the propositional variables $p_1, \ldots, p_k$, and default rules of the form $B \to$
$C$ (read "$B$'s are typically $C$'s"), where $B$ and $C$ are propositional formulas. A distribution $\mu$ is said to $\epsilon$-satisfy a default rule $B \to C$ if $\mu(C \mid B) \ge 1 - \epsilon$. In addition to default rules, the framework also permits the use of material implication in a rule, as in $B \Rightarrow C$. A distribution $\mu$ is said to satisfy such a rule if $\mu(C \mid B) = 1$. A parameterized probability distribution (PPD) is a collection $\{\mu_\epsilon\}_{\epsilon > 0}$ of probability distributions over $\Omega$, parameterized by $\epsilon$. A PPD $\{\mu_\epsilon\}_{\epsilon > 0}$ $\epsilon$-satisfies a set $R$ of rules if, for every $\epsilon$, $\mu_\epsilon$ $\epsilon$-satisfies every default rule $r \in R$ and satisfies every non-default rule $r \in R$. A set $R$ of default rules $\epsilon$-entails $B \to C$ if, for every PPD that $\epsilon$-satisfies $R$, $\lim_{\epsilon \to 0} \mu_\epsilon(C \mid B) = 1$.

As shown in (Geffner & Pearl, 1990), $\epsilon$-entailment possesses a number of reasonable properties typically associated with default reasoning, including a preference for more specific information. However, there are a number of desirable properties that it does not have. Among other things, irrelevant information is not ignored. (See (Bacchus et al., 1994) for an extensive discussion of this issue.) To obtain additional desirable properties, $\epsilon$-semantics is extended in (Goldszmidt et al., 1990) by an application of the principle of maximum entropy. Instead of considering all possible PPDs, as above, we consider only the PPD $\{\mu^*_{\epsilon,R}\}_{\epsilon > 0}$ such that, for each $\epsilon$, $\mu^*_{\epsilon,R}$ has the maximum entropy among distributions that $\epsilon$-satisfy all the rules in $R$. (See (Goldszmidt et al., 1990) for precise definitions and technical details.) Note that, since the constraints used to define $\mu^*_{\epsilon,R}$ are all linear, there is indeed a unique such point of maximum entropy. A rule $B \to C$ is an ME-plausible consequence of $R$ if $\lim_{\epsilon \to 0} \mu^*_{\epsilon,R}(C \mid B) = 1$.
The notion of ME-plausible consequence is analyzed in detail in (Goldszmidt et al., 1990), where it is shown to inherit all the nice properties of $\epsilon$-entailment (such as the preference for more specific information) while successfully ignoring irrelevant information. Equally importantly, algorithms are provided for computing the ME-plausible consequences of a set of rules in certain cases.

Our maximum-entropy results can be used to show that the approach of (Goldszmidt et al., 1990) can be embedded in our framework in a straightforward manner. We simply translate a default rule $r$ of the form $B \to C$ into a first-order default rule

$\theta_r =_{\mathrm{def}} \|\xi_C(x) \mid \xi_B(x)\|_x \approx_1 1$,

as in our earlier translation of Nilsson's approach. Note that the formulas that arise under this translation all use the same approximate-equality connective $\approx_1$. The reason is that the approach of (Goldszmidt et al., 1990) uses the same $\epsilon$ for all default rules. We can similarly translate a (non-default) rule $r$ of the form $B \Rightarrow C$ into a first-order constraint using universal quantification:

$\theta_r =_{\mathrm{def}} \forall x\,(\xi_B(x) \Rightarrow \xi_C(x))$.

Under this translation, we can prove the following theorem.

Theorem 4.15: Let $c$ be a constant symbol. Using the translation described above, for a set $R$ of defeasible rules, $B \to C$ is an ME-plausible consequence of $R$ iff

$\Pr_\infty\!\left(\xi_C(c) \;\middle|\; \xi_B(c) \land \bigwedge_{r \in R} \theta_r\right) = 1$.

In particular, this theorem implies that all the computational techniques and results described in (Goldszmidt et al., 1990) carry over to this special case of the random-worlds method. It also shows that random worlds provides a principled justification for the approach of (Goldszmidt et al., 1990), one which is quite different from the justification given there.

4.4 Beyond simple queries

In Section 4.2 we restricted attention to simple queries.
Our main result, Theorem 4.11, needed other assumptions as well: essential positivity, the existence of a unique maximum-entropy point $\vec v$, and the requirement that $F^{[\psi]}(\vec v) > 0$. We believe that this theorem is useful in spite of its limitations, as demonstrated by the discussion in Section 4.3. Nevertheless, this result allows us to take advantage of only a small fragment of our rich language. Can we find a more general theorem? After all, the basic concentration result (Theorem 3.13) holds with essentially no restrictions. In this section we show that it is indeed possible to extend Theorem 4.11 significantly. However, there are serious limitations and subtleties. We illustrate these problems by means of examples, and then state an extended result.

Our attempt to address these problems (so far as is possible) leads to a rather complicated final result. In fact, the problems we discuss are as interesting and important as the theorem we actually give: they help us understand more of the limits of maximum entropy. Of course, every issue we discuss in this subsection is relatively minor compared to maximum entropy's main (apparent) restriction, which concerns the use of non-unary predicates. The reader who is less concerned about these lesser issues can skip directly to Section 5.

We first consider the restrictions we placed on the KB, and show the difficulties that arise if we drop them. We start with the restriction to a single maximum-entropy point. As the concentration theorem (Theorem 3.13) shows, the entropy of almost every world is near maximum. But it does not follow that all the maximum-entropy points are surrounded by similar numbers of worlds. Thus, in the presence of more than one maximum-entropy point, we face the problem of finding the relative importance, or weighting, of each maximum-entropy point.
As the following example illustrates, this weighting is often sensitive to the tolerance values. For this reason, non-unique entropy maxima often lead to nonrobustness.

Example 4.16: Suppose $\Phi = \{P, c\}$, and consider the knowledge base

$KB = (\|P(x)\|_x \precsim_1 0.3) \lor (\|P(x)\|_x \succsim_2 0.7)$.

Assume we want to compute $\Pr_\infty(P(c) \mid KB)$. In this case, $S_{\vec\tau}[KB]$ is

$\{(u_1, u_2) \in \Delta_2 : u_1 \le 0.3 + \tau_1 \text{ or } u_1 \ge 0.7 - \tau_2\}$,

and $S_{\vec 0}[KB]$ is

$\{(u_1, u_2) \in \Delta_2 : u_1 \le 0.3 \text{ or } u_1 \ge 0.7\}$.

Note that $S_{\vec 0}[KB]$ has two maximum-entropy points: $(0.3, 0.7)$ and $(0.7, 0.3)$. Now consider the maximum-entropy points of $S_{\vec\tau}[KB]$ for $\vec\tau > \vec 0$. It is not hard to show that if $\tau_1 > \tau_2$, then this space has a unique maximum-entropy point, $(0.3 + \tau_1, 0.7 - \tau_1)$; in this case, $\Pr^{\vec\tau}_\infty(P(c) \mid KB) = 0.3 + \tau_1$. On the other hand, if $\tau_1 < \tau_2$, then the unique maximum-entropy point of this space is $(0.7 - \tau_2, 0.3 + \tau_2)$, in which case $\Pr^{\vec\tau}_\infty(P(c) \mid KB) = 0.7 - \tau_2$. If $\tau_1 = \tau_2$, then the space $S_{\vec\tau}[KB]$ has two maximum-entropy points, and by symmetry we obtain $\Pr^{\vec\tau}_\infty(P(c) \mid KB) = 0.5$. So, by appropriately choosing a sequence of tolerance vectors converging to $\vec 0$, we can make the asymptotic value of this fraction either $0.3$, $0.5$, or $0.7$. Thus $\Pr_\infty(P(c) \mid KB)$ does not exist.

It is not disjunctions per se that cause the problem here: if we consider instead the knowledge base $KB' = (\|P(x)\|_x \precsim_1 0.3) \lor (\|P(x)\|_x \succsim_2 0.6)$, then there is no difficulty. There is a unique maximum-entropy point of $S_{\vec 0}[KB']$, namely $(0.6, 0.4)$, and the asymptotic probability $\Pr_\infty(P(c) \mid KB') = 0.6$, as we would want.⁷

In light of this example (and many similar ones we can construct), we continue to assume that there is a single maximum-entropy point. As we argued earlier, we expect this to be true in typical practical applications, so the restriction does not seem very serious. We now turn our attention to the requirement that $F^{[\psi]}(\vec v) > 0$.
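Returning to Example 4.16 for a moment: the entropy maxima of the two spaces at $\vec\tau = \vec 0$ can be located by direct search over $u_1$ (a sketch of ours, not part of the paper):

```python
import math

def H(u1):
    """Entropy of the pair (u1, 1 - u1), with 0*ln 0 = 0."""
    return -sum(x * math.log(x) for x in (u1, 1 - u1) if x > 0)

grid = [i / 10000 for i in range(10001)]

# S_0[KB]: u1 <= 0.3 or u1 >= 0.7 gives two symmetric maxima.
feas = [u for u in grid if u <= 0.3 or u >= 0.7]
best = max(H(u) for u in feas)
maxima = [u for u in feas if abs(H(u) - best) < 1e-12]
print(maxima)  # [0.3, 0.7]

# S_0[KB']: u1 <= 0.3 or u1 >= 0.6 gives a unique maximum, at 0.6.
feas2 = [u for u in grid if u <= 0.3 or u >= 0.6]
best2 = max(H(u) for u in feas2)
maxima2 = [u for u in feas2 if abs(H(u) - best2) < 1e-12]
print(maxima2)  # [0.6]
```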
As we have already observed, this seems to be an obvious restriction to make, considering that the function $F^{[\varphi \mid \psi]}(\vec v)$ is not defined otherwise. However, this difficulty is actually a manifestation of a much deeper problem. As the following example shows, any approach that just uses the maximum-entropy point of $S_{\vec 0}[KB]$ will necessarily fail in some cases where $F^{[\psi]}(\vec v) = 0$.

Example 4.17: Consider the knowledge base

$KB = (\|\mathit{Penguin}(x)\|_x \approx_1 0) \land (\|\mathit{Fly}(x) \mid \mathit{Penguin}(x)\|_x \approx_2 0) \land \mathit{Penguin}(\mathit{Tweety})$.

(Footnote 7: We remark that it is also possible to construct examples of multiple maximum-entropy points by using quadratic constraints rather than disjunctions.)

Suppose we want to compute $\Pr_\infty(\mathit{Fly}(\mathit{Tweety}) \mid KB)$. We can easily conclude from Theorem 4.1 that this degree of belief is $0$, as we would expect. However, we cannot reach this conclusion using Theorem 4.11 or anything like it. For consider the maximum-entropy point $\vec v$ of $S_{\vec 0}[KB]$. The coordinates $v_1$, corresponding to $\mathit{Fly} \land \mathit{Penguin}$, and $v_2$, corresponding to $\lnot\mathit{Fly} \land \mathit{Penguin}$, are both $0$. Hence $F^{[\mathit{Penguin}]}(\vec v) = 0$, so Theorem 4.11 does not apply.

But, as we said, the problem is more fundamental. The information we need (that the proportion of flying penguins is zero) is simply not present if all we know is the maximum-entropy point $\vec v$. We can obtain the same space $S_{\vec 0}[KB]$ (and thus the same maximum-entropy point) from quite different knowledge bases. In particular, consider $KB'$, which simply asserts $(\|\mathit{Penguin}(x)\|_x \approx_1 0) \land \mathit{Penguin}(\mathit{Tweety})$. This new knowledge base tells us nothing whatsoever about the fraction of flying penguins, and in fact it is easy to show that $\Pr_\infty(\mathit{Fly}(\mathit{Tweety}) \mid KB') = 0.5$. But of course it is impossible to distinguish this case from the previous one just by looking at $\vec v$.
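A small computation (ours) makes the point concrete: at $\vec\tau = \vec 0$, both KB and $KB'$ constrain only the total $\mathit{Penguin}$ mass to be $0$, so they induce the same space and the same maximum-entropy point, at which $F^{[\mathit{Penguin}]}$ vanishes:

```python
import math

def H(u):
    """H(u) = -sum u_i ln u_i, with 0*ln 0 = 0."""
    return -sum(x * math.log(x) for x in u if x > 0)

# Atoms over {Penguin, Fly}: P&F, P&~F, ~P&F, ~P&~F.  Both KB (which also
# fixes the flying-penguin proportion at 0) and KB' reduce, at tau = 0,
# to the single constraint u1 + u2 = 0, so they share one max-entropy point:
v = (0.0, 0.0, 0.5, 0.5)
for i in range(1, 100):        # the free mass splits evenly at the maximum
    a = i / 100
    assert H((0.0, 0.0, a, 1 - a)) <= H(v) + 1e-12

F_penguin = v[0] + v[1]
print(F_penguin)  # 0.0, so F[Fly | Penguin](v) is undefined and Theorem 4.11 cannot apply
```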
It follows that no result in the spirit of Theorem 4.11 (which just uses the value of $\vec v$) can be comprehensive. The example shows that the philosophy behind Theorem 4.11 cannot be extended very far, if at all: it is inevitable that there will be problems when $F^{[\psi]}(\vec v) = 0$. But it is natural to ask whether there is a different approach altogether in which this restriction can be relaxed. That is, is it possible to construct a technique for computing degrees of belief in those cases where $F^{[\psi]} = 0$? As we mentioned in Section 4.1, we might hope to do this by computing $\Pr^{\vec\tau}_\infty(\varphi \mid KB)$ as a function of $\vec\tau$ and then taking the limit as $\vec\tau$ goes to $\vec 0$. In general, this seems very hard. But, interestingly, the computational technique of (Goldszmidt et al., 1990) does use this type of parametric analysis, demonstrating that things might not be so bad in various restricted cases. Another source of hope is to remember that maximum entropy is, for us, merely one tool for computing random-worlds degrees of belief. There may be other approaches that bypass entropy entirely. In particular, some of the theorems we give in (Bacchus et al., 1994) can be seen as doing this; those theorems will often apply even if $F^{[\psi]} = 0$.

Another assumption made throughout Section 4.2 is that the knowledge base has a special form, namely $\psi(c) \land KB'$, where $\psi$ is essentially propositional and $KB'$ does not contain any occurrences of $c$. The more general theorem we state later relaxes this somewhat, as follows.

Definition 4.18: A knowledge base KB is said to be separable with respect to a query $\varphi$ if it has the form $\psi \land KB'$, where $\psi$ contains neither quantifiers nor proportions, and $KB'$ contains none of the constant symbols appearing in $\varphi$ or in $\psi$.⁸

It should be clear that if a query $\varphi(c)$ is simple for KB (as assumed in the previous subsection), then the separability condition is satisfied.
As the following example shows, if we do not assume separability, we can easily run into nonrobust behavior.

Example 4.19: Consider the following knowledge base KB over the vocabulary $\Phi = \{P, c\}$:

$(\|P(x)\|_x \approx_1 0.3 \land P(c)) \lor (\|P(x)\|_x \approx_2 0.3 \land \lnot P(c))$.

(Footnote 8: Clearly, since our approach is semantic, it also suffices if the knowledge base is equivalent to one of this form.)

KB is not separable with respect to the query $P(c)$. The space $S_{\vec 0}[KB]$ consists of a unique point $(0.3, 0.7)$, which is also the maximum-entropy point. Both disjuncts of KB are consistent with the maximum-entropy point, so we might expect that the presence of the conjuncts $P(c)$ and $\lnot P(c)$ in the disjuncts would not affect the degree of belief. That is, if it were possible to ignore or discount the role of the tolerances, we would expect $\Pr_\infty(P(c) \mid KB) = 0.3$. However, this is not the case. Consider the behavior of $\Pr^{\vec\tau}_\infty(P(c) \mid KB)$ for $\vec\tau > \vec 0$. If $\tau_1 > \tau_2$, then the maximum-entropy point of $S_{\vec\tau}[KB]$ is $(0.3 + \tau_1, 0.7 - \tau_1)$. Now, consider some $\epsilon > 0$ sufficiently small so that $\tau_2 + \epsilon < \tau_1$. By Corollary 3.14, we deduce that $\Pr^{\vec\tau}_\infty(\|P(x)\|_x > 0.3 + \tau_2 \mid KB) = 1$. Therefore, by Theorem 3.16, $\Pr^{\vec\tau}_\infty(P(c) \mid KB) = \Pr^{\vec\tau}_\infty(P(c) \mid KB \land (\|P(x)\|_x > 0.3 + \tau_2))$ (assuming the limit exists). But since the newly added expression is inconsistent with the second disjunct, we obtain that $\Pr^{\vec\tau}_\infty(P(c) \mid KB) = \Pr^{\vec\tau}_\infty(P(c) \mid P(c) \land (\|P(x)\|_x \approx_1 0.3)) = 1$, not $0.3$. On the other hand, if $\tau_1 < \tau_2$, we get the symmetric behavior, where $\Pr^{\vec\tau}_\infty(P(c) \mid KB) = 0$. Only if $\tau_1 = \tau_2$ do we get the expected value of $0.3$ for $\Pr^{\vec\tau}_\infty(P(c) \mid KB)$. Clearly, by appropriately choosing a sequence of tolerance vectors converging to $\vec 0$, we can make the asymptotic value of this fraction any of $0$, $0.3$, or $1$, or not exist at all.
Again, Pr_∞(P(c)|KB) is not robust.

We now turn our attention to restrictions on the query. In Section 4.2, we restricted to queries of the form φ(c), where φ(x) is essentially propositional. Although we intend to ease this restriction, we do not intend to allow queries that involve statistical information. The following example illustrates the difficulties.

Example 4.20: Consider the knowledge base KB = ‖P(x)‖_x ≈₁ 0.3 and the query φ = ‖P(x)‖_x ≈₂ 0.3. It is easy to see that the unique maximum-entropy point of S^τ⃗[KB] is (0.3 + τ₁, 0.7 − τ₁). First suppose τ₂ < τ₁. From Corollary 3.14, it follows that Pr^τ⃗_∞(‖P(x)‖_x > 0.3 + τ₂ | KB) = 1. Therefore, by Theorem 3.16, Pr^τ⃗_∞(φ|KB) = Pr^τ⃗_∞(φ | KB ∧ (‖P(x)‖_x > 0.3 + τ₂)) (assuming the limit exists). The latter expression is clearly 0. On the other hand, if τ₁ < τ₂, then KB[τ⃗] ⊨ φ[τ⃗], so that Pr^τ⃗_∞(φ|KB) = 1. Thus, the limiting behavior of Pr^τ⃗_∞(φ|KB) depends on how τ⃗ goes to 0⃗, so that Pr_∞(φ|KB) is nonrobust.

The real problem here is the semantics of proportion expressions in queries. While the utility of the ≈ connective for expressing statistical information in the knowledge base should be fairly uncontroversial, its role in conclusions we might draw, such as φ in Example 4.20, is much less clear. The formal semantics we have defined requires that we consider all possible tolerances for a proportion expression in φ, so it is not surprising that nonrobustness is the usual result. One might argue that the tolerances in queries should be allowed to depend more closely on the tolerances of expressions in the knowledge base. It is possible to formalize this intuition, as is done in (Koller & Halpern, 1992), to give an alternative semantics for dealing with proportion expressions in queries that often gives more reasonable behavior.
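The dependence on the relative sizes of τ₁ and τ₂ in Example 4.20 can likewise be checked by counting. This is our own illustrative sketch, with the same hard-interval reading of ≈_i as before and a function name of our choosing:

```python
from math import comb

def pr_query_given_kb(N, tau1, tau2, p=0.3):
    """Fraction of worlds whose proportion of P is within tau1 of p (the KB)
    that are also within tau2 of p (the query)."""
    kb_ks = [k for k in range(N + 1) if abs(k / N - p) <= tau1]
    query_ks = [k for k in kb_ks if abs(k / N - p) <= tau2]
    den = sum(comb(N, k) for k in kb_ks)
    num = sum(comb(N, k) for k in query_ks)
    return num / den

# tau2 < tau1: almost all KB-worlds have proportion above p + tau2, so the
# query almost never holds; tau1 < tau2: every KB-world satisfies the query.
```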
Consideration of this alternative semantics would lead us too far afield here; rather, we focus for the rest of the section on first-order queries. In fact, our goal is to allow arbitrary first-order queries, even those that involve predicates of arbitrary arity and equality (although we still need to restrict the knowledge base to the unary language ℒ₁^≈). However, as the following example shows, quantifiers too can cause problems.

Example 4.21: Let Φ = {P, c} and consider KB₁ = ∀x ¬P(x), KB₂ = ‖P(x)‖_x ≈₁ 0, and φ = ∃x P(x). It is easy to see that S^0⃗[KB₁] = S^0⃗[KB₂] = {(0, 1)}, and therefore the unique maximum-entropy point in both is v⃗ = (0, 1). However, Pr_∞(φ|KB₁) is clearly 0, whereas Pr_∞(φ|KB₂) is actually 1. To see the latter fact, observe that the vast majority of models of KB₂ around v⃗ actually satisfy ∃x P(x). There is actually only a single world associated with (0, 1) at which ∃x P(x) is false. This example is related to Example 4.17, because it illustrates another case in which S^0⃗[KB] cannot suffice to determine degrees of belief. In the case of the knowledge base KB₁, the maximum-entropy point (0, 1) is quite misleading about the nature of nearby worlds. We must avoid this sort of "discontinuity" when finding the degree of belief of a formula that involves first-order quantifiers. The notion of stability defined below is intended to deal with this problem. To define it, we first need the following notion of a size description.

Definition 4.22: A size description (over P) is a conjunction of K formulas: for each atom A_j over P, it includes exactly one of ∃x A_j(x) and ¬∃x A_j(x).
For u⃗ ∈ Δ_K, the size description associated with u⃗, written σ(u⃗), is the size description that includes ¬∃x A_i(x) if u_i = 0 and ∃x A_i(x) if u_i > 0.

The problems that we want to avoid occur when there is a maximum-entropy point v⃗ with size description σ(v⃗) such that, in a neighborhood of v⃗, most of the worlds satisfying KB are associated with other size descriptions. Intuitively, the problem with this is that the coordinates of v⃗ alone give us misleading information about the nature of worlds near v⃗, and so about degrees of belief.[9] We give a sufficient condition which can be used to avoid this problem in the context of our theorems. This condition is effective and uses machinery (in particular, the ability to find solution spaces) that is needed to use the maximum-entropy approach in any case.

Definition 4.23: Let v⃗ be a maximum-entropy point of S^τ⃗[KB]. We say that v⃗ is safe (with respect to KB and τ⃗) if v⃗ is not contained in S^τ⃗[KB ∧ ¬σ(v⃗)]. We say that KB and τ⃗ are stable for σ* if for every maximum-entropy point v⃗ ∈ S^τ⃗[KB] we have that σ(v⃗) = σ* and that v⃗ is safe with respect to KB and τ⃗.

The next result is the key property of stability that we need.

Theorem 4.24: If KB and τ⃗ > 0⃗ are stable for σ*, then Pr^τ⃗_∞(σ*|KB) = 1.

Our theorems will use the assumption that there exists some σ* such that, for all sufficiently small τ⃗, KB and τ⃗ are stable for σ*. We note that this does not imply that σ* is necessarily the size description associated with the maximum-entropy point(s) of S^0⃗[KB].

Example 4.25: Consider the knowledge base KB₂ in Example 4.21, and recall that v⃗ = (0, 1) is the maximum-entropy point of S^0⃗[KB₂]. The size description σ(v⃗) is ¬∃x A₁(x) ∧ ∃x A₂(x). However, the maximum-entropy point of S^τ⃗[KB₂] for τ⃗ > 0⃗ is actually (τ₁, 1 − τ₁), so that the appropriate σ* for such a τ⃗ is ∃x A₁(x) ∧ ∃x A₂(x).
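Example 4.25's claim that the maximum-entropy point of S^τ⃗[KB₂] sits on the boundary u₁ = τ₁ follows because H(u₁, 1 − u₁) is strictly increasing for u₁ ∈ [0, 1/2]. A quick numerical check, as a sketch of our own in which the constraint set is modeled simply as u₁ ∈ [0, τ₁]:

```python
from math import log

def entropy(u):
    """H(u) = -sum u_i ln u_i, with the convention 0 ln 0 = 0."""
    return -sum(x * log(x) for x in u if x > 0)

tau1 = 0.1
grid = [i / 1000 for i in range(int(tau1 * 1000) + 1)]   # u1 ranges over [0, tau1]
best_u1 = max(grid, key=lambda u1: entropy((u1, 1 - u1)))
# the constrained maximum is attained at the boundary, u1 = tau1
```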
[Footnote 9: We actually conjecture that problems of this sort cannot arise in the context of a maximum-entropy point of S^τ⃗[KB] for τ⃗ > 0⃗. More precisely, for sufficiently small τ⃗ and a maximum-entropy point v⃗ of S^τ⃗[KB] with KB ∈ ℒ₁^≈, we conjecture that Pr^τ⃗_∞[O](σ(v⃗)|KB) = 1, where O is an open set that contains v⃗ but no other maximum-entropy point of S^τ⃗[KB]. If this is indeed the case, then the machinery of stability that we are about to introduce is unnecessary, since it holds in all cases that we need it. However, we have been unable to prove this.]

As we now show, the restrictions outlined above and in Section 4.1 suffice for our next result on computing degrees of belief. In order to state this result, we need one additional concept. Recall that in Section 4.2 we expressed an essentially propositional formula φ(x) as a disjunction of atoms. Since we wish to also consider formulas φ using more than one constant and non-unary predicates, we need a richer concept than atoms. This is the motivation behind the definition of complete descriptions.

Definition 4.26: Let Z be some set of variables and constants. A complete description D over Φ and Z is an unquantified conjunction of formulas such that:

* For every predicate R ∈ Φ ∪ {=} of arity r and for every z_{i₁}, …, z_{i_r} ∈ Z, D contains exactly one of R(z_{i₁}, …, z_{i_r}) or ¬R(z_{i₁}, …, z_{i_r}) as a conjunct.
* D is consistent.[10]

Complete descriptions simply extend the role of atoms in the context of essentially propositional formulas to the more general setting. As in the case of atoms, if we fix some arbitrary ordering of the conjuncts in a complete description, then complete descriptions are mutually exclusive and exhaustive. Clearly, a formula ξ whose free variables and constants are contained in Z, and which is quantifier- and proportion-free, is equivalent to some disjunction of complete descriptions over Z.
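Definition 4.26 can be made concrete by enumerating complete descriptions for a tiny vocabulary. The sketch below is our own illustration: it takes Φ = {P, R} with P unary and R binary, Z = {c₁, c₂}, and filters out the truth-value assignments ruled inconsistent by equality, since if c₁ = c₂ holds, the two constants denote one element and must agree on every property.

```python
from itertools import product

# Truth values for: c1=c2, P(c1), P(c2), R(c1,c1), R(c1,c2), R(c2,c1), R(c2,c2).
# (The conjuncts c1=c1 and c2=c2 are forced to be true and are omitted.)
def consistent(eq, p1, p2, r11, r12, r21, r22):
    if not eq:
        return True                    # distinct elements: no extra constraints
    # c1 = c2: all properties must agree under substitution of equals
    return p1 == p2 and r11 == r12 == r21 == r22

descriptions = [bits for bits in product([False, True], repeat=7)
                if consistent(*bits)]
# 64 descriptions with c1 != c2, plus 4 with c1 = c2
```

Since the descriptions are distinct sign patterns over the same conjuncts, mutual exclusivity holds by construction.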
For such a formula ξ, let A(ξ) be a set of complete descriptions over Z such that ξ is equivalent to the disjunction ⋁_{D∈A(ξ)} D, where Z is the set of constants and free variables in ξ. For the purposes of the remaining discussion (except within proofs), we are interested only in complete descriptions over an empty set of variables. For a set of constants Z, we can view a description D over Z as describing the different properties of the constants in Z. In our construction, when considering a KB of the form ψ ∧ KB′ which is separable with respect to a query φ, we define the set Z to contain precisely those constants in φ and in ψ. In particular, this means that KB′ will mention no constant in Z.

A complete description D over a set of constants Z can be decomposed into three parts: the unary part D¹, which consists of those conjuncts of D that involve unary predicates (and thus determines an atom for each of the constant symbols); the equality part D⁼, which consists of those conjuncts of D involving equality (and thus determines which of the constants are equal to each other); and the non-unary part D^{>1}, which consists of those conjuncts of D involving non-unary predicates (and thus determines the non-unary properties, other than equality, of the constants).

As we suggested, the unary part of such a complete description D extends the notion of "atom" to the case of multiple constants. For this purpose, we also extend F[A] (for an atom A) and define F[D] for a description D. Intuitively, we are treating each of the individuals as independent, so that the probability that constant c₁ satisfies atom A_{j₁} and that constant c₂ satisfies A_{j₂} is just the product of the probability that c₁ satisfies A_{j₁} and the probability that c₂ satisfies A_{j₂}.

Definition 4.27: For a complete description D without variables whose unary part is equivalent to A_{j₁}(c₁) ∧ … ∧ A_{j_m}(c_m) (for distinct constants c₁, …, c_m) and for a point
[Footnote 10: Inconsistency is possible because of the use of equality. For example, if D includes z₁ = z₂ as well as both R(z₁, z₃) and ¬R(z₂, z₃), it is inconsistent.]

u⃗ ∈ Δ_K, we define

F[D](u⃗) = ∏_{ℓ=1}^m u_{j_ℓ}.

Note that F[D] depends only on D¹, the unary part of D.

As we mentioned, we can extend our approach to deal with formulas φ that also use non-unary predicate symbols. Our computational procedure for such formulas uses the maximum-entropy approach described above combined with the techniques of (Grove et al., 1993b). These latter were used in (Grove et al., 1993b) to compute asymptotic conditional probabilities when conditioning on a first-order knowledge base KB_fo. The basic idea in that case is as follows: To compute Pr_∞(φ|KB_fo), we examine the behavior of φ in finite models of KB_fo. We partition the models of KB_fo into a finite collection of classes such that φ behaves uniformly in each individual class. By this we mean that almost all worlds in the class satisfy φ or almost none do; i.e., there is a 0-1 law for the asymptotic probability of φ when we restrict attention to models in a single class. In order to compute Pr_∞(φ|KB_fo), we therefore identify the classes, compute the relative weight of each class (which is required because the classes are not necessarily of equal relative size), and then decide for each class whether the asymptotic probability of φ is zero or one. It turns out that much the same ideas continue to work in this framework. In this case, the classes are defined using complete descriptions and the appropriate size description σ*. The main difference is that, rather than examining all worlds consistent with the knowledge base, we now concentrate on those worlds in the vicinity of the maximum-entropy points, as outlined in the previous section. It turns out that the restriction to these worlds affects very few aspects of this computational procedure.
In fact, the only difference is in computing the relative weight of the different classes. This last step can be done using maximum entropy, using the tools described in Section 4.2.

Theorem 4.28: Let φ be a formula in ℒ^≈ and let KB = ψ ∧ KB′ be an essentially positive knowledge base in ℒ₁^≈ which is separable with respect to φ. Let Z be the set of constants appearing in φ or in ψ (so that KB′ contains none of the constants in Z), and let χ_≠ be the formula ⋀ c ≠ c′, where the conjunction ranges over the pairs of distinct constant symbols c, c′ ∈ Z. Assume that there exists a size description σ* such that, for all τ⃗ > 0⃗, KB and τ⃗ are stable for σ*, and that the space S^0⃗[KB] has a unique maximum-entropy point v⃗. Then

Pr_∞(φ|KB) = [ Σ_{D∈A(ψ∧χ_≠)} Pr_∞(φ | σ* ∧ D) · F[D](v⃗) ] / [ Σ_{D∈A(ψ∧χ_≠)} F[D](v⃗) ]

if the denominator is positive.

Since both φ and σ* ∧ D are first-order formulas, and σ* ∧ D is precisely of the form required in (Grove et al., 1993b), Pr_∞(φ | σ* ∧ D) is either 0 or 1, and we can use the algorithm of (Grove et al., 1993b) to compute this limit, in the time bounds outlined there. One corollary of the above is that the formula χ_≠ holds with probability 1 given any knowledge base KB of the form we are interested in. This corresponds to a default assumption of unique names, a property often considered to be desirable in inductive reasoning systems.

While this theorem does represent a significant generalization of Theorem 4.11, it still has numerous restrictions. There is no question that some of these can be loosened to some extent, although we have not been able to find a clean set of conditions significantly more general than the ones that we have stated. We leave it as an open problem whether such a set of conditions exists. Of course, the most significant restriction we have made is that of allowing only unary predicates in the KB. This issue is the subject of the next section.
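Before moving on, the combination in Theorem 4.28 above can be sketched as a small program. This is our own schematic illustration: the per-class 0-1 limits, which the algorithm of (Grove et al., 1993b) would supply, are simply passed in as data, and each class is represented by the atom indices its description assigns to the constants (so that F[D] is the product of the corresponding coordinates of v⃗).

```python
def combined_degree_of_belief(classes, v):
    """classes: list of (atom_indices, limit) pairs, one per description D in
    A(psi ∧ chi_neq); limit is Pr_inf(phi | sigma* ∧ D), either 0 or 1.
    v: the unique maximum-entropy point of the solution space."""
    def F(atom_indices):
        prod = 1.0
        for j in atom_indices:
            prod *= v[j]               # independence across distinct constants
        return prod
    den = sum(F(atoms) for atoms, _ in classes)
    if den <= 0:
        raise ValueError("the theorem requires a positive denominator")
    return sum(limit * F(atoms) for atoms, limit in classes) / den

# Two classes over v = (0.3, 0.7), phi holding in the limit only for the first:
# weights 0.3*0.7 = 0.21 and 0.7*0.7 = 0.49 give degree of belief 0.21/0.70 = 0.3.
```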
5. Beyond Unary Predicates

The random-worlds method makes complete sense for the full language ℒ^≈ (and, indeed, for even richer languages). On the other hand, our application of maximum entropy is limited to unary knowledge bases. Is this restriction essential? While we do not have a theorem to this effect (indeed, it is not even clear what the wording of such a theorem would be), we conjecture that it is. Certainly none of the techniques we have used in this paper can be generalized significantly. One difficulty is that, once we have a binary or higher-arity predicate, we see no analogue to the notion of atoms and no canonical form theorem. In Section 3.2 and in the proof of Theorem 3.5, we discuss why it becomes impossible to get rid of nested quantifiers and proportions when we have non-unary predicates. Even considering matters on a more intuitive level, the problems seem formidable. In a unary language, atoms are useful because they are simple descriptions that summarize everything that might be known about a domain element in a model. But consider a language with a single binary predicate R(x, y). Worlds over this language include all finite graphs (where we think of R(x, y) as holding if there is an edge from x to y). In this language, there are infinitely many properties that may be true or false of a domain element. For example, the assertions "the node x has m neighbors" are expressible in the language for each m. Thus, in order to partition the domain elements according to the properties they satisfy, we would need to define infinitely many partitions. Furthermore, it can be shown that "typically" (i.e., in almost all graphs of sufficiently great size) each node satisfies a different set of first-order properties. Thus, in most graphs, all the nodes are "different," so a partition of domain elements into a finite number of "atoms" makes little sense.
It is very hard to see how the basic proof strategy we have used, of summarizing a model by listing the number of elements with various properties, can possibly be useful here. The difficulty of finding an analogue to entropy in the presence of higher-arity predicates is supported by results from (Grove et al., 1993a). In this paper we have shown that maximum entropy can be a useful tool for computing degrees of belief in certain cases, if the KB involves only unary predicates. In (Grove et al., 1993a) we show that there can be no general computational technique for computing degrees of belief once we have non-unary predicate symbols in the KB: the problem of finding degrees of belief in this case is highly undecidable. This result was proven without statistical assertions in the language, and in fact holds for quite weak sublanguages of first-order logic (for instance, for a language without equality and with only depth-two quantifier nesting). So even if there is some generalized version of maximum entropy, it will either be extremely restricted in application or will be useless as a computational tool.

6. Conclusion

This paper has had two major thrusts. The first is to establish a connection between maximum entropy and the random-worlds approach for a significant fragment of our language, one far richer than that considered by Paris and Vencovska (1989) or Shastri (1989). The second is to suggest that such a result is unlikely to obtain for the full language. The fact that we have a connection between maximum entropy and random worlds is significant. For one thing, it allows us to utilize all the tools that have been developed for computing maximum entropy efficiently (see (Goldman, 1987) and the further references therein), and may thus lead to efficient algorithms for computing degrees of belief for a large class of knowledge bases. In addition, maximum entropy is known to have many attractive properties (Jaynes, 1978).
Our result shows that these properties are shared by the random-worlds approach in the domain where the two approaches agree. Indeed, as shown in (Bacchus et al., 1994), the random-worlds approach has many of these properties for the full (non-unary) language. On the other hand, a number of properties of maximum entropy, such as its dependence on the choice of language and its inability to handle causal reasoning appropriately, have been severely criticized (Pearl, 1988; Goldszmidt et al., 1990). Not surprisingly, these criticisms apply to random worlds as well. A discussion of these criticisms, and of whether they really should be viewed as shortcomings of the random-worlds method, is beyond the scope of this paper; the interested reader should consult (Bacchus et al., 1994, Section 7) for a more thorough discussion of these issues and additional references.

We believe that our observations regarding the limits of the connection between the random-worlds method and maximum entropy are also significant. The question of how widely maximum entropy applies is quite important. Maximum entropy has been gaining prominence as a means of dealing with uncertainty both in AI and in other areas. However, the difficulties of using the method once we move to non-unary predicates seem not to have been fully appreciated. In retrospect, this is not that hard to explain: in almost all applications where maximum entropy has been used (and where its application can be best justified in terms of the random-worlds method), the knowledge base is described in terms of unary predicates (or, equivalently, unary functions with a finite range). For example, in physics applications we are interested in such predicates as quantum state (see (Denbigh & Denbigh, 1985)). Similarly, AI applications and expert systems typically use only unary predicates, such as symptoms and diseases (Cheeseman, 1983). We suspect that this is not an accident, and that deep problems will arise in more general cases.
This poses a challenge to proponents of maximum entropy since, even if one accepts the maximum-entropy principle, the discussion above suggests that it may simply be inapplicable in a large class of interesting examples.

Appendix A. Proofs for Section 3.2

Theorem 3.5: Every formula in ℒ₁^= is equivalent to a formula in canonical form. Moreover, there is an effective procedure that, given a formula ξ ∈ ℒ₁^=, constructs an equivalent formula ξ̂ in canonical form.

Proof: We show how to effectively transform ξ ∈ ℒ₁^= into an equivalent formula in canonical form. We first rename variables if necessary, so that all variables used in ξ are distinct (i.e., no two quantifiers, including proportion expressions, ever bind the same variable symbol). We next transform ξ into an equivalent flat formula ξ_f ∈ ℒ₁^≈, where a flat formula is one in which no quantifier (including proportion quantifiers) has within its scope a constant or variable other than the variable(s) the quantifier itself binds. (Note that in this transformation we do not require that ξ be closed. Also, observe that flatness implies that there are no nested quantifiers.) We define the transformation by induction on the structure of ξ. There are three easy steps:

* If ξ is an unquantified formula, then ξ_f = ξ.
* (ξ′ ∨ ξ″)_f = ξ′_f ∨ ξ″_f.
* (¬ξ′)_f = ¬(ξ′_f).

All that remains is to consider quantified formulas of the form ∃x ξ′, ‖ξ′‖_x⃗, or ‖ξ′|ξ″‖_x⃗. It turns out that the same transformation works in all three cases. We illustrate the transformation by looking at the case where ξ is of the form ‖ξ′‖_x⃗. By the inductive hypothesis, we can assume that ξ′ is flat. For the purposes of this proof, we define a basic formula to be an atomic formula (i.e., one of the form P(z)), a proportion formula, or a quantified formula (i.e., one of the form ∃x χ).
Let χ₁, …, χ_k be all the basic subformulas of ξ′ that do not mention any variable in x⃗. Let z be a variable or constant symbol not in x⃗ that is mentioned in ξ′. Clearly z must occur in some basic subformula of ξ′, say χ′. By the inductive hypothesis, it is easy to see that χ′ cannot mention any variable in x⃗ and so, by construction, it is in {χ₁, …, χ_k}. In other words, not only do χ₁, …, χ_k not mention any variable in x⃗, but they also contain all occurrences of the other variables and constants. (Notice that this argument fails if the language contains any higher-arity predicates, including equality. For then ξ′ might include subformulas of the form R(x, y) or x = y, which can mix variables outside x⃗ with those in x⃗.) Now, let B₁, …, B_{2^k} be all the "atoms" over χ₁, …, χ_k. That is, we consider all formulas χ′₁ ∧ … ∧ χ′_k where χ′_i is either χ_i or ¬χ_i. Now consider the disjunction:

⋁_{i=1}^{2^k} (B_i ∧ ‖ξ′‖_x⃗).

This is surely equivalent to ‖ξ′‖_x⃗, because some B_i must be true. However, if we assume that a particular B_i is true, we can simplify ‖ξ′‖_x⃗ by replacing each χ_i subformula by true or false, according to B_i. (Note that this is allowed only because the χ_i do not mention any variable in x⃗.) The result is that we can simplify each disjunct B_i ∧ ‖ξ′‖_x⃗ considerably. In fact, because of our previous observation about χ₁, …, χ_k, there will be no constants or variables outside x⃗ left within the proportion quantifier. This completes this step of the induction. Since the other quantifiers can be treated similarly, this proves the flatness result.

It now remains to show how a flat formula can be transformed into canonical form. Suppose ξ ∈ ℒ₁^≈ is flat. Let ξ* ∈ ℒ₁^= be the formula equivalent to ξ obtained by using the translation of Section 2.1.
Every proportion comparison in ξ* is of the form t ≤ t′ε_i, where t and t′ are polynomials over flat unconditional proportions. In fact, t′ is simply a product of flat unconditional proportions (where the empty product is taken to be 1). Note also that since we cleared away conditional proportions by multiplying by t′, if t′ = 0 then so is t, and so the formula t ≤ t′ε_i is automatically true. We can therefore replace the comparison by (t′ = 0) ∨ (t ≤ t′ε_i ∧ t′ > 0). Similarly, we can replace a negated comparison by an expression of the form ¬(t ≤ t′ε_i) ∧ t′ > 0.

The next step is to rewrite all the flat unconditional proportions in terms of atomic proportions. In any such proportion ‖ξ′‖_x⃗, the formula ξ′ is a Boolean combination of formulas P(x_i) for predicates P ∈ P and variables x_i ∈ x⃗. Thus, the formula ξ′ is equivalent to a disjunction ⋁_j (A_{j₁}(x_{i₁}) ∧ … ∧ A_{j_m}(x_{i_m})), where each A_{j_i} is an atom over P and x⃗ = (x_{i₁}, …, x_{i_m}). These disjuncts are mutually exclusive, and the semantics treats distinct variables as being independent, so

‖ξ′‖_x⃗ = Σ_j ∏_{i=1}^m ‖A_{j_i}(x)‖_x.

We perform this replacement for each proportion expression. Furthermore, any term t′ in an expression of the form t ≤ t′ε_i will be a product of such expressions, and so will be positive.

Next, we must put all pure first-order formulas into the right form. We first rewrite ξ to push all negations inwards as far as possible, so that only atomic subformulas and existential formulas are negated. Next, note that since ξ is flat, each existential subformula must have the form ∃x ξ′, where ξ′ is a quantifier-free formula which mentions no constants and no variable other than x. Hence, ξ′ is a Boolean combination of formulas P(x) for predicates P ∈ P. Again, the formula ξ′ is equivalent to a disjunction of atoms of the form ⋁_{A∈A(ξ′)} A(x), so ∃x ξ′ is equivalent to ⋁_{A∈A(ξ′)} ∃x A(x). We replace ∃x ξ′ by this expression.
Finally, we must deal with formulas of the form P(c) or ¬P(c) for P ∈ P. This is easy: we can again replace a formula ξ of the form P(c) or ¬P(c) by the disjunction ⋁_{A∈A(ξ)} A(c). The penultimate step is to convert ξ into disjunctive normal form; this essentially brings things into canonical form. Note that since we dealt with formulas of the form ¬P(c) in the previous step, we do not have to deal with conjuncts of the form ¬A_i(c). The final step is to check that we do not have A_i(c) together with either ¬∃x A_i(x) or A_j(c) for some j ≠ i as conjuncts of some disjunct. If we do, we simply remove that disjunct.

Appendix B. Proofs for Section 3.3

Lemma 3.11: There exist a function h: ℕ → ℕ and two strictly positive polynomial functions f, g: ℕ → ℝ such that, for KB ∈ ℒ₁^≈ and u⃗ ∈ Δ_K, if #worlds^τ⃗_N[u⃗](KB) ≠ 0, then

(h(N)/f(N)) · e^{N·H(u⃗)} ≤ #worlds^τ⃗_N[u⃗](KB) ≤ h(N) · g(N) · e^{N·H(u⃗)}.

Proof: To choose a world W ∈ 𝒲_N satisfying KB such that π(W) = u⃗, we must partition the domain among the atoms according to the proportions in u⃗, and then choose an assignment for the constants in the language, subject to the constraints imposed by KB. Finally, even though KB mentions only unary predicates, if there are any non-unary predicates in the vocabulary we must choose a denotation for them. Suppose u⃗ = (u₁, …, u_K), and let N_i = u_i·N for i = 1, …, K. The number of partitions of the domain into atoms is the multinomial coefficient N!/(N₁!⋯N_K!); each such partition completely determines the denotation of the unary predicates. We must also specify the denotations of the constant symbols. There are at most N^|C| ways of choosing these. On the other hand, we know there is at least one model (W, τ⃗) of KB such that π(W) = u⃗, so there is at least one choice.
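The estimate that drives Lemma 3.11 is that the logarithm of the multinomial coefficient is N·H(u⃗) up to lower-order terms. Before completing the count, this is easy to confirm numerically; the check below is our own illustration, using the log-gamma function to avoid huge integers:

```python
from math import lgamma, log

def log_multinomial(counts):
    """log of N!/(N_1! ... N_K!), computed via the log-gamma function."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

def entropy(u):
    """H(u) = -sum u_i ln u_i, with the convention 0 ln 0 = 0."""
    return -sum(x * log(x) for x in u if x > 0)

N = 20000
counts = [6000, 14000]                     # u = (0.3, 0.7)
per_element = log_multinomial(counts) / N  # close to H(u), up to O(log N / N)
```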
In fact, there is at least one world W′ ∈ 𝒲_N such that (W′, τ⃗) ⊨ KB for each of the N!/(N₁!⋯N_K!) ways of partitioning the elements of the domain (and each such world W′ is isomorphic to W). Finally, we must choose the denotation of the non-unary predicates. However, u⃗ does not constrain this choice and, by assumption, neither does KB. Therefore the number of such choices is some function h(N) which is independent of u⃗. [Footnote 11: It is easy to verify that in fact h(N) = ∏_{R∈Φ−Ψ} 2^{N^{arity(R)}}, where Ψ is the unary fragment of Φ and arity(R) denotes the arity of the predicate symbol R.] We conclude that:

h(N) · N!/(N₁!⋯N_K!) ≤ #worlds^τ⃗_N[u⃗](KB) ≤ h(N) · N^|C| · N!/(N₁!⋯N_K!).

It remains to estimate N!/(N₁! N₂! ⋯ N_K!). To obtain our result, we use Stirling's approximation for the factorials, which says that

m! = √(2πm) · m^m · e^{−m} · (1 + O(1/m)).

It follows that there exist constants L, U > 0 such that

L·√m·m^m·e^{−m} ≤ m! ≤ U·√m·m^m·e^{−m} for all m ≥ 1.

Using these bounds, as well as the fact that N_i ≤ N, we get:

(L/(U^K·N^K)) · (N^N·∏_{i=1}^K e^{N_i})/(e^N·∏_{i=1}^K N_i^{N_i}) ≤ N!/(N₁!⋯N_K!) ≤ (U·√N/L^K) · (N^N·∏_{i=1}^K e^{N_i})/(e^N·∏_{i=1}^K N_i^{N_i}).

Now, consider the expression common to both bounds:

(N^N·∏_{i=1}^K e^{N_i})/(e^N·∏_{i=1}^K N_i^{N_i}) = N^N/∏_{i=1}^K N_i^{N_i} = ∏_{i=1}^K (N/N_i)^{N_i} = ∏_{i=1}^K e^{N_i·ln(N/N_i)} = e^{−N·Σ_{i=1}^K u_i·ln u_i} = e^{N·H(u⃗)}.

We obtain that

h(N) · (L/(U^K·N^K)) · e^{N·H(u⃗)} ≤ #worlds^τ⃗_N[u⃗](KB) ≤ N^|C| · h(N) · (U·√N/L^K) · e^{N·H(u⃗)},

which is the desired result.

We next want to prove Theorem 3.13. To do this, it is useful to have an alternative representation of the solution space S^τ⃗[KB]. Towards this end, we have the following definition.

Definition B.1: Let Π^τ⃗_N[KB] = {π(W) : W ∈ 𝒲_N, (W, τ⃗) ⊨ KB}. Let Π^τ⃗_∞[KB] be the limit of these spaces. Formally,

Π^τ⃗_∞[KB] = {u⃗ : ∃N₀ s.t. ∀N ≥ N₀ ∃u⃗_N ∈ Π^τ⃗_N[KB] s.t.
lim_{N→∞} u⃗_N = u⃗}.

The following theorem establishes a tight connection between S^τ⃗[KB] and Π^τ⃗_∞[KB].

Theorem B.2:
(a) For all N and τ⃗, we have Π^τ⃗_N[KB] ⊆ S^τ⃗[KB].
(b) For all sufficiently small τ⃗, we have Π^τ⃗_∞[KB] = S^τ⃗[KB].

Proof: Part (a) is immediate: if u⃗ ∈ Π^τ⃗_N[KB], then u⃗ = π(W) for some W ∈ 𝒲_N such that (W, τ⃗) ⊨ KB. It is almost immediate from the definitions that π(W) must satisfy Γ(KB[τ⃗]), so π(W) ∈ Sol[Γ(KB[τ⃗])]. The inclusion Π^τ⃗_N[KB] ⊆ S^τ⃗[KB] now follows.

One direction of part (b) follows immediately from part (a). Recall that Π^τ⃗_N[KB] ⊆ S^τ⃗[KB] and that the points in Π^τ⃗_∞[KB] are limits of sequences of points in the spaces Π^τ⃗_N[KB]. Since S^τ⃗[KB] is closed, it follows that Π^τ⃗_∞[KB] ⊆ S^τ⃗[KB]. For the opposite inclusion, the general strategy of the proof is to show the following:

(i) If τ⃗ is sufficiently small, then for all u⃗ ∈ S^τ⃗[KB], there is some sequence of points u⃗_{N₀}, u⃗_{N₀+1}, u⃗_{N₀+2}, u⃗_{N₀+3}, … ⊂ Sol[Γ(KB[τ⃗])] such that, for all N ≥ N₀, the coordinates of u⃗_N are all integer multiples of 1/N and lim_{N→∞} u⃗_N = u⃗.

(ii) If w⃗ ∈ Sol[Γ(KB[τ⃗])] and all its coordinates are integer multiples of 1/N, then w⃗ ∈ Π^τ⃗_N[KB].

This clearly suffices to prove that u⃗ ∈ Π^τ⃗_∞[KB]. We begin with the proof of (ii), which is straightforward. Suppose the point w⃗ = (r₁/N, r₂/N, …, r_K/N) is in Sol[Γ(KB[τ⃗])]. We construct a world W ∈ 𝒲_N such that π(W) = w⃗ as follows. The denotation of atom A₁ is the set of elements {1, …, r₁}, the denotation of atom A₂ is the set {r₁ + 1, …, r₁ + r₂}, and so on. It remains to choose the denotations of the constants (since the denotation of the predicates of arity greater than 1 is irrelevant). Without loss of generality we can assume KB is in canonical form.
(If not, we consider the equivalent canonical-form formula KB̂.) Thus, KB is a disjunction of conjunctions, say ⋁_j ξ_j. Since w⃗ ∈ Sol[Γ(KB[τ⃗])], we must have w⃗ ∈ Sol[Γ(ξ_j[τ⃗])] for some j. We use ξ_j to define the properties of the constants. If ξ_j contains A_i(c) for some atom A_i, then we make c satisfy A_i. Note that, by Definition 3.6, if ξ_j has such a conjunct then w_i > 0. If ξ_j contains no atomic conjunct mentioning the constant c, then we make c satisfy A_i for some arbitrary atom with w_i > 0. It should now be clear that (W, τ⃗) satisfies ξ_j, and so satisfies KB. Note that in this construction it is important that we started with w⃗ in Sol[Γ(KB[τ⃗])], rather than just in its closure S^τ⃗[KB]; otherwise, the point would not necessarily satisfy Γ(KB[τ⃗]).

We now consider condition (i). This is surprisingly difficult to prove; the proof involves techniques from algebraic geometry. Our job would be relatively easy if Sol[Γ(KB[τ⃗])] were an open set. Unfortunately, it is not. On the other hand, it would behave essentially like an open set if we could replace the occurrences of ≤ in Γ(KB[τ⃗]) by <. It turns out that, for our purposes here, this replacement is possible. Let Γ_<(KB[τ⃗]) be the same as Γ(KB[τ⃗]) except that every (unnegated) conjunct of the form t ≤ τ_i t′ is replaced by t < τ_i t′. (Notice that this is essentially the opposite transformation to the one used when defining essential positivity in Definition 4.4.) Finally, let S^τ⃗_<[KB] be the closure of Sol[Γ_<(KB[τ⃗])]. It turns out that, for all sufficiently small τ⃗, S^τ⃗_<[KB] = S^τ⃗[KB]. This result, which we label as Lemma B.5, will be stated and proved later. For now we use the lemma to continue the proof of the main result. Consider some u⃗ ∈ S^τ⃗[KB].
It suffices to show that for all ffi > 0 there exists N 0 such that for all N > N 0 , there exists a point~u N 2 Sol [000 < {KB[~034 ]}] such that all the coordinates of ~u N are integer multiples of 1=N and such that|~u000~ u N | < ffi. {For then we can take smaller and smaller ffi's to create a sequence~u N converging to ~u.} Hence, let ffi > 0. By Lemma B.5, we can find some ~u 0 2 Sol[000 < {KB [~034]}] such that|~u000~ u 0 | < ffi=2. By definition, every conjunct in 000 < {KB[~034 ]} is of the form q 0 { ~w} = 0, q 0 { ~w} > 0, q{ ~w} < 034 i q 0 { ~w}, or q{ ~w} > 034 i q 0 { ~w}, where q 0 is a positive polynomial. Ignore for the moment the constraints of the form q 0 { ~w} = 0, and consider the remaining constraints that ~u 0 satisfies. Theseconstraints all involve strict inequalities, and the functions involved {q and q 0 } are continuous. Thus, there exists some ffi 0 > 0 such that for all ~w for which |~u 0 000 ~w| < ffi 0 , these constraints are also satisfied by ~w. Now consider a conjunct of the form q 0 {~w} = 0 that is satisfied by ~u 0 . Since q 0 is positive, this happens if and only if the following condition holds: for every coordinate w i that actually appears in q 0 , we have u 0 i = 0. In particular, if ~w and ~u 0 have the same coordinates with value 0, then q 0 { ~w} = 0. It follows that for all ~w, if|~u 0 000 ~w| < ffi 0 and ~u 0 and ~w have the same coordinates with value 0, then ~w also satisfies 000 < {KB[~034 ]}. We now construct ~u N that satisfies the requirements. Let i 003 be the index of that component of ~u 0 with the largest value. We define ~u N by considering each of its components u N i , for 1 024 i 024 K: u N i = 8 > < > : 0 u 0 i = 0 dN u 0 i e=N i 6= i 003 and u 0 i > 0 u N i 000 P j6=i 003 {u N j 000 u 0 j } i = i 003 : It is easy to verify that the components of ~u N sum to 1. All the components in ~u 0 , other than the i 003 'th, are increased by at most 1=N . The component u N i 003 is decreased by at most K=N . 
Wewill show that ~u N has the right properties for all N >N 0 , where N 0 is such that 1=N 0 < min{u i 003 ; ffi=2; ffi 0 }=2K. The fact that K=N 0 < u i 003 guarantees that ~u N is in 001 K for all N >N 0 . The fact that 2K=N 0 < ffi=2 guarantees that ~u N is within ffi=2 of ~u 0 , and hence within ffi of ~u. Since 2K=N 0 < ffi 0 , it follows that|~u 0 000 ~u N | < ffi 0 . Since ~u N is constructed 73Grove, Halpern, & Koller to have exactly the same 0 coordinates as ~u 0 , we conclude that ~u N 2 Sol [000 < {KB[~034 ]}], as required. Condition {i}, and hence the entire theorem, now follows.It now remains to prove Lemma B.5, which was used in the proof just given. As we hinted earlier, this requires tools from algebraic geometry. We base our definitions on the presentation in {Bochnak, Coste, & Roy, 1987}. A subset A of IR ` is said to be semi-algebraic if it is definable in the language of real-closed fields. That is, A is semi-algebraic if there is a first-order formula '{x 1 ; : : :; x ` } whose free variables are x 1 ; : : :; x ` and whose only non- logical symbols are 0, 1, +, 002, < and =, such that IR|= '{u 1 ; : : : ; u ` } iff {u 1 ; : : : ; u ` } 2 A. 12 A function f : X ! Y , where X 022 IR h and Y 022 IR ` , is said to be semi-algebraic if its graph {i.e.,{{~u; ~w} : f {~u} = ~w}} is semi-algebraic. The main tool we use is the following Curve SelectionLemma {see {Bochnak et al., 1987, p. 34}): Lemma B.3: Suppose that A is a semi-algebraic set in IR ` and ~u 2A. Then there exists a continuous, semi-algebraic function f : [0; 1] ! IR ` such that f {0} = ~u and f {t} 2 A for all t 2 {0; 1]. Our first use of the Curve Selection Lemma is in the following, which says that, in a certain sense, semi-algebraic functions behave \nicely" near limits. The type of phenomenon we wish to avoid is illustrated by x sin 1x which is continuous at 0, but has infinitely many local maxima and minima near 0. Proposition B.4: Suppose that g : [0; 1] ! 
IR is a continuous, semi-algebraic function such that g{u} > 0 if u > 0 and g{0} = 0. Then there exists some * > 0 such that g is strictlyincreasing in the interval [0; * ]. Proof: Suppose,by way of contradiction, that g satisfies the hypotheses of the proposition but there is no * such that g is increasing in the interval [0; * ]. Wedefine a point u in [0; 1] to be bad iffor some u 0 2 [0; u} we have g{u 0 } 025 g{u}. Let A be the set of all the bad points. Since g is semi-algebraic so is A, since u 0 2 A iff 9u 0 {(0 024 u 0 < u} ^ {g{u} 024 g{u 0 })}: Since, by assumption, g is not increasing in any interval [0; * ], we can find bad points arbitrarily close to 0 and so 0 2A. By the Curve Selection Lemma, there is a continuous semi-algebraic curve f : [0; 1] ! IR such that f {0} = 0 and f {t} 2 A for all t 2 {0; 1]. Because of the continuity of f , the range of f , i.e., f {[0; 1]}, is [0; r] for some r 2 [0; 1]. By the definition of f , {0; r] 022 A. Since062 A, it follows that f {1} 6= 0; therefore r > 0 and so, by assumption, g{r} > 0. Since g is a continuous function, it achieves a maximum v > 0 over the range [0; r]. Consider the minimum point in the interval where this maximum is achieved. More precisely, let u be the infimum of the set {u 0 2 [0; r] : g{u 0 } = v}; clearly, g{u} = v. Sincev > 0 we obtain that u > 0 and therefore u 2 A. Thus, u is bad. Butthat means that there is a point u 0 < u for which g{u 0 } 025 g{u}, which contradicts the choice of v and u.We can now prove Lemma B.5. Recall, the result we need is as follows.12. In {Bochnak et al., 1987}, a set is taken to be semi-algebraic if it is definable by a quantifier-free formula in the language of real closed fields. However, as observed in {Bochnak et al., 1987}, since the theory of real closed fields admits elimination of quantifiers {Tarski, 1951}, the two definitions are equivalent. 74Random Worlds and Maximum Entropy Lemma B.5: Forall sufficiently small ~034 , S <~034 [KB] = S ~034 [KB ]. 
Proof: Clearly S <~034 [KB] 022 S ~034 [KB]. To prove the reverse inclusion weconsider d KB, a canonical form equivalent of KB . We consider each disjunct of d KB separately. Let 030 be a conjunction that is one of the disjuncts in d KB. It clearly suffices to show that Sol[000{030[~034]}] 022 S <~034 [030] =Sol[000 < {030[~034]}]. Assume, by way of contradiction, that for arbitrarily small ~034 , there exists some ~u 2 Sol[000{030[~034 ]}] which is \separated" from the set Sol [000 < {030[~034]}], i.e., is not in its closure. More formally, we say that ~u is ffi-separated from Sol[000 < {030[~034]}] if there is no ~u 0 2 Sol[000 < {030[~034]}] such that|~u 000~u 0 | < ffi. We now consider those ~034 andthose points in Sol[000{030[~034]}] that are separated from Sol[000 < {030[~034]}]: 13 A ={{~034;~u; ffi} : ~034 > ~ 0; ffi > 0; ~u 2 Sol[000{030[~034 ]}] is ffi-separated from Sol[000 < {030[~034 ]}]}: Clearly A is semi-algebraic. By assumption, there are points in A for arbitrarily small tolerance vectors ~034 . Since A is a bounded subset of IR m+K+1 {where m is the number of tolerance values in ~034 }, we can use the Bolzano{Weierstrass Theorem to conclude that this set of points has an accumulation point whose first component is ~ 0. Thus,there is a point { ~ 0; ~w; ffi 0 } inA. By the Curve Selection Lemma, there is a continuous semi-algebraic function f : [0;1] ! IR m+K+1 such that f {0} = { ~ 0; ~w; ffi 0 } and f {t} 2 A for t 2 {0; 1]. Since f issemi-algebraic, it is semi-algebraic in each of its coordinates. By Lemma B.4, there is some v > 0 such that f isstrictly increasing in each of its first m coordinates over the domain [0; v]. Suppose that f {v} = {~034 ;~u; ffi}. Now, consider the constraints in 000{030[~034 ]} that have the form q{ ~w} > 034 j q 0 { ~w}. These constraints are all satisfied by ~u and they all involve strong inequalities. 
By the continuity of the polynomials q and q 0 , there exists some * > 0 such that, for all ~u 0 such that|~u 000~u 0 | < * , ~u 0 also satisfies these constraints. Now, by the continuity of f , there exists a point v 0 2 {0; v} sufficiently close to v such that if f {v 0 } = {~034 0 ;~u 0 ; ffi 0 }, then |~u 000 ~u 0 | < min{ffi; * }. Since f {v} = {~034;~u; ffi} 2 A and|~u000~ u 0 | < ffi, it follows that ~u 0 62 Sol[000 < {030[~034]}]. We conclude the proof by showing that this is impossible. That is, we show that~u 0 2 Sol [000 < {030[~034]}]. The constraints appearing in 000 < {030[~034 ]} can be of the following forms: q 0 { ~w} = 0, q 0 { ~w} > 0, q{ ~w} < 034 j q 0 { ~w}, or q{ ~w} > 034 j q 0 { ~w}, where q 0 is a positive polynomial. Since f {v 0 } 2 A, we know that ~u 0 2 Sol[000{030[~034 0 ]}]. The constraints of the form q 0 { ~w} = 0 and q 0 { ~w} > 0 are identical in 000{030[~034 0 ]} and in 000 < {030[~034]}, and are therefore satisfied by ~u 0 . Since|~u 0 000~ u| < * , our discussion in the previous paragraph implies that the constraints of the form q{ ~w} > 034 j q 0 { ~w} are also satisfied by ~u 0 . Finally, consider a constraint of the form q{ ~w} < 034 j q 0 { ~w}. The corresponding constraint in 000{030[~034 0 ]} is q{ ~w} 024 034 0 j q 0 { ~w}. Since ~u 0 satisfies this latter constraint, we know that q{~u 0 } 024 034 0 j q 0 {~u 0 }. But now, recall that we proved that f is increasing over [0; v] in the first m coordinates. In particular, 034 0 j < 034 j . By the definition of canonical form, q 0 {~u 0 } > 0, so that we conclude q{~u 0 } 024 034 0 j q 0 {~u 0 } < 034 j q 0 {~u 0 }. Hence the constraints of this type are also satisfied by ~u 0 . This concludes the proof that ~u 0 2 Sol[000 < {KB[~034 ]}], thus deriving a contradiction and proving theresult.We are finally ready to prove Theorem 3.13.13. Weconsider only those components in the infinite vector ~034 thatactually appear in Sol[000{030[~034]}]. 
75Grove, Halpern, & Koller Theorem 3.13: For all sufficiently small ~034 , the following is true. Let Q be the points with greatest entropy in S ~034 [KB] and let O 022 IR K be any open set containing Q. Then for all 022 2 L 031 and for lim 003 2 flim sup; lim inf} we have lim N!1 003 Pr ~034 N {022|KB } = lim N!1 003 #worlds ~034 N [O]{022 ^ KB }#worlds ~034 N [O]{KB} : Proof: Let ~034 besmall enough so that Theorem B.2 applies and let Q and O be as in the statement of the theorem. It clearly suffices to show that the set O contains almost all of the worlds that satisfy KB . More precisely, the fraction of such worlds that are in O tends to 1 as N !1: Let 032 be the entropy of the points in Q. We begin the proof by showing the existence of 032 L < 032 U {< 032} such that {for sufficiently large N } {a} every point~u 2 005 ~034 N [KB] where ~u 62 O has entropy at most 032 L and {b} there is at least one point~u 2 005 ~034 N [KB] with~u 2 O and entropy at least 032 U . For part {a}, consider the space S ~034 [KB] 000 O. Since this space is closed, the entropy function takes on a maximum value in this space; let this be 032 L . Since this space does not include any point with entropy 032 {these are all in Q 022 O}, we must have 032 L < 032. ByTheorem B.2, 005 ~034 N [KB ] 022 S ~034 [KB]. Therefore, for any N , the entropy of any point in 005 ~034 N [KB] 000 O is at most 032 L . For part {b}, let 032 U be some value in the interval {032 L ; 032} {for example {032 L + 032}=2} and let ~v be any point in Q. By the continuity of the entropy function, there exists some ffi > 0 such that for all~u with|~u 000 ~v| < ffi, we have H{~u} 025 032 U . Because O is open we can, by considering a smaller ffi if necessary, assume that|~u 000 ~v| < ffi implies ~u 2 O. By the second partof Theorem B.2, there is a sequence of points ~u N 2 005 ~034 N [KB ] such that lim N!1 ~u N = ~v. 
In particular, for N large enough we have|~u N 000 ~v| < ffi, so that H{~u N } > 032 U , proving part {b}. To complete the proof, we use Lemma 3.11 to conclude that for all N , #worlds ~034 N {KB} 025 #worlds ~034 N [~u N ]{KB} 025 {h{N }=f {N })e NH{~u N } 025 {h{N }=f {N })e N032 U : On the other hand, #worlds ~034 N [001 K 000 O]{KB} 024 X ~u2005 ~ 034 N [KB]000O #worlds ~ 034 N [~u]{KB} 024 jf ~w 2 005 ~034 N [KB] : ~w 62 Ogj h{N }g{N }e N032 L 024 {N + 1} K h{N }g{N }e N032 L : Therefore the fraction of models of KB whichare outside O is at most {N +1} K h{N }f {N }g{N}e N032 Lh{N }e N032 U = {N +1} K f {N }g{N }e N{032 U 000032 L } : Since {N + 1} k f {N }g{N } is a polynomial in N , this fraction tends to 0 as N grows large. The result follows.76Random Worlds and Maximum Entropy Appendix C. Proofs for Section 4 Proposition4.6: Assume that KB isessentially positive and let Q be the set of maximum- entropy points of S ~ 0 [KB] {and thus also of S 024 ~ 0 [KB]}. Then for all * > 0 and all sufficiently small tolerance vectors ~034 {where \sufficiently small" may depend on * }, every maximum- entropypoint of S ~034 [KB] is within * of some maximum entropy-point in Q. Proof: Fix * > 0. By way of contradiction, assume that that there is some sequence of tolerance vectors ~034 m , m = 1; 2; : : :, that converges to ~ 0, and for each m a maximum- entropy point ~u m of S ~034 m [KB] such that for all m, ~u m is at least * away from Q. Since the space 001 K is compact, we can assume without loss of generality that this sequence convergesto some point ~u. Recall that 000{KB } is a finite combination {using \and" and \or"} of constraints, where every such constraint is of the form q 0 { ~w} = 0, q 0 { ~w} > 0, q{ ~w} 024 " j q 0 { ~w}, or q{ ~w} > " j q 0 { ~w}, such that q 0 is a positive polynomial. Since the overall number of constraints is finite we can assume, again without loss of generality, that all the ~u m 's satisfy precisely the same constraints. 
We claim that the corresponding conjuncts in 000 024 {KB[ ~ 0]}are satisfied by ~u. For a conjunct of the form q 0 { ~w} = 0 note that, if q 0 {~u m } = 0 for all m, then this also holds at the limit, so that q{~u} = 0. A conjunct of the form q 0 {~w} > 0 translates into q 0 { ~w} 025 0 in 000 024 {KB [ ~ 0]}; such conjuncts are trivially satisfied by any point in 001 K . If a conjunct of the form q{ ~w} 024 " j q 0 { ~w} is satisfied for all~u m and ~034 m , then at the limit we have q{~u} 024 0, which is precisely the corresponding conjunct in 000 024 {KB [ ~ 0]}. Finally, for a conjunct of the form q{ ~w} > " j q 0 { ~w}, if q{~u m } > 034 m j q 0 {~u m } for all m, then at the limit we have q{~u} 025 0, which again is the corresponding conjunct in 000 024 {KB [ ~ 0]}. It follows that ~u is in S 024 ~ 0 [KB ]. By assumption, all points ~u m are at least * away from Q. Hence, ~u cannot be in Q. If we let 032 represent the entropy of the points in Q, since Q is the set of all maximum- entropy points in S 024 ~ 0 [KB ], it follows that H{~u} < 032. Choose 032 L and 032 U such that H{~u} < 032 L < 032 U < 032. Since the entropy function is continuous, we know that for sufficiently large m, H{~u m } 024 032 L . Since ~u m is a maximum-entropy point of S ~034 m [KB], it follows that the entropy achieved in this space for sufficiently large m is at most 032 L . We derive a contradiction by showing that for sufficiently large m, there is some point in Sol[000{KB[~034 m ]}] with entropy at least 032 U . The argument is as follows. Let ~v be some point in Q. Since ~v is a maximum-entropy point of S ~ 0 [KB], there are points in Sol[000{KB[ ~ 0]}] arbitrarily close to ~v. In particular, there is some point ~u 0 2 Sol[000{KB [ ~ 0]}] whose entropy is at least 032 U . As we now show, this point is also in Sol[000{KB[~034 ]}] for all sufficiently small ~034 . 
Again, consider all the conjuncts in 000{KB[ ~ 0]} satisfied by ~u 0 and the corresponding conjuncts in 000{KB [~034]}. Conjuncts of the form q 0 { ~w} = 0 and q 0 { ~w} > 0 in 000{KB[ ~ 0]} remain unchanged in 000{KB[~034 ]}. Conjuncts of the form q{ ~w} 024 034 j q 0 { ~w} in 000{KB [~034]} are certainly satisfied by ~u 0 , since the corresponding conjunct in 000{KB[ ~ 0]}, namely q{ ~w} 024 0, is satisfied by ~u 0 , so that q{~u 0 } 024 0 024 034 j q 0 {~u 0 } {recall that q 0 is a positive polynomial}. Finally, consider a conjunctin 000{KB[~034 ]} of the form q{ ~w} > 034 j q 0 { ~w}. Thecorresponding conjunct in 000{KB[ ~ 0]} is q{ ~w} > 0. Suppose q{~u 0 } = ffi > 0. Since the value of q 0 is bounded over the compact space001 K , it follows that for all sufficiently small 034 j , 034 j q 0 {~u 0 } < ffi. Thus, q{~u 0 } > 034 j q 0 {~u 0 } for all sufficiently small 034 j , as required. It follows that ~u 0 is in Sol[000{KB [~034]}] for all sufficiently small ~034 and,in particular, in Sol[000{KB[~034 m ]}] for all sufficiently large m. But H{~u 0 } 025 032 U , whereas we showed that the maximum entropy achieved in S ~034 m [KB ] is at most 032 L < 032 U . 77Grove, Halpern, & Koller This contradiction proves that our assumption was false, so that the conclusion of the proposition necessarily holds.Theorem 4.9: Suppose '{c} is a simple query for KB. For all ~034 sufficiently small, if Q is the set of maximum-entropy points in S ~034 [KB] and F [ ] {~v} > 0 for all ~v 2 Q, then for lim 003 2 flim sup; lim inf} we have lim N!1 003 Pr ~034 N {'{c}|KB} 2 " inf ~v2Q F ['j ] {~v}; sup ~v2Q F ['j ] {~v} # : Proof: LetW 2 W 003 , and let ~u = 031{W }. The value of the proportion expression jj {x}jj x at W is clearly X A j 2A{ } jjA j {x}jj x = X A j 2A{ } u j = F [ ] {~u}: If F [ ] {~u} > 0, then by the same reasoning we conclude that the value of k'{x}| {x}k x at W is equal to F ['j ] {~u}. 
Now, let 025 L and 025 R be inf ~v2Q F ['j ] {~v} and sup ~v2Q F ['j ] {~v} respectively; by our as- sumption, F ['j ] {~v} is well-defined forall ~v 2 Q. Since the denominator is not 0, F ['j ] is a continuous function at each maximum-entropy point. Thus, since F ['j ] {~v} 2 [025 L ; 025 R ] for all maximum-entropy points, the value of F ['j ] {~u} for ~u \close" to some ~v 2 Q, will either be in the range [025 L ; 025 U ] or very close to it. More precisely, choose any * > 0, and define 022[ * ] to be the formula k'{x}| {x}k x 2 [025 L 000 * ; 025 U + * ]: Since * > 0, it is clear that there is some sufficiently small open set O around Q such that this proportion expression is well-defined and within these bounds at all worlds in O. Thus, by Corollary 3.14, Pr ~034 1 {022[ * ]|KB} = 1. Using Theorem 3.16, we obtain that for lim 003 as above, lim N!1 003 Pr ~034 N {'{c}|KB} = lim N!1 003 Pr ~034 N {'{c}|KB ^ 022[ * ]}: But now we can use the direct inference technique outlined earlier. We are interested in theprobability of '{c}, where the only information we have about c in the knowledge base is {c} and where we have statistics for k'{x}| {x}k x . These are precisely the conditions under which Theorem 4.1 applies. We conclude that lim N!1 003 Pr ~034 N {'{c}|KB} 2 [025 L 000 * ; 025 U + * ]: Since this holds for all * > 0, it is necessarily the case that lim N!1 003 Pr ~034 N {'{c}|KB} 2 [025 L ; 025 U ]; as required.Theorem 4.11: Suppose '{c} is a simple query for KB . If the space S ~ 0 [KB ] has a unique maximum-entropy point~v, KB is essentially positive, and F [ ] {~v} > 0, then Pr 1 {'{c}|KB} = F ['j ] {~v}: 78Random Worlds and Maximum Entropy Proof: Note that the fact that S ~ 0 [KB] has a unique maximum-entropy point does not guarantee that this is also the case for S ~034 [KB]. However, Proposition 4.6 implies that the maximum-entropy points of the latter space are necessarily close to~v. 
More precisely, if we choose some * > 0, we conclude that for all sufficiently small ~034 , all the maximum- entropy points of S ~034 [KB] will be within * of ~v. Now, pick some arbitrary ffi > 0. Since F [ ] {~v} > 0, it follows that F ['j ] is continuous at ~v. Therefore, there exists some * > 0 such that if~u is within * of~v, F ['j ] {~u} is within ffi of F ['j ] {~v}. In particular, this is the case for all maximum-entropy points of S ~034 [KB] for all sufficiently small ~034 . This allows us to apply Theorem 4.9 and conclude that for all sufficiently small ~034 andfor lim 003 2 flim sup; lim inf}, lim 003 N!1 Pr ~034 N {'{c}|KB} is within ffi of F ['j ] {~v}. Hence, this is also the casefor lim ~034! ~ 0 lim 003 N!1 Pr ~ 034 N {'{c}|KB}. Since this holds for all ffi > 0, it follows that lim ~034! ~ 0 lim inf N!1 Pr ~034 N {'{c}|KB} = lim ~034! ~ 0 lim sup N!1 Pr ~034 N {'{c}|KB} = F ['j ] {~v}: Thus, by definition, Pr 1 {'{c}|KB} = F ['j ] {~v}.Theorem 4.14: Let 003 be a conjunction of constraints of the form Pr{fi|fi 0 } = 025 or Pr{fi|fi 0 } 2 [025 1 ; 025 2 ]. There is a unique probability distribution 026 003 of maximum entropy satisfying 003. Moreover, for all fi and fi 0 , if Pr 026 003 {fi 0 } > 0, then Pr 1 {030 fi {c}|030 fi 0 {c} ^ KB 0 [003]} = Pr 026 003 {fi|fi 0 }: Proof: Clearly, the formulas '{x} = 030 fi {x} and {x} = 030 fi 0 {x} are essentially propositional. The knowledge base KB 0 [003] is in the form of a conjunction of simple proportion formulas, none of which are negated. As a result, the set of constraints associated with KB = {c} ^ KB 0 [003] also has a simple form. KB 0 [003] generates a conjunction of constraints which can be taken as having the form q{ ~w} 024 " j q 0 { ~w}. On the other hand, {c}generates someBoolean combination of constraints all of which have the form w j > 0. We begin by considering the set S 024 ~ 0 [KB] {rather than S ~ 0 [KB ]}, so we can ignore the latter constraints for now. 
S 024 ~ 0 [KB] is defined by a conjunction of linear constraints which {as discussed earlier} implies that it is convex, and thus has a unique maximum-entropy point, say ~v. Let 026 003 = 026 ~v be the distribution over 012 corresponding to ~v. It is clear that the constraints of 000 024 {KB[ ~ 0]} on the points of 001 K are precisely the same ones as those of 003. Therefore, 026 003 is the unique maximum-entropy distribution satisfying the constraints of 003. By Remark 4.13, it follows that F [030 fi 0 ] {~v} = 026 003 {fi 0 }. Since we have assumed that 026 003 {fi 0 } > 0, we are are almost in a positionto use Theorem 4.11. Itremains to prove essential positivity. Recall that the difference between 000 024 {KB [ ~ 0]} and 000{KB [ ~ 0]} is that the latter may have some conjuncts of the form w j > 0. Checking definitions 3.4 and 3.6 we see that such terms can appear only due to 030 fi 0 {c} and, in fact, together they assert that F [030 fi 0 ] { ~w} > 0. But we have assumed that F [030 fi 0 ] {~v} > 0 and so ~v is a maximum-entropy point of S ~ 0 [KB ] as well. Thus, essential positivity holds and so, by Theorem 4.11, Pr 1 {'{c}| {c} ^ KB 0 [003]} = F ['j ] {026 003 } = Pr 026 003 {fi|fi 0 } as required.79Grove, Halpern, & Koller Theorem 4.15: Let c be a constant symbol. Using the translation described in Section 4.3, for a set R of defeasible rules, B ! C is an ME-plausible consequence of R iff Pr 1 030 C {c} fi fi fi fi fi 030 B {c} ^ ^ r2R 022 r ! = 1: Proof: Let KB 0 denote V r2R 022 r . For all sufficiently small ~034 and for * = 034 1 , let 026 003 denote 026 003 * ;R . It clearly suffices to prove that Pr ~034 1 {030 C {c}|030 B {c} ^ KB 0 } = Pr 026 003 {C|B}; where by equality we also mean that one side is defined iff the other is also defined. It is easy to verify that a point ~u in 001 K satisfies 000{KB 0 [~034]} iff the corresponding distribution 026 * -satisfies R. 
Therefore, the maximum-entropy point ~v of S ~034 [KB 0 ] {which is unique, bylinearity} corresponds precisely to 026 003 . Now, there are two cases: either 026 003 {B} > 0 or 026 003 {B} = 0. In the first case, by Remark 4.13, Pr 026 003 {030 B {c}) = F [030 B {c}] {~v}, so the latter is also positive. This also implies that~v is consistent with the constraints 000{ {c}) entailed by {c} = 030 B {c}, so that ~v is also the unique maximum-entropy point of S ~034 [KB] {where KB =030 B {c} ^ KB 0 }. We can therefore use Corollary 4.10 and Remark 4.13 to conclude that Pr ~034 1 {030 C {c}|KB} = F [030 C {c}j030 B {c}] {~v} = Pr 026 003 {C|B} and that all three terms are well-defined. Assume, on the other hand, that 026 003 {B} = 0, so that Pr 026 003 {C|B} is not well-defined. In this case, we can use a known result {see {Paris & Vencovska, 1989}) for the maximum-entropy point over a space defined by linear constraints, and conclude that for all 026 satisfying R, necessarily026{B} = 0. Using the connection between distributions 026 satisfying R and points ~u in S ~034 [KB 0 ], we conclude that this is also the case for all ~u 2 S ~034 [KB 0 ]. By part {a} of Theorem B.2, this means that in any world satisfying KB 0 , the proportion jj030 B {x}jj x is necessarily 0. Thus, KB 0 is inconsistent with 030 B {c}, and Pr ~034 1 {030 C {c}|030 B {c} ^ KB 0 } is also not well-defined.Appendix D. Proofs for Section 4.4 Theorem 4.24: If KB and ~034 > ~ 0 are stable for 033 003 then Pr ~034 1 {033 003 |KB} = 1. Proof: By Theorem 3.14, it suffices to show that there is some open neighborhood con- taining Q, the maximum-entropy points of S ~034 [KB], such that every world W of KB inthis neighborhood has 033{W } = 033 003 . So suppose this is not the case. Then there is some sequence of worlds W 1 ; W 2 ; : : : suchthat {W i ; ~034}|= KB ^ :033 003 and lim i!1 min ~v2Q |031{W i } 000 ~v| = 0. 
Since 001 K is compact the sequence 031{W 1 }; 031{W 2 }; : : : must have at least one accumulation point, say ~u. This point must be in the closure of the set Q. But, in fact, Q is a closed set {because entropy is a continuous function} and so ~u 2 Q. By part {a} of Theorem B.2, 031{W i } 2 S ~034 [KB ^ :033 003 ] for every i and so, since this space is closed, ~u 2 S ~034 [KB ^ :033 003 ] as well. But this means that ~u is an unsafe maximum-entropy point, contrary to the definition and assumption of stability.In the remainder of this section we prove Theorem 4.28. For this purpose, fix KB = ^ KB 0 , ', and 033 003 to be as in the statement of this theorem, and let ~v be the unique maximum-entropypoint of S ~ 0 [KB ]. 80Random Worlds and Maximum Entropy Let Z ={c 1 ; : : : ; c m } be the set of constant symbols appearing in and in '. Due to the separability assumption, KB 0 contains none of the constant symbols in Z. Let 037 6= be the formula V i6=j c i 6= c j . We first prove that 037 6= has probability 1 given KB 0 . Lemma D.1: For037 6= and KB 0 as above, Pr 1 {037 6= |KB 0 }= 1. Proof: We actually show that Pr 1 {:037 6= |KB 0 } = 0. Let c and c 0 be two constant symbols in{c 1 ; : : : ; c m } and consider Pr 1 {c = c 0 |KB 0 }. We again use the direct inference technique. Note that for any world of size N the proportion expression jjx = x 0 jj x;x 0 denotes exactly 1=N . It is thus easy to see that Pr 1 {jjx = x 0 jj x;x 0 031 i 0|KB 0 } = 1 {for any choice of i}. Thus, byTheorem 3.16, Pr 1 {c = c 0 |KB 0 } = Pr 1 {c = c 0 |KB 0 ^ jjx = x 0 jj x;x 0 031 i 0}. Butsince c and c 0 appear nowhere in KB 0 we can use Theorem 4.1 to conclude that Pr 1 {c = c 0 |KB 0 } = 0. 
It is straightforward to verify that, since :037 6= is equivalent to a finite disjunction, each disjunctof which implies c = c 0 for at least one pair of constants c and c 0 , we must have Pr 1 {:037 6= |KB 0 } = 0.As we stated in Section 4.4, our general technique for computing the probability of an arbitrary formula ' is to partition the worlds into a finite collection of classes such that ' behaves uniformly over each class and then to compute the relative weights of the classes. As we show later, the classes are essentially defined using complete descriptions. Their relative weight corresponds to the probabilities of the different complete descriptions given KB . Proposition D.2: Let KB = KB 0 ^ and ~v be as above. Assume that Pr 1 { |KB 0 } > 0. Let D be a complete description over Z that is consistent with . {a} If D is inconsistent with 037 6= , then Pr 1 {D|KB}= 0. {b} If D is consistent with 037 6= , then Pr 1 {D|KB} = F [D] {~v}P D 0 2A{ ^037 6= } F [D 0 ] {~ v} : Proof: First, observe that if all limits exist and the denominator is nonzero, then Pr 1 {:037 6= | ^ KB 0 } = Pr 1 {:037 6= ^ |KB 0 }Pr 1 { |KB 0 } : By hypothesis, the denominator is indeed nonzero. Furthermore, by Lemma D.1, Pr 1 {:037 6= ^ |KB 0 } 024 Pr 1 {:037 6= |KB 0 } = 0. Hence Pr 1 {037 6= |KB} = Pr 1 {037 6= |KB 0 ^ } = 1. We can therefore use Theorem 3.16 to conclude that Pr 1 {D|KB}= Pr 1 {D|KB ^037 6= }: Part {a} of the proposition follows immediately. To prove part {b}, recall that is equivalent to the disjunction W E2A{ } E. By simple probabilistic reasoning, the assumption that Pr 1 { |KB 0 } > 0, and part {a}, we conclude that Pr 1 {D| ^ KB 0 } = Pr 1 {D ^ |KB 0 }Pr 1 { |KB 0 } = Pr 1 {D ^ |KB 0 }P E2A{ ^037 6= } Pr 1 {E|KB 0 } : 81Grove, Halpern, & Koller By assumption, D is consistent with 037 6= and is in A{ }. Since D is a complete description, we must have that D } is valid. Thus, the numerator on the right-hand side of this equationis simply Pr 1 {D|KB 0 }. 
Hence, the problem of computing Pr 1 {D|KB} reduces to a series of computations of the form Pr 1 {E|KB 0 } for various complete descriptions E. Fix any such description E. Recall that E can be decomposed into three parts: the unary part E 1 , the non-unary part E >1 , and the equality part E = . Since E is in A{037 6= }, we conclude that037 6= is equivalent to E = . Using Theorem 3.16 twice and some probabilistic reasoning, we get: Pr 1 {E >1 ^E 1 ^ E = |KB 0 } = Pr 1 {E >1 ^ E 1 ^ E = |KB 0 ^ 037 6= } = Pr 1 {E >1 ^ E 1 |KB 0 ^ 037 6= } = Pr 1 {E >1 |KB 0 ^ 037 6= ^ E 1 } 001 Pr 1 {E 1 |KB 0 ^ 037 6= } = Pr 1 {E >1 |KB 0 ^ 037 6= ^ E 1 } 001 Pr 1 {E 1 |KB 0 }: In order to simplify the first expression, recall that none of the predicate symbols in E >1 occur anywhere in KB 0 ^ 037 6= ^ E 1 . Therefore, the probability of E >1 given KB 0 ^ 037 6= is equal to the probability that the elements denoting the jZj {different} constants satisfy some particular configuration of non-unary properties. It should be clear that, by symmetry, all such configurations are equally likely. Therefore, the probability of any one of them is a constant, equal to 1 over the total number of configurations. 14 Let 032 denote the constant which is equal to Pr 1 {E >1 |KB 0 ^ 037 6= ^ E 1 } for all E. The last step is to show that, if E 1 is equivalent to V m j=1 A i j {c j }, then Pr 1 {E 1 |KB 0 } = F [D] {~v}: Pr 1 { m ^ j=1 A i j {c j }|KB 0 } = Pr 1 {A i 1 {c 1 }| m ^ j=2 A i j {c j } ^ KB 0 } 001 Pr 1 {A i 2 {c 2 }| m ^ j=3 A i j {c j } ^ KB 0 } 001 : : : 001 Pr 1 {A i m0001 {c m0001 }|A i m {c m } ^ KB 0 } 001 Pr 1 {A i m {c m }|KB 0 } = v i 1 001 : : : 001 v i m {using Theorem 4.11; see below} = F [D] {~v}: The first step is simply probabilistic reasoning. The second step uses m applications of Theorem 4.11. It is easy to see that A i j {c j } is a simple query for A i j+1 {c j+1 } ^ : : : ^ A i m {c m } ^ KB 0 . 
We would like to show that Pr 1 {A i j {c j }| m ^ `=j+1 A i ` {c ` } ^ KB 0 } = Pr 1 {A i j {c j }|KB 0 } = v i j ; where Theorem 4.11 justifies the last equality. To prove the first equality, we show that for all j, the spaces S ~ 0 [KB 0 ] and S ~ 0 [ V m `=j+1 A i ` {c j } ^ KB 0 ] have the same maximum-entropy point, namely ~v. This is proved by backwards induction; the j = m case is trivially true. The difference between the {j000 1}st and jth case is the added conjunct A i j {c j }, which amounts to adding the new constraint w i j > 0. There are two possibilities. First, if v i j > 0,14. Although we do not need the value of this constant in our calculations below, it is in fact easy to verify that its value is Q R2{010000011} 2 m arity{R} , where m = jZj. 82Random Worlds and Maximum Entropy then ~v satisfies this new constraint anyway and so remains the maximum-entropy point, completing this step of the induction. If v i j = 0 this is not the case, and indeed, the property we are trying to prove can be false {for j < m}. Butthis does not matter, because we then know that Pr 1 {A i j {c j }| V m `=j+1 A i ` {c ` }^KB 0 } = Pr 1 {A i j {c j }|KB 0 } = v i j = 0. Since both of the products in question include a 0 factor, it is irrelevant as to whether the other terms agree. We can now put everything together to conclude that Pr 1 {D|KB} = Pr 1 {D|KB 0 }P E2A{ ^037 6= } Pr 1 {E|KB 0 } = F [D] {~v}P E2A{ ^037 6= } F [E] {~ v} ; proving part {b}.We now address the issue of computing Pr 1 {'|KB} for an arbitrary formula '. To do that, we must first investigate the behavior of Pr ~034 1 {'|KB} for small ~034 . Fix some sufficiently small ~034 > 0, and let Q be the set of maximum-entropy points of S ~034 [KB]. Assume KB and ~034 are stable for 033 003 . By definition, this means that for every ~v 2 Q, we have 033{~v} = 033 003 . Let I bethe set of i's for which 033 003 contains the conjunct 9xA i {x}. 
Since $\sigma(\vec v) = \sigma^*$ for all $\vec v \in Q$, we must have $v_i > 0$ for all $i \in I$. Since $Q$ is a closed set, this implies that there exists some $\epsilon > 0$ such that for all $\vec v \in Q$ and all $i \in I$, we have $v_i > \epsilon$. Let $\theta[\epsilon]$ be the formula
\[
\bigwedge_{i \in I} \|A_i(x)\|_x > \epsilon.
\]
The following proposition is now easy to prove:

Proposition D.3: Suppose that $\mathit{KB}$ and $\vec\tau$ are stable for $\sigma^*$ and that $Q$, $I$, $\theta[\epsilon]$, and $\chi_{\neq}$ are as above. Then
\[
\Pr^{\vec\tau}_\infty(\varphi|\mathit{KB}) = \sum_{D \in \mathcal{A}(\psi)} \Pr^{\vec\tau}_\infty(\varphi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) \cdot \Pr^{\vec\tau}_\infty(D|\mathit{KB}).
\]

Proof: Clearly, $\theta[\epsilon]$ satisfies the conditions of Corollary 3.14, allowing us to conclude that $\Pr^{\vec\tau}_\infty(\theta[\epsilon]|\mathit{KB}) = 1$. Similarly, by Theorem 4.24 and the assumptions of Theorem 4.28, we can conclude that $\Pr^{\vec\tau}_\infty(\sigma^*|\mathit{KB}) = 1$. Since the conjunction of two assertions that each have probability 1 also has probability 1, we can use Theorem 3.16 to conclude that $\Pr^{\vec\tau}_\infty(\varphi|\mathit{KB}) = \Pr^{\vec\tau}_\infty(\varphi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^*)$. Now, recall that $\psi$ is equivalent to the disjunction $\bigvee_{D \in \mathcal{A}(\psi)} D$. By straightforward probabilistic reasoning, we can therefore conclude that
\[
\Pr^{\vec\tau}_\infty(\varphi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^*) = \sum_{D \in \mathcal{A}(\psi)} \Pr^{\vec\tau}_\infty(\varphi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) \cdot \Pr^{\vec\tau}_\infty(D \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^*).
\]
By Theorem 3.16 again, $\Pr^{\vec\tau}_\infty(D \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^*) = \Pr^{\vec\tau}_\infty(D|\mathit{KB})$. The desired expression now follows.

We now simplify the expression $\Pr^{\vec\tau}_\infty(\varphi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D)$:

Proposition D.4: For $\varphi$, $\mathit{KB}$, $\sigma^*$, $D$, and $\theta[\epsilon]$ as above, if $\Pr^{\vec\tau}_\infty(D|\mathit{KB}) > 0$, then
\[
\Pr^{\vec\tau}_\infty(\varphi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) = \Pr_\infty(\varphi \mid \sigma^* \wedge D),
\]
and its value is either 0 or 1. Note that since the latter probability refers only to first-order formulas, it is independent of the tolerance values.
Proof: That the right-hand side is either 0 or 1 is proved in (Grove et al., 1993b), where it is shown that the asymptotic probability of any pure first-order sentence, when conditioned on knowledge of the form $\sigma^* \wedge D$ (which is, essentially, what was called a model description in (Grove et al., 1993b)), is either 0 or 1. Very similar techniques can be used to show that the left-hand side is also either 0 or 1, and that the conjuncts $\mathit{KB} \wedge \theta[\epsilon]$ do not affect this limit (so that the left-hand side and the right-hand side are in fact equal). We briefly sketch the relevant details here, referring the reader to (Grove et al., 1993b) for full details.

The idea (which actually goes back to Fagin (1976)) is to associate with a model description such as $\sigma^* \wedge D$ a theory $T$ which essentially consists of extension axioms. Intuitively, an extension axiom says that any finite substructure of the model defined by a complete description $D'$ can be extended in all possible ways definable by another description $D''$. We say that a description $D''$ extends a description $D'$ if all conjuncts of $D'$ are also conjuncts of $D''$. An extension axiom has the form
\[
\forall x_1, \ldots, x_j \, (D' \Rightarrow \exists x_{j+1}\, D''),
\]
where $D'$ is a complete description over $X = \{x_1, \ldots, x_j\}$ and $D''$ is a complete description over $X \cup \{x_{j+1}\}$, such that $D''$ extends $D'$, both $D'$ and $D''$ extend $D$, and both are consistent with $\sigma^*$. It is then shown that (a) $T$ is complete (so that for each formula $\xi$, either $T \models \xi$ or $T \models \neg\xi$) and (b) if $\xi \in T$ then $\Pr_\infty(\xi \mid \sigma^* \wedge D) = 1$. From (b) it easily follows that if $T \models \xi$, then $\Pr_\infty(\xi \mid \sigma^* \wedge D)$ is also 1. Using (a), the desired 0-1 law follows.

The only difference from the proof in (Grove et al., 1993b) is that we need to show that (b) holds even when we condition on $\mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$ instead of just on $\sigma^* \wedge D$. So suppose $\xi$ is the extension axiom $\forall x_1, \ldots, x_j \, (D' \Rightarrow \exists x_{j+1}\, D'')$.
We must show that $\Pr_\infty(\xi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) = 1$. We first want to show that the right-hand side of the conditional is consistent. As observed in the previous proof, it follows from Theorem 3.16 that $\Pr^{\vec\tau}_\infty(D|\mathit{KB}) = \Pr^{\vec\tau}_\infty(D \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^*)$. Since we are assuming that $\Pr^{\vec\tau}_\infty(D|\mathit{KB}) > 0$, it follows that $\Pr_\infty(\mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) > 0$, and hence $\mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$ must be consistent.

Fix a domain size $N$ and consider the set of worlds satisfying $\mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$. Now consider some particular $j$ domain elements, say $d_1, \ldots, d_j$, that satisfy $D'$. Observe that, since $D'$ extends $D$, the denotations of the constants are all among $d_1, \ldots, d_j$. For a given $d \notin \{d_1, \ldots, d_j\}$, let $B(d)$ denote the event that $d_1, \ldots, d_j, d$ satisfy $D''$, given that $d_1, \ldots, d_j$ satisfy $D'$. What is the probability of $B(d)$ given $\mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$? First, note that since $d$ does not denote any constant, it cannot be mentioned in any way in the knowledge base. Thus, this probability is the same for all $d$. The description $D''$ determines two types of properties for $x_{j+1}$: the unary properties of $x_{j+1}$ itself (that is, the atom $A_i$ to which $x_{j+1}$ must belong) and the relations between $x_{j+1}$ and the remaining variables $x_1, \ldots, x_j$ in terms of the non-unary predicate symbols. Since $D''$ is consistent with $\sigma^*$, the size description $\sigma^*$ must contain the conjunct $\exists x\, A_i(x)$ if $D''$ implies $A_i(x_{j+1})$. By definition, $\theta[\epsilon]$ must therefore contain the conjunct $\|A_i(x)\|_x > \epsilon$. Hence, the probability of picking $d$ in $A_i$ is at least $\epsilon$. For any sufficiently large $N$, the probability of picking $d$ in $A_i$ that is different from $d_1, \ldots, d_j$ (as required by the definition of the extension axiom) is at least $\epsilon/2 > 0$.
The probability that $d_1, \ldots, d_j, d$ also satisfy the remaining conjuncts of $D''$, given that $d$ is in atom $A_i$ and $d_1, \ldots, d_j$ satisfy $D'$, is very small but bounded away from 0. (For this to hold, we need the assumption that the non-unary predicates are not mentioned in the $\mathit{KB}$.) This is the case because the total number of possible ways to choose the properties of $d$ (as they relate to $d_1, \ldots, d_j$) is independent of $N$. We can therefore conclude that, for sufficiently large $N$, the probability of $B(d)$, given that $d_1, \ldots, d_j$ satisfy $D'$, is bounded away from 0 by some $\lambda$ independent of $N$. Since the properties of an element $d$ and its relations to $d_1, \ldots, d_j$ can be chosen independently of the properties of a different element $d'$, the events $B(d), B(d'), \ldots$ are all independent. Therefore, the probability that no domain element at all, together with $d_1, \ldots, d_j$, satisfies $D''$ is at most $(1 - \lambda)^{N-j}$. This bounds the probability of the extension axiom being false relative to the fixed $d_1, \ldots, d_j$. There are $\binom{N}{j}$ ways of choosing these $j$ elements, so the probability of the axiom failing anywhere in a model is at most $\binom{N}{j}(1 - \lambda)^{N-j}$, which tends to 0 as $N$ goes to infinity. Therefore, the extension axiom $\forall x_1, \ldots, x_j\,(D' \Rightarrow \exists x_{j+1}\, D'')$ has asymptotic probability 1 given $\mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D$, as desired.

Finally, we are in a position to prove Theorem 4.28.

Theorem 4.28: Let $\varphi$ be a formula in $\mathcal{L}^{\approx}$ and let $\mathit{KB} = \mathit{KB}' \wedge \psi$ be an essentially positive knowledge base in $\mathcal{L}^{\approx}_1$ which is separable with respect to $\varphi$. Let $Z$ be the set of constants appearing in $\varphi$ or in $\psi$ (so that $\mathit{KB}'$ contains none of the constants in $Z$), and let $\chi_{\neq}$ be the formula $\bigwedge_{c, c' \in Z} c \neq c'$.
Assume that there exists a size description $\sigma^*$ such that, for all $\vec\tau > 0$, $\mathit{KB}$ and $\vec\tau$ are stable for $\sigma^*$, and that the space $S^{\vec 0}[\mathit{KB}]$ has a unique maximum-entropy point $\vec v$. Then
\[
\Pr_\infty(\varphi|\mathit{KB}) = \frac{\sum_{D \in \mathcal{A}(\psi \wedge \chi_{\neq})} \Pr_\infty(\varphi \mid \sigma^* \wedge D)\, F^{[D]}(\vec v)}{\sum_{D \in \mathcal{A}(\psi \wedge \chi_{\neq})} F^{[D]}(\vec v)}
\]
if the denominator is positive.

Proof: Assume without loss of generality that $\psi$ mentions all the constant symbols in $\varphi$, so that $\mathcal{A}(\psi \wedge \chi_{\neq}) \subseteq \mathcal{A}(\psi)$. By Proposition D.3,
\[
\Pr^{\vec\tau}_\infty(\varphi|\mathit{KB}) = \sum_{D \in \mathcal{A}(\psi)} \Pr^{\vec\tau}_\infty(\varphi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D) \cdot \Pr^{\vec\tau}_\infty(D|\mathit{KB}).
\]
Note that we cannot easily take limits of $\Pr^{\vec\tau}_\infty(\varphi \mid \mathit{KB} \wedge \theta[\epsilon] \wedge \sigma^* \wedge D)$ as $\vec\tau$ goes to $\vec 0$, because this expression depends on $\theta[\epsilon]$ and the value of $\epsilon$ used depends on the choice of $\vec\tau$. However, applying Proposition D.4, we get
\[
\Pr^{\vec\tau}_\infty(\varphi|\mathit{KB}) = \sum_{D \in \mathcal{A}(\psi)} \Pr_\infty(\varphi \mid \sigma^* \wedge D) \cdot \Pr^{\vec\tau}_\infty(D|\mathit{KB}).
\]
We can now take the limit as $\vec\tau$ goes to $\vec 0$. To do this, we use Proposition D.2. The hypotheses of the theorem imply that $\Pr_\infty(\psi|\mathit{KB}') > 0$ (for otherwise, the denominator $\sum_{D \in \mathcal{A}(\psi \wedge \chi_{\neq})} F^{[D]}(\vec v)$ would be zero). Part (a) of the proposition tells us that we can ignore those complete descriptions that are inconsistent with $\chi_{\neq}$. We can now apply part (b) to get the desired result.

Acknowledgements

We are very grateful to Professor Gregory Brumfiel of the Department of Mathematics at Stanford University for his invaluable help with the proof of Proposition B.5. We would like to thank Fahiem Bacchus, with whom we started working on this general area of research, and Moshe Vardi for useful comments on a previous draft of this paper. A preliminary version of this paper appeared in Proc. 7th IEEE Symposium on Logic in Computer Science. Some of this research was performed while Adam Grove and Daphne Koller were at Stanford University and at the IBM Almaden Research Center.
This research was sponsored in part by the Air Force Office of Scientific Research (AFSC), under Contract F49620-91-C-0080, by an IBM Graduate Fellowship, and by a University of California President's Postdoctoral Fellowship.

References

Bacchus, F. (1990). Representing and Reasoning with Probabilistic Knowledge. MIT Press, Cambridge, Mass.

Bacchus, F., Grove, A. J., Halpern, J. Y., & Koller, D. (1994). From statistical knowledge bases to degrees of belief. Tech. rep. 9855, IBM. Available by anonymous ftp from logos.uwaterloo.ca/pub/bacchus or via WWW at http://logos.uwaterloo.ca. A preliminary version of this work appeared in Proc. Thirteenth International Joint Conference on Artificial Intelligence (IJCAI '93), 1993, pages 563–569.

Bochnak, J., Coste, M., & Roy, M. (1987). Géométrie Algébrique Réelle, Vol. 12 of A Series of Modern Surveys in Mathematics. Springer-Verlag, Berlin Heidelberg.

Carnap, R. (1950). Logical Foundations of Probability. University of Chicago Press, Chicago.

Carnap, R. (1952). The Continuum of Inductive Methods. University of Chicago Press, Chicago.

Cheeseman, P. C. (1983). A method of computing generalized Bayesian probability values for expert systems. In Proc. Eighth International Joint Conference on Artificial Intelligence (IJCAI '83), pp. 198–202.

Denbigh, K. G., & Denbigh, J. S. (1985). Entropy in Relation to Incomplete Knowledge. Cambridge University Press, Cambridge, U.K.

Fagin, R. (1976). Probabilities on finite models. Journal of Symbolic Logic, 41(1), 50–58.

Fagin, R., Halpern, J. Y., & Megiddo, N. (1990). A logic for reasoning about probabilities. Information and Computation, 87(1/2), 78–128.

Geffner, H., & Pearl, J. (1990). A framework for reasoning with defaults. In Kyburg, Jr., H. E., Loui, R., & Carlson, G. (Eds.), Knowledge Representation and Defeasible Reasoning, pp. 245–26. Kluwer Academic Press, Dordrecht, Netherlands.

Ginsberg, M. L. (Ed.). (1987). Readings in Nonmonotonic Reasoning.
Morgan Kaufmann, San Mateo, Calif.

Glebskiĭ, Y. V., Kogan, D. I., Liogon'kiĭ, M. I., & Talanov, V. A. (1969). Range and degree of realizability of formulas in the restricted predicate calculus. Kibernetika, 2, 17–28.

Goldman, S. A. (1987). Efficient methods for calculating maximum entropy distributions. Master's thesis, MIT EECS Department.

Goldszmidt, M., Morris, P., & Pearl, J. (1990). A maximum entropy approach to nonmonotonic reasoning. In Proc. National Conference on Artificial Intelligence (AAAI '90), pp. 646–652.

Goldszmidt, M., Morris, P., & Pearl, J. (1993). A maximum entropy approach to nonmonotonic reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(3), 220–232.

Grandjean, E. (1983). Complexity of the first-order theory of almost all structures. Information and Control, 52, 180–204.

Grove, A. J., Halpern, J. Y., & Koller, D. (1992). Random worlds and maximum entropy. In Proc. 7th IEEE Symp. on Logic in Computer Science, pp. 22–33.

Grove, A. J., Halpern, J. Y., & Koller, D. (1993a). Asymptotic conditional probabilities: the non-unary case. Research report RJ 9564, IBM.

Grove, A. J., Halpern, J. Y., & Koller, D. (1993b). Asymptotic conditional probabilities: the unary case. Research report RJ 9563, IBM. To appear in SIAM Journal on Computing.

Halpern, J. Y. (1990). An analysis of first-order logics of probability. Artificial Intelligence, 46, 311–350.

Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620–630.

Jaynes, E. T. (1978). Where do we stand on maximum entropy? In Levine, R. D., & Tribus, M. (Eds.), The Maximum Entropy Formalism, pp. 15–118. MIT Press, Cambridge, Mass.

Jaynes, E. T. (1982). On the rationale of maximum-entropy methods. Proc. IEEE, 70(9), 939–952.

Jaynes, E. T. (1983). Concentration of distributions at entropy maxima. In Rosenkrantz, R. D. (Ed.), E. T. Jaynes: Papers on Probability, Statistics, and Statistical Physics, pp. 315–336.
Kluwer, Dordrecht, Netherlands.

Keynes, J. M. (1921). A Treatise on Probability. Macmillan, London.

Koller, D., & Halpern, J. Y. (1992). A logic for approximate reasoning. In Nebel, B., Rich, C., & Swartout, W. (Eds.), Proc. Third International Conference on Principles of Knowledge Representation and Reasoning (KR '92), pp. 153–164. Morgan Kaufmann, San Mateo, Calif.

Kyburg, Jr., H. E. (1983). The reference class. Philosophy of Science, 50(3), 374–397.

Laplace, P. S. de (1820). Essai Philosophique sur les Probabilités. English translation: Philosophical Essay on Probabilities, Dover Publications, New York, 1951.

Luce, R. D., & Raiffa, H. (1957). Games and Decisions. Wiley, New York.

Nilsson, N. (1986). Probabilistic logic. Artificial Intelligence, 28, 71–87.

Paris, J. B., & Vencovská, A. (1989). On the applicability of maximum entropy to inexact reasoning. International Journal of Approximate Reasoning, 3, 1–34.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, Calif.

Pearl, J. (1989). Probabilistic semantics for nonmonotonic reasoning: A survey. In Brachman, R. J., Levesque, H. J., & Reiter, R. (Eds.), Proc. First International Conference on Principles of Knowledge Representation and Reasoning (KR '89), pp. 505–516. Reprinted in Readings in Uncertain Reasoning, G. Shafer and J. Pearl (Eds.), Morgan Kaufmann, San Francisco, Calif., 1990, pp. 699–710.

Pollock, J. L. (1984). Foundations for direct inference. Theory and Decision, 17, 221–256.

Reichenbach, H. (1949). Theory of Probability. University of California Press, Berkeley.

Shannon, C., & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press.

Shastri, L. (1989). Default reasoning in semantic networks: a formalization of recognition and inheritance. Artificial Intelligence, 39(3), 285–355.

Spiegelhalter, D. J. (1986). Probabilistic reasoning in predictive expert systems. In Kanal, L. N., & Lemmer, J. F. (Eds.), Uncertainty in Artificial Intelligence, pp. 47–68. North-Holland, Amsterdam.

Tarski, A. (1951). A Decision Method for Elementary Algebra and Geometry (2nd edition). Univ. of California Press.