Journal of Artificial Intelligence Research 10 (1999) 375-397. Submitted 2/99; published 6/99.

Efficient Heuristic Hypothesis Ranking

Steve Chien, Andre Stechert, Darren Mutz
Jet Propulsion Laboratory, California Institute of Technology
4800 Oak Grove Drive, M/S 126-347, Pasadena, CA 91109-8099

Abstract

This paper considers the problem of learning the ranking of a set of stochastic alternatives based upon incomplete information (i.e., a limited number of samples). We describe a system that, at each decision cycle, outputs either a complete ordering on the hypotheses or decides to gather additional information (i.e., observations) at some cost. The ranking problem is a generalization of the previously studied hypothesis selection problem: in selection, an algorithm must select the single best hypothesis, while in ranking, an algorithm must order all the hypotheses. The central problem we address is achieving the desired ranking quality while minimizing the cost of acquiring additional samples. We describe two algorithms for hypothesis ranking and their application for the probably approximately correct (PAC) and expected loss (EL) learning criteria. Empirical results are provided to demonstrate the effectiveness of these ranking procedures on both synthetic and real-world datasets.

1. Introduction

In many applications, the cost of information can be quite high, imposing a requirement that learning algorithms glean as much usable information as possible with a minimum of data. For example:

- Data may be scarce, making it essential to learn as much as possible from limited training data.

- In speedup learning, minimizing processing time is critical. Here, reducing the number of necessary training examples is key, since the expense of processing each example can be significant (Tadepalli, 1992).
- In decision tree learning, evaluating potential attributes for partitioning using all available training examples can be computationally expensive (Musick, Catlett, & Russell, 1993).

- In evaluating medical treatment policies, acquiring additional training examples might imply that human subjects are exposed to an experimental treatment for a longer period than is necessary.

When one wishes some sort of guarantee on the quality of a solution, a statistical decision theoretic framework is useful. The framework answers the questions: How much information is enough? At what point do we have adequate information to rank the alternatives with some requested confidence?

© 1999 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

This paper focuses on parametric ranking problems, a general class of statistical machine learning problems in which the goal is to rank a set of alternative hypotheses, where the goodness of a hypothesis is a function of a set of parameters whose values are unknown (e.g., Chien, Stechert, & Mutz, 1998; Gratch, 1992; Greiner & Jurisica, 1992; Kaelbling, 1993; Moore & Lee, 1994; Musick et al., 1993). The learning system determines and refines estimates of these parameters by using training examples, with a secondary goal of minimizing learning cost. The principal contributions of this paper are:

- We define two families of hypothesis ranking algorithms, based on recursive selection and adjacency, respectively. We provide specific details on how to apply them using the probably approximately correct (PAC) and expected loss (EL) decision criteria.

- We provide empirical results demonstrating the effectiveness of these algorithms at achieving the requested decision criteria on synthetic data.
- We provide empirical results showing that these algorithms significantly outperform existing statistical methods on real-world data from spacecraft design optimization and image compression applications.

The remainder of this paper is structured as follows. First, we describe the hypothesis ranking problem more formally, including definitions for the probably approximately correct (PAC) and expected loss (EL) decision criteria. We then define two algorithms for establishing these criteria for the hypothesis ranking problem: a recursive hypothesis selection algorithm and an adjacent comparison algorithm. Next, we describe empirical tests demonstrating the effectiveness of these algorithms, as well as documenting their improved performance over a standard algorithm from the statistical ranking literature. Finally, we describe related work and future extensions to the algorithms.

2. Hypothesis Ranking Problems

Hypothesis ranking problems are an abstract class of learning problems in which an algorithm is given a set of hypotheses to rank. The desired ranking is the one that orders the hypotheses by their expected utility, which is determined by each hypothesis' underlying probability distribution. These expected utilities are unknown to the algorithm and must be estimated from the training data. Hypothesis ranking problems are an extension of hypothesis selection problems (Chien, Gratch, & Burl, 1995), in which a learning system attempts to select the best alternative from a set of hypotheses. The distinction between hypothesis ranking and hypothesis selection is that in selection the learning algorithm is interested in a single best hypothesis, while in ranking the learning algorithm must determine the relative order of all of the hypotheses.¹

Hypothesis selection and ranking is an important aspect of many machine learning problems. For example, the utility problem in speedup learning can be viewed as a selection problem in which a single problem-solving heuristic or strategy is chosen from a larger set of candidates. In this case, the expected utility is typically defined as the average time to solve a problem (Gratch, 1992; Greiner & Jurisica, 1992; Minton, 1988). The attribute selection problem in machine learning can also be viewed as a hypothesis selection problem in which one must select the best attribute split from a set of possible attribute splits, where utility is often measured by information gain (Musick et al., 1993). In reinforcement learning, a system must learn the appropriate action for each context, where utility is interpreted as expected reward (Kaelbling, 1993).²

A key observation regarding each of these problems (and all learning problems, in general) is that each of them could be viewed as an optimization problem, where the utility is the function being optimized. Then the application of traditional (or non-traditional) optimization methods will yield good results, within the guarantees provided by the algorithm and depending on the features of the landscape being optimized. However, with the addition of a model of sampling cost, a new degree of freedom is added to the problem. Where the cost of samples is very high, traditional optimization algorithms will fare poorly. Additionally, while in many of the mentioned applications the system chooses a single alternative and never revisits the decision, there are also many cases in which a system will want to investigate several prioritized options (either serially or in parallel), and hence a ranking is useful.

1. The algorithms and results described in this paper extend in a straightforward fashion to hybrid ranking-selection problems in which the system must select and rank the top M out of N hypotheses.
Motivation is provided by the following scenarios:

- Upper and lower bounds, span: Minimax search algorithms can use metaknowledge (such as upper and lower bounds of a node) for pruning other parts of the tree. Also, there are times when knowing the span of the expected utilities of the candidate set is useful (e.g., when checking for convergence conditions in an adaptive algorithm such as a GA).

- Augmenting external knowledge: Another area in which hypothesis ranking may have important applications is hypothesis selection with human supervision. When the stochastic objective function (i.e., the hypothesis) represents only a part of the problem, the ranking can be used to augment external knowledge of the problem. For example, engineering simulations usually capture the physical properties of the candidate designs, but usually forego the details of manufacturing, logistics, and economics.

- The entire ranking: In some cases, the entire ranking is significant. For instance, in evolutionary algorithms, the individuals to be propagated to future generations are often selected with likelihood proportionate to their rank in the current generation (Goldberg, 1989). Another example arises in the case of search algorithms that take advantage of node ordering heuristics, such as beam search or iterative broadening (Ginsberg & Harvey, 1992).

In any hypothesis evaluation problem, always achieving a correct ranking is impossible in practice, because the exact underlying probability distributions are unknown. Thus, there is always a (perhaps vanishingly) small chance that the algorithms will be unlucky, because only a finite number of samples can be taken.

2. Note that the analogous reinforcement learning problem is the one in which we are learning the appropriate action with immediate feedback rather than delayed feedback.
Consequently, rather than always requiring an algorithm to output a correct ranking, we impose probabilistic criteria on the rankings to be produced. While several families of such requirements exist, in this paper we examine two criteria: the probably approximately correct (PAC) model for selecting a hypothesis function that approximates a target function well (Valiant, 1984), and the expected loss (EL) requirement frequently used in decision theory and gaming problems (Russell & Wefald, 1992). Informally, to satisfy the PAC requirement, an algorithm must produce a result that with high probability is close to correct (e.g., incorrect orderings will be most likely to occur between hypotheses with similar expected utilities). To satisfy the EL requirement, on the other hand, a bound must be established on the expected loss of the result, where loss is the difference in utilities between two incorrectly ordered hypotheses in an incorrect ranking.

The expected utility of a hypothesis can be estimated by observing its values over a finite set of training examples. However, to satisfy the decision criteria, an algorithm must also be able to reason about the potential difference between the estimated and true utilities of each hypothesis. Let $U_i$ denote the true expected utility of hypothesis $i$ and let $\hat{U}_i$ be the estimated expected utility of hypothesis $i$. Without loss of generality, let us presume that the proposed ranking of hypotheses is $U_1 > U_2 > \cdots > U_{k-1} > U_k$. The PAC requirement states that, for some user-specified $\epsilon$, with probability $1 - \delta$:

    \bigwedge_{i=1}^{k-1} \left[ (U_i + \epsilon) > \max(U_{i+1}, \ldots, U_k) \right]    (1)

In the context of the PAC criterion, the number $\epsilon$ is called the indifference interval and $\delta$ is the overall ranking error or total error rate.³ The issue of how to allocate the overall ranking error among the many possible pairwise comparisons of hypotheses is discussed in the next section.
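Stated procedurally, a proposed ranking (listed best-first, with true utilities substituted in) satisfies Equation 1 exactly when every hypothesis, padded by $\epsilon$, beats the best of everything ranked below it. A minimal sketch, with a function name of our own choosing:

```python
def pac_valid(ranked_utils, eps):
    """Check Equation 1: for each position i in the proposed ranking,
    (U_i + eps) must exceed the largest true utility ranked below it."""
    k = len(ranked_utils)
    return all(ranked_utils[i] + eps > max(ranked_utils[i + 1:])
               for i in range(k - 1))
```

On the paper's worked example ($U_1 = 1.0$, $U_2 = 0.95$, $U_3 = 0.86$, ranked $U_2 > U_1 > U_3$), this accepts $\epsilon = 0.06$ and rejects $\epsilon = 0.01$.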
Correspondingly, when selecting a hypothesis $H_1$ as the best from a set of $k$ hypotheses $H_1, \ldots, H_k$, let the selection loss $L$ be as follows:

    L(H_1, \{H_1, \ldots, H_k\}) = \max(0, \max(U_2, \ldots, U_k) - U_1)    (2)

Then the ranking loss $RL$ of a ranking $H_1, \ldots, H_k$ is:

    RL(H_1, \ldots, H_k) = \sum_{i=1}^{k-1} L(H_i, \{H_{i+1}, \ldots, H_k\})    (3)

A hypothesis ranking algorithm that obeys the expected loss requirement must produce rankings that on average have less ranking loss than the requested expected loss bound. The policy for loss allocation is also discussed in the next section. As an example, consider ranking the hypotheses with expected utilities $U_1 = 1.0$, $U_2 = 0.95$, $U_3 = 0.86$. The ranking $U_2 > U_1 > U_3$ is a valid PAC ranking for the indifference interval $\epsilon = 0.06$ but not for $\epsilon = 0.01$, and the observed ranking loss is $0.05 + 0 = 0.05$.

3. The distinction between the true means and the estimated means (for which we use the sample means) is a confusing one. When assessing the validity of a ranking produced by an algorithm, one would use the true means of the distributions (if available, as in test distributions) or the most accurate estimation possible (such as from an extremely large sampling of the distribution). However, a ranking algorithm uses the estimated parameters (including the sample mean) to estimate the error. For estimation of a single mean, the estimate of the mean is normally distributed around the true mean, so this usage is justified. However, we have not proven (and indeed are unsure) whether using the estimate in more complex ranking and selection contexts is guaranteed correct (see the later section on the heuristic nature of our algorithms).
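Equations 2 and 3 can be sketched directly; the function names are ours, and the assertion reproduces the worked example's ranking loss of 0.05:

```python
def selection_loss(utils):
    """Equation 2: loss of declaring utils[0] the best of the set."""
    return max(0.0, max(utils[1:]) - utils[0])

def ranking_loss(ranked_utils):
    """Equation 3: sum of selection losses over the successive tails
    of the proposed ranking (listed best-first)."""
    k = len(ranked_utils)
    return sum(selection_loss(ranked_utils[i:]) for i in range(k - 1))
```

For the ranking $U_2 > U_1 > U_3$ with utilities (0.95, 1.0, 0.86), only the first position incurs loss (1.0 − 0.95), giving a total of 0.05.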
However, while the confidence in a pairwise comparison between two hypotheses is well understood to be the complement of the probability that the comparison's result is in error, it is less clear how to define and ensure that a desired confidence is met in the set of comparisons required for a selection, or in the even more complex set of comparisons required for a ranking. Equation 4 defines the confidence that $U_i + \epsilon > U_j$ when the utilities are normally distributed with unknown and unequal variances:

    \gamma = \Phi\left( \frac{(\hat{U}_{i-j} + \epsilon)\sqrt{n}}{\hat{S}_{i-j}} \right)    (4)

where $\Phi$ represents the cumulative standard normal distribution function, and $n$, $\hat{U}_{i-j}$, and $\hat{S}_{i-j}$ are the size, sample mean, and sample standard deviation of the blocked differential distribution,⁴ respectively. Likewise, computation of the expected loss for asserting an ordering between a pair of hypotheses is well understood, but the estimation of expected loss for an entire ranking is less clear. Equation 5 defines the expected loss for drawing the conclusion $U_i > U_j$, again under the assumption of normality (see Chien et al., 1995, for further details):

    EL[U_i > U_j] = \frac{\hat{S}_{i-j}\, e^{-0.5 n (\hat{U}_{i-j}/\hat{S}_{i-j})^2}}{\sqrt{2\pi n}} + \frac{\hat{U}_{i-j}}{\sqrt{2\pi}} \int_{-\hat{U}_{i-j}\sqrt{n}/\hat{S}_{i-j}}^{\infty} e^{-0.5 z^2}\, dz    (5)

In the next two subsections, we describe two interpretations for estimating the likelihood that an overall ranking satisfies the PAC or EL requirements by estimating and combining pairwise PAC errors or EL estimates. Each of these interpretations lends itself directly to an algorithmic implementation, as described below.

2.1 Ranking as Recursive Selection

One obvious way to determine a ranking $H_1, \ldots, H_k$ is to view ranking as recursive selection from the set of remaining candidate hypotheses. In this view, the overall ranking error, as specified by the desired confidence in PAC algorithms and the loss threshold in EL algorithms, is first distributed among $k-1$ selection errors, which are then further subdivided into pairwise comparison errors (Figure 1). Data is then sampled until the estimates of the pairwise comparison error (as dictated by Equation 4 or 5) satisfy the bounds set by the algorithm.

Figure 1: Computing the overall error of a recursive ranking. The per-comparison errors are summed at each level in the recursion, and the overall sum (across all levels) is compared with the specified total error, $\gamma^*$.

Thus, another degree of freedom in the design of recursive ranking algorithms is the method by which the overall ranking error is ultimately distributed among individual pairwise comparisons between hypotheses. Two factors influence the way in which we compute the error distribution. First, our model of error combination determines how the error allocated for individual comparisons or selections combines into overall ranking error, and therefore how many candidates are available for the distribution of error.

4. Note that in our approach we block, or match, examples to further reduce sampling complexity. Blocking makes estimates by using the difference in utility between competing hypotheses on each observed example. Blocking can significantly reduce the variance in the data when the hypotheses are not independent. The differential distribution is formed by taking the differences of the blocked individual samples to form a new distribution. It is trivial to modify the formulas to address the cases in which it is not possible to block data (see Moore & Lee, 1994; Chien et al., 1995, for further details).
Using Bonferroni's inequality, which asserts that the probability of a union of events is no greater than the sum of the probabilities of the individual events,⁵ one would be inclined to combine the errors additively. However, following a more conservative approach, one can assert that, because the predicted "best" hypothesis may change during sampling, in the worst case the conclusion might depend on all possible pairwise comparisons, and that the error should therefore be distributed among all $\binom{n}{2}$ pairs of hypotheses.⁶

5. Note that this is only the simplest of the Bonferroni inequalities, which fall into clean correspondence with the terms of the expansion of the probability of a union of events according to the principle of inclusion and exclusion in a natural way.

6. For a discussion of this issue, see pp. 18-20 of (Gratch, 1993).

Second, our policy with respect to allocation of error among the candidate comparisons or selections determines how samples will be distributed. For example, in some contexts, the consequences of early selections far outweigh those of later selections. For these scenarios, we have implemented ranking algorithms that divide overall ranking error unequally in favor of earlier selections.⁷ Also, it is possible to divide selection error into pairwise error unequally, based on estimates of hypothesis parameters, in order to reduce sampling cost (for example, Gratch, Chien, & DeJong, 1994, allocate error rationally). Within the scope of this paper, we only consider algorithms that: (i) combine pairwise error into selection error additively, (ii) combine selection error into overall ranking error additively, and (iii) allocate error equally at each level.

One disadvantage of recursive selection is that once a hypothesis has been selected, it is removed from the pool of candidate hypotheses.
This is an issue in rare cases when, while sampling to increase the confidence of some later selection, the estimate of a hypothesis' mean changes enough that some previously selected hypothesis no longer dominates it. However, it remains that the original hypotheses were shown to dominate the others with a specified level of certainty, $\gamma^*$. These assumptions result in the following formulations, where $\delta(U_1 \succ_\epsilon \{U_2, \ldots, U_k\})$ denotes the error due to the action of selecting hypothesis 1 under Equation 1 from the set $\{H_1, \ldots, H_k\}$, and $EL(U_1 \succ \{U_2, \ldots, U_k\})$ denotes the error due to selection loss in situations where Equation 2 applies:

    \delta_{rec}(U_1 > U_2 > \cdots > U_k) = \delta_{rec}(U_2 > U_3 > \cdots > U_k) + \delta(U_1 \succ_\epsilon \{U_2, \ldots, U_k\})    (6)

where $\delta_{rec}(U_k) = 0$ (the base case for the recursion) and the selection error is as defined in (Chien et al., 1995):

    \delta(U_1 \succ_\epsilon \{U_2, \ldots, U_k\}) = \sum_{i=2}^{k} \delta_{1,i}    (7)

using Equation 4 to compute pairwise confidence. Algorithmically, we implement this with the following pseudo-code:

    ensure there are n0 samples per hypothesis
    distribute the error to individual selections
    while (stopping criterion has not been met)
        take more samples
        if (means are ordered differently than ranking)
            restart the algorithm

An analogous recursive selection algorithm based on expected loss is defined as follows:

    EL_{rec}(U_1 > U_2 > \cdots > U_k) = EL_{rec}(U_2 > U_3 > \cdots > U_k) + EL(U_1 \succ \{U_2, \ldots, U_k\})    (8)

where $EL_{rec}(U_k) = 0$ and the selection EL is as defined in (Chien et al., 1995):

    EL(U_1 \succ \{U_2, \ldots, U_k\}) = \sum_{i=2}^{k} EL(U_1, U_i)    (9)

7. Space constraints preclude their description here.

Figure 2: Computing the overall error in an adjacent ranking. Per-comparison errors between neighboring hypotheses in the proposed ranking are summed and compared with the required total error, $\gamma^*$.
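The building blocks of the recursive scheme can be sketched as follows, under the paper's normality assumption. The function names are ours; the pairwise confidence is Equation 4 (with the standard normal CDF written via the error function), and the recursive combination of Equations 6-7 amounts to summing the pairwise errors of each selection level:

```python
import math

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pairwise_confidence(u_diff, s_diff, n, eps):
    """Equation 4: confidence that U_i + eps > U_j, from the sample mean
    u_diff and sample standard deviation s_diff of the blocked
    differential distribution over n samples."""
    return phi((u_diff + eps) * math.sqrt(n) / s_diff)

def recursive_ranking_error(pairwise_delta):
    """Equations 6 and 7: each selection level's error is the sum of the
    selected hypothesis' pairwise errors against every remaining
    hypothesis; the overall ranking error sums the levels.
    pairwise_delta[i][j] (i < j) is the pairwise error delta_{i,j}."""
    k = len(pairwise_delta)
    return sum(pairwise_delta[i][j]
               for i in range(k - 1)
               for j in range(i + 1, k))
```

Sampling would continue until `recursive_ranking_error` falls below the allocated total error, restarting if the sample means reorder, as in the pseudo-code above.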
2.2 Ranking by Adjacency Comparison

Another interpretation of ranking confidence (or loss) is that only adjacent elements in the ranking need be compared. In this case, the overall ranking error is divided directly into $k-1$ pairwise comparison errors (Figure 2). This leads to the following confidence equation for the PAC criterion:

    \delta_{adj}(U_1 > U_2 > \cdots > U_k) = \sum_{i=1}^{k-1} \delta_{i,i+1}    (10)

And the following equation for the EL criterion:

    EL_{adj}(U_1 > U_2 > \cdots > U_k) = \sum_{i=1}^{k-1} EL(U_i, U_{i+1})    (11)

Because ranking by comparison of adjacent hypotheses does not establish dominance or loss bounds between non-adjacent hypotheses (where the hypotheses are ordered by observed mean utility), it has the advantage of requiring fewer comparisons than recursive selection (and thus may require fewer samples than recursive selection). However, for the same reason, adjacency algorithms may be less likely than the recursive selection algorithms to correctly bound the probability of a correct ranking (or the average loss). In the case of the PAC algorithms, this is because $\epsilon$-dominance is not necessarily transitive. In the case of the EL algorithms, it is because expected loss is not necessarily additive when considering two hypothesis comparisons sharing a common hypothesis.⁸
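The non-additivity caveat of footnote 8 can be reproduced numerically. The sketch below folds Equation 5's integral into the standard normal CDF (the function names are ours), then combines adjacent expected losses per Equation 11 for the footnote's three-hypothesis configuration:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pairwise_expected_loss(u_diff, s_diff, n):
    """Equation 5 in closed form: the integral term equals
    sqrt(2*pi) * Phi(u_diff * sqrt(n) / s_diff)."""
    return (s_diff * math.exp(-0.5 * n * (u_diff / s_diff) ** 2)
            / math.sqrt(2.0 * math.pi * n)
            + u_diff * phi(u_diff * math.sqrt(n) / s_diff))

def adjacency_el(adjacent_stats):
    """Equation 11: sum expected losses of adjacent comparisons, given
    (mean, std, n) of each adjacent blocked differential distribution."""
    return sum(pairwise_expected_loss(u, s, n) for (u, s, n) in adjacent_stats)

# Footnote 8's configuration: both adjacent differential distributions
# have mean 2, std 2, n = 2; the non-adjacent H1-H3 pair has mean 4, std 8.
el_adjacent = adjacency_el([(2, 2, 2), (2, 2, 2)])  # about 4.10
el_nonadjacent = pairwise_expected_loss(4, 8, 2)    # about 4.80
```

The adjacent sum (about 4.10) understates the non-adjacent expected loss (about 4.80), matching the footnote's numbers.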
2.3 The Heuristic Nature of the Algorithms

Both the recursive selection and adjacency algorithms are heuristic in the sense that they are not proven to statistically meet the specified decision criteria (i.e., for the PAC criterion, to select a ranking that satisfies Equation 1 with probability $1 - \delta$; and similarly for the EL criterion, to average a ranking loss, as specified by Equation 3, less than the requested bound). Indeed, several aspects of these algorithms make it extremely difficult to prove that they would (probabilistically) achieve the corresponding decision criteria. These aspects include:

- Sharing of samples: In order to have $n_1$ samples for a differential distribution (i.e., blocking) for $H_1$ and $H_2$, it takes $n_1$ samples of $H_1$ and $n_1$ samples on the same problems for $H_2$. Our algorithms further reduce the sampling cost by reusing these samples in differential distributions comparing $H_1$ to other hypotheses and $H_2$ to other hypotheses. This makes the errors derived from these samples not independent. Hence, we have traded accuracy and ease of analysis of the algorithms for heuristic efficiency. Particularly in the recursive selection approach, samples for the lowest-ranking hypothesis will have been used in $k-1$ differential comparisons.

- Heuristic error combination: Both the recursive selection and adjacency error combination models are heuristic means of combining pairwise errors. This is because the pairwise errors are not independent (see above). Empirically, we have observed that the pairwise errors tend to be overestimated but the error combination function tends to under-combine. Overall, the combined error estimates tend to be reasonably accurate empirically, as the remaining sections show.

- Ignorance of lead switches and multiple comparison paths: During the sampling process, the ordering of the hypotheses may change (e.g., the ordering of sample means may change). This means that, implicitly, the decision depended on an additional pairwise comparison that may not be reflected in the final set of comparisons contributing a pairwise error. This complexity could be avoided by fixing the order of the hypotheses after $n_0$ samples. However, this would require more samples, as it would involve showing $\epsilon$-dominance of a hypothesis over a hypothesis with a higher sample mean (indeed, it may never converge). We choose to ignore this complexity and base the combined error used in the stopping condition on the final ordering.

- Use on non-normal distributions: In many of the applications described in the remainder of this article, the real-world data is distributed in a manner not very similar to normal distributions (we investigate this issue further later in the article). The algorithms we describe are heuristic in that they presume that the data is normally distributed even though this is not the case.

8. An example where ranking loss between non-adjacent hypotheses exceeds the desired loss bound for the ranking, even though the sum of the adjacent losses does not, occurs when the blocked differential distribution induced by two non-adjacent hypotheses has high variance relative to a hypothesis adjacent to both (i.e., currently ranked between them). The variance of the differential distribution makes its maximum contribution when the sample set is small, so, e.g., with $\mu_{1-2} = 2$, $\sigma_{1-2} = 2$, $n_{1-2} = 2$, $\mu_{2-3} = 2$, $\sigma_{2-3} = 2$, and $n_{2-3} = 2$, there exists a configuration for which $\mu_{1-3} = 4$, $\sigma_{1-3} = 8$. The expected losses are $EL(H_1, H_2) = 2.05$ and $EL(H_2, H_3) = 2.05$, but $EL(H_1, H_3) = 4.80 > 4.10$.

2.4 Other Relevant Approaches

Most standard statistical ranking/selection approaches make strong assumptions about the form of the problem (e.g., the variances associated with the underlying utility distributions of the hypotheses might be assumed known and equal).
Among these, the method of Turnbull and Weiss (1984) is most comparable to our PAC-based approach.⁹ Turnbull and Weiss' algorithm is a sequential interval-based procedure for selecting the member of a population with the largest mean. They treat hypotheses as normally distributed random variables of unknown mean that have unknown and possibly unequal variance. Their algorithm also carries the additional stipulation that the hypotheses be independent. The procedure consists of taking an initial sample of $n_0$ observations on each of the hypotheses and then taking samples sequentially according to their stopping criterion. When the stopping criterion has been satisfied, the hypothesis with the highest sample mean is chosen. The stopping criterion is that the inequality $S_i^2 / n_i \le 1/n^*$ is satisfied, where $S_i$ and $n_i$ are the sample standard deviation and the number of samples of the $i$th hypothesis, and $n^*$ is chosen according to the indifference interval $\epsilon$ and the confidence level $\gamma^*$. In particular, $n^* = d^2/\epsilon^2$, where $d$ is chosen to satisfy $\int_{-\infty}^{\infty} (F(y+d))^{k-1} f(y)\, dy = \gamma^*$, with $F(y)$ and $f(y)$ the cumulative distribution function and probability density function of the standard normal distribution. While it is still reasonable to use this approach when the candidate hypotheses are not independent, excessive statistical error or unnecessarily large training set sizes may result. In the case that the hypotheses are truly independent, Turnbull and Weiss' technique should be able to exploit this knowledge and outperform our methods, which do not adopt this assumption.

3. Empirical Performance Evaluation

We now turn to an empirical evaluation of the hypothesis ranking techniques on both synthetic and real-world datasets. This evaluation serves three purposes. First, it demonstrates that the techniques perform as predicted (in terms of bounding the probability of incorrect selection or expected loss).
Second, it validates the performance of the techniques as compared to standard algorithms from the statistical literature. Third, the evaluation demonstrates the robustness of the new approaches on real-world hypothesis ranking problems.

An experimental trial consists of solving a hypothesis ranking problem with a given technique and a given set of problem and control parameters. We measure performance by (1) how well the algorithms satisfy their respective criteria, and (2) the number of samples taken or, alternatively, the cost (in seconds) of executing the algorithm. Since the performance of these statistical algorithms on any single trial provides little information about their overall behavior, each trial is repeated multiple times and the results are averaged across trials. Synthetic experimental trials were repeated 500 times, while trials on the real-world data were repeated 100 times. Because the PAC and expected loss criteria are not directly comparable, the approaches are analyzed separately.

9. PAC-based approaches have been investigated extensively in the statistical ranking and selection literature under the topic of confidence-interval-based algorithms (see Haseeb, 1985, for a review of the recent literature).

Figure 3: The stepped means hypothesis configuration.

3.1 Evaluation on Synthetic Datasets

Evaluation on synthetic data is used to show: (1) that the techniques correctly bound the probability of incorrect ranking and the expected loss as predicted when the underlying assumptions are valid, even when the underlying utility distributions are inherently hard to rank,¹⁰ and (2) that the PAC techniques compare favorably to the algorithm of Turnbull and Weiss in a wide variety of circumstances. For the synthetic datasets, the utility distributions of the hypotheses were modeled as random variables defined on some underlying parameterized distribution.
Thus, characterizing a ranking problem consists of choosing some number of hypotheses to rank and then assigning values for the parameters of the utility distribution of each hypothesis. In our case, we model the utilities as independent normal random variables with some mean and standard deviation. Thus, if we let $k$ be the number of hypotheses, then each hypothesis ranking problem is described by the $2k$ parameters specifying the expected utility and utility standard deviation of each hypothesis. In general, while several more parameters may be required to characterize a ranking problem fully,¹¹ the number of hypotheses and the choices for the parameters of the utility distributions underlying these hypotheses characterize the overall difficulty of the ranking problem.

The statistical ranking and selection community uses a standard family of selection problems with known difficulty to analyze the performance of hypothesis selection strategies. This method, called the least favorable configuration (LFC) of the population means, is the assignment of parameters to distributions that is most likely to cause a technique to choose a wrong hypothesis, and thus provides the most severe test of a technique's abilities. Under this configuration, all utilities are independent normally distributed variables of equal variance: $k-1$ of the hypotheses have utilities with equal expectation, $\mu$, and the remaining hypothesis has expected utility $\mu + \epsilon$. Because we are interested in hypothesis ranking problems rather than selection problems, we use a generalization of the LFC that we call stepped means. In this configuration, one of the hypotheses is assigned expected utility $\mu$, and successive hypotheses are assigned expected utility $\mu - i\epsilon$ for $i$ from $1, \ldots, k-1$ (Figure 3).
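A stepped means test problem is easy to generate. The sketch below uses illustrative parameter values and function names of our own; each blocked observation draws one sample from every hypothesis:

```python
import random

def stepped_means(k, mu, eps, sigma):
    """Stepped means configuration: one hypothesis with mean mu, then
    successive hypotheses at mu - i*eps for i = 1..k-1, common sigma.
    Returns a list of (mean, std) pairs."""
    return [(mu - i * eps, sigma) for i in range(k)]

def sample_block(config, rng):
    """Draw one blocked observation: one sample from each hypothesis."""
    return [rng.gauss(m, s) for (m, s) in config]

config = stepped_means(5, 0.0, 0.25, 1.0)   # means 0.0, -0.25, ..., -1.0
observation = sample_block(config, random.Random(0))
```

Increasing $k$ or the ratio $\sigma/\epsilon$ (here $1.0/0.25$) makes the resulting ranking problem harder, as discussed next.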
In general, problems based on the least favorable configuration become more difficult (i.e., require more samples) when the number of hypotheses $k$ increases, the common utility variance $\sigma^2$ increases, or the difference in the means of the utility distributions decreases. In the standard methodology, a technique is evaluated by its ability to achieve a confidence of correct selection $\gamma^*$ using several settings for $k$ and $\sigma/\epsilon$. This last ratio combines $\sigma$ and $\epsilon$ into a single quantity which, as it increases, makes the problem more difficult. This methodology extends directly to stepped means.

The hypothesis ranking strategies themselves have algorithm control parameters that govern how they attack a problem. The PAC techniques have three control parameters: an initial sample size $n_0$, a desired confidence of correct ranking $\gamma^*$, and an indifference setting $\epsilon$.¹² The expected loss techniques have two control parameters: an initial sample size $n_0$ and a loss threshold $H^*$.

The observed number of samples required and the achieved accuracy of the PAC techniques on the stepped means configuration are shown in Table 1. The results indicate that all systems are roughly comparable in the number of examples required to choose a hypothesis. As expected, the number of examples increases with $k$, $\gamma^*$, and $\sigma/\epsilon$. The $PAC_{adj}$ algorithm required the smallest number of samples but was inconsistent in meeting the desired accuracy bound (as indicated by its failure to meet the prescribed error bound in several cases).

10. Configurations that contain hypotheses with high variance relative to the separation between their means are more difficult to rank.

11. For instance, when samples are allocated rationally, as in (Chien et al., 1995), it becomes necessary to assign parameters to a cost distribution as well; or, if only a few of the candidate hypotheses were to be ranked, the number of hypotheses to rank would be another problem parameter.
It is interesting that the Turnbull and Weiss method did not significantly outperform the PAC techniques, despite the fact that it assumes the hypotheses are independent (as is the case in the stepped means configuration) while the PAC approaches make no such assumption. In this comparison the principal performance metric is the number of samples required to achieve the requested ranking; both methods were effective at achieving the requested accuracy.

In the expected loss experiments, we ran the expected loss hypothesis ranking algorithms on the same stepped means configurations described above with a range of expected loss bounds. Table 2 shows the results of this experiment, displaying the number of samples required to produce a ranking and the average observed loss for each configuration. These results show that the ELrec algorithm correctly bounded the loss, and that the ELadj algorithm required fewer samples than ELrec but did not correctly bound the expected loss (the observed loss was greater than the loss bound H*).¹³

3.2 Evaluation on Real Datasets

The test of real-world applicability is based on data drawn from several datasets relating to spacecraft design and the processing of science data gathered in the context of planetary exploration. The first two datasets we investigate relate to spacecraft design optimization problems in which the hypotheses we wish to rank are candidate solutions to the design problem. The third and last dataset we examine involves ranking various lossless image compression approaches based on their performance on a large set of terrestrial images collected by the Galileo spacecraft. Cost of evaluation is given in seconds for all empirical data because, unlike the synthetic problems, the cost of sampling a hypothesis is not constant in these domains.

12. Note that in our formulation of the stepped means test for the PAC approaches, ε is both the difference in the expected means of successive hypotheses and the indifference interval of the algorithm.
Thus, ε plays the roles of both problem parameter and control parameter here.
13. One confusing point is that, for identical hypothesis and ranking algorithm settings, one can observe a lower loss when ranking a larger number of hypotheses. This is because the algorithm first divides the loss over the number of pairwise comparisons; thus, for the same overall error (or expected loss) bound, the pairwise expected error (or loss) bound will be smaller when there are more hypotheses. The ranking loss is defined as previously. Thus, it is possible for the observed loss to increase or decrease compared to the same settings with fewer hypotheses.

 k    γ*    σ/ε   TURNBULL        PACrec          PACadj
 3   0.75    2       62 (0.88)       55 (0.95)       38 (0.78)
 3   0.75    3      117 (0.89)      101 (0.86)       49 (0.80)
 3   0.90    2       97 (0.96)       86 (0.94)       58 (0.92)
 3   0.90    3      183 (0.99)      152 (0.96)       96 (0.89)
 3   0.95    2      130 (0.97)      122 (0.97)       89 (0.97)
 3   0.95    3      231 (0.96)      204 (0.95)      146 (0.94)
 5   0.75    2      177 (0.83)      165 (0.95)      105 (0.87)
 5   0.75    3      321 (0.95)      314 (0.93)      161 (0.75)
 5   0.90    2      245 (0.98)      245 (0.97)      163 (0.91)
 5   0.90    3      445 (0.98)      409 (0.91)      290 (0.92)
 5   0.95    2      299 (0.98)      294 (0.98)      216 (1.00)
 5   0.95    3      541 (0.98)      538 (0.98)      377 (0.92)
10   0.75    2      558 (0.92)      624 (0.91)      345 (0.85)
10   0.75    3    1,015 (0.94)    1,042 (0.95)      635 (0.83)
10   0.90    2      700 (0.97)      742 (0.96)      523 (0.91)
10   0.90    3    1,254 (0.97)    1,359 (0.97)      883 (0.90)
10   0.95    2      821 (1.00)      877 (0.97)      661 (0.94)
10   0.95    3    1,462 (0.99)    1,569 (0.98)    1,164 (0.93)

Table 1: Estimated expected total number of observations by PAC algorithms in the stepped means configuration. Achieved probability of correct ranking is shown in parentheses.
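Footnote 13's point can be made concrete with a small sketch. We assume here an even split of the overall bound across the k − 1 adjacent pairwise comparisons; the exact allocation used by the algorithms may differ:

```python
def per_pair_bound(overall_bound, k):
    """Illustrative sketch (assumed even split): dividing the overall
    ranking bound across the k - 1 adjacent pairwise comparisons makes
    the per-pair bound shrink as the number of hypotheses k grows."""
    return overall_bound / (k - 1)
```

With the same overall bound, moving from 3 to 10 hypotheses shrinks the per-pair bound from one half to one ninth of the total, which is why the observed loss can drop even though more hypotheses are being ranked.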
      Parameters           ELrec                ELadj
 k    ε    H*       Samples    Loss      Samples    Loss
 3    2   1.0           96     0.6           43     1.2
 3    2   0.75         102     0.5           56     1.0
 3    2   0.5          139     0.2           73     0.6
 3    2   0.25         235     0.1          139     0.4
 5    2   1.0          320     0.7          140     1.3
 5    2   0.75         343     0.4          169     1.2
 5    2   0.5          464     0.4          247     0.7
 5    2   0.25         575     0.2          350     0.5
10    2   1.0        1,136     0.5          572     1.4
10    2   0.75       1,325     0.5          668     1.1
10    2   0.5        1,533     0.3          872     0.7
10    2   0.25       1,856     0.1        1,153     0.4

Table 2: Estimated expected total number of observations of the EL algorithms in the stepped means configuration, with the observed average loss of the produced rankings.

Table 3 gives a summary of the three ranking problems we considered.

Dataset                Fixed parameters       Random variables          Optimization criteria
DS-2 Penetrator        penetrator diameter,   impact orientation,       maximize penetration probability;
                       penetrator length      impact velocity,          maximize penetration depth
                                              soil density
DS-2 Aeroshell         fore body overlap,     stagnation pressure       minimize weight;
                       nose cone angle,       coefficient               achieve target entry velocity
                       bluntness ratio,
                       fillet radius,
                       outer diameter,
                       tail geometry
Lossless Image Comp.   compression method     randomly selected         maximize compression ratio
                                              test image

Table 3: Description of the datasets used for algorithm evaluation.

3.2.1 DS-2 Penetrator

The goal of the New Millennium Deep Space Two (DS-2) mission is to deliver a pair of microprobes to the planet Mars for scientific study of the Martian soil. The probes will be released from orbit, travel through the Martian atmosphere, and embed themselves in the soil near the southern polar ice cap. The primary science objectives for the mission are (Balacuit, 1997):

• to determine if ice is present below the surface of Mars,
• to measure the local atmospheric pressure, and
• to characterize the thermal properties of the Martian subsurface soil.
The goal of this spacecraft design problem is to determine a good set of physical dimensions for the penetrator: a small, robust probe designed to impact the surface at extremely high velocity and to operate in the extreme cold. Specifically, we use design and simulation data from the DS-2 penetrator design. For our casting of the design problem, we hold the shape of the penetrator constant and generate design candidates based on different values for the penetrator diameter and length. For a specific design, a sample is taken by drawing an impact orientation, impact velocity, and soil density from a parameterized multivariate distribution and then calling a complex physical simulation to determine whether, and to what depth, the penetrator bores into the Martian surface. The goal of the penetrator design problem is to determine the physical dimensions that maximize the probability of penetration and, in cases of penetration, maximize penetration depth.

Tables 4 and 5 show the results of applying the PAC-based, Turnbull, and expected loss algorithms to a ranking problem in which the system is requested to rank 10 penetrator designs.¹⁴ In this problem the utility function is the depth of penetration, with those cases in which the penetrator does not penetrate being assigned zero utility. As shown in Table 4, both PAC algorithms significantly outperformed the Turnbull algorithm, which is to be expected because the hypotheses are somewhat correlated (via impact orientations and soil densities) while the Turnbull algorithm assumes independence.

14. "True" expected utility values were computed by taking 20,000 samples and using the sample mean of this large sample as ground truth. These expected utilities were then used to compute PAC ε-validity of rankings and observed loss using the definitions provided earlier.
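The sampling procedure for a candidate design can be sketched as below. The distributions, parameter ranges, and the simulator stub are all invented placeholders; the real system draws conditions from a parameterized multivariate distribution and calls a complex physical simulation:

```python
import random

def sample_design_utility(diameter, length, rng, simulate):
    """One utility sample for a penetrator design: draw environment
    conditions, run the (caller-supplied) simulation, and assign zero
    utility when the penetrator fails to penetrate."""
    orientation = rng.uniform(0.0, 30.0)   # impact angle, degrees (illustrative range)
    velocity = rng.gauss(180.0, 20.0)      # m/s (illustrative)
    density = rng.gauss(1.5, 0.2)          # g/cm^3 (illustrative)
    depth = simulate(diameter, length, orientation, velocity, density)
    return depth if depth > 0 else 0.0

def toy_simulate(diameter, length, orientation, velocity, density):
    # Purely illustrative stand-in for the physical simulation: deeper
    # penetration for faster, longer, narrower probes in less dense soil.
    return (velocity * length) / (100.0 * density * diameter) - orientation
```

Ranking then proceeds by repeatedly calling such a sampler for each candidate design, with the zero-clamped depth serving as the utility.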
Table 5 shows that the ELrec expected loss algorithm effectively bounded the actual loss, but the ELadj algorithm was inconsistent.

 k    γ*    σ/ε   TURNBULL        PACrec         PACadj
10   0.75    2      534 (0.96)     144 (1.00)      92 (0.98)
10   0.90    2      667 (0.98)     160 (1.00)      98 (1.00)
10   0.95    2      793 (0.99)     177 (1.00)     103 (0.99)

Table 4: Estimated expected total number of observations to rank DS-2 spacecraft designs. Achieved probability of correct ranking is shown in parentheses.

      Parameters      ELrec                ELadj
 k    H*        Samples    Loss      Samples    Loss
10   0.10          152     0.05         77     0.14
10   0.05          200     0.03         90     0.06
10   0.02          378     0.03        139     0.03

Table 5: Estimated expected total number of observations and expected loss of an incorrect ranking of DS-2 penetrator designs.

3.2.2 DS-2 Aeroshell Design Ranking

The objective of this problem is to design an aeroshell for the soil penetrator described in the previous section that achieves the appropriate entry velocity with minimum weight. Design candidates are defined by six continuous variables that represent various geometric quantities: the extent to which the fore body overlaps the aft body, the nose cone angle, the bluntness ratio, the fillet radius, the outer diameter, and the tail geometry. Candidate designs (hypotheses) are evaluated by running a simple physical simulation of the aeroshell's behavior. A sample is taken by running the simulation with the fixed design variables of the hypothesis and a value for the stagnation pressure coefficient drawn from a normal distribution. The simulation computes values for the achieved entry velocity and the mass of the aeroshell; the utility to be maximized is then the weighted sum of the reciprocals of these values.

We give the results of ranking three, five, and ten hypotheses using the Turnbull, PAC, and expected loss algorithms in Tables 6 and 7.¹⁵ As in the previous experiment, the PAC-based algorithms outperformed the Turnbull algorithm in all cases. While the PACadj algorithm represents a significant increase in performance here, we note that it did not achieve the desired level of confidence in all cases; both the Turnbull and PACrec algorithms did achieve the required confidence.

15. Again, deep sampling (500 samples) was performed to obtain the "correct" ranking against which these algorithms are compared.

 k    γ*    σ/ε   TURNBULL        PACrec         PACadj
 3   0.75    2       8.9 (1.00)     8.4 (1.00)     3.5 (1.00)
 3   0.75    3      22.9 (1.00)    11.3 (1.00)     3.8 (1.00)
 3   0.90    2      17.1 (1.00)    14.0 (1.00)     7.1 (1.00)
 3   0.90    3      38.2 (1.00)    18.6 (1.00)     7.2 (1.00)
 3   0.95    2      22.6 (1.00)    21.6 (1.00)     7.1 (1.00)
 3   0.95    3      52.0 (1.00)    32.1 (1.00)     7.3 (1.00)
 5   0.75    2      29.1 (0.92)    20.1 (0.94)    11.8 (0.91)
 5   0.75    3      69.0 (1.00)    33.9 (0.96)    11.7 (0.91)
 5   0.90    2      42.4 (0.99)    30.0 (0.93)    11.7 (0.91)
 5   0.90    3      94.7 (0.99)    54.8 (0.96)    11.8 (0.84)
 5   0.95    2      51.9 (0.98)    43.6 (1.00)    11.7 (0.91)
 5   0.95    3     117.9 (0.99)    81.5 (0.99)    11.5 (0.92)
10   0.75    2      84.0 (0.99)    42.0 (0.94)    22.1 (0.92)
10   0.75    3     196.6 (1.00)    57.9 (0.96)    22.1 (0.90)
10   0.90    2     112.0 (0.98)    53.8 (0.98)    22.6 (0.89)
10   0.90    3     252.9 (0.99)    85.5 (1.00)    21.6 (0.91)
10   0.95    2     129.1 (1.00)    61.3 (0.97)    20.6 (0.90)
10   0.95    3     315.7 (1.00)   125.7 (1.00)    20.4 (0.92)

Table 6: Estimated expected cost (in seconds) to rank aeroshell designs. Achieved probability of correct ranking is shown in parentheses.

      Parameters      ELrec                        ELadj
 k    H*        Execution Cost    Loss       Execution Cost    Loss
 3    20             9.5          4.3             7.9          3.4
 3    30             7.6          3.4             7.3          3.7
 3    40             7.3          4.1             6.9          2.7
 5    20            21.7          7.0             7.2          8.6
 5    30            18.1         12.0             6.4         12.4
 5    40            15.0          9.3            10.5          8.5
10    20            55.3          9.7            18.3          7.9
10    30            42.6          8.9            14.2          9.8
10    40            38.2         10.4            13.1          9.6

Table 7: Estimated expected cost (in seconds) and expected loss of an incorrect ranking of DS-2 aeroshell designs.

3.2.3 Lossless Image Compression on Galileo Image Data

This problem utilizes a large set of raw image data acquired by the Galileo spacecraft.
Each image is 256 by 256 pixels, with greyscale intensities ranging from 0 to 255. The goal is to select the lossless compression method¹⁶ that performs best on this class of images. The performance of an image compression algorithm on a particular image could be measured in a number of ways; for example, execution time, compression ratio, and image quality (in the case where lossy compression methods are being considered) could all define algorithm performance. In our tests we chose the compression ratio achieved by a given compression method as our utility function. To sample a method (hypothesis), an image is randomly selected, the method is applied to that image, and the achieved compression ratio is recorded.

Tables 8 and 9 give the results of ranking three, five, and seven hypotheses using the Turnbull, PAC, and expected loss algorithms. Ranking correctness was determined by comparison to a "correct" ranking established by sampling each compression method on a set of 1,500 distinct images. We again note the substantial performance improvement of the PAC-based algorithms over the Turnbull algorithm. Although both the Turnbull algorithm and the PAC algorithms (Table 8) achieved the desired confidence level, the adjacency version of the EL algorithm (Table 9) failed to bound the loss to the specified level in over half the cases.

It is interesting to consider the results presented in this section in light of the fact that each of the statistical techniques being used makes some form of normality assumption. In fact, all three of the problem domains we investigate have some number of hypotheses whose utility functions are not normally distributed. From past experience it is known that utility functions in the DS-2 Penetrator domain (Section 3.2.1) are highly non-normal; Figure 4 illustrates the difference between data that is normally distributed and data that is not.
Figure 4: A comparison of (a) data that is normally distributed with high likelihood and (b) data that is very likely not normally distributed. In each case, the histogram of experimental data is shown in solid boxes; data drawn from a normal distribution with the same mean and standard deviation is shown with dashed lines.

To determine the extent to which the utilities of hypotheses in the remaining two domains are normally distributed, we applied the Kolmogorov-Smirnov test (see Appendix A for details).

16. The seven compression methods we considered were: CALIC, lossless JPEG, GIF, TIFF, pack, gzip, and compress.

 k    γ*    σ/ε   TURNBULL        PACrec         PACadj
 3   0.90    2      62.8 (1.00)    30.1 (1.00)    14.8 (1.00)
 3   0.90    3     150.8 (1.00)    30.5 (1.00)    14.8 (1.00)
 3   0.95    2      84.6 (1.00)    28.6 (1.00)    15.0 (1.00)
 3   0.95    3     206.8 (1.00)    29.0 (1.00)    20.5 (1.00)
 3   0.99    2     142.0 (1.00)    30.1 (1.00)    23.3 (1.00)
 3   0.99    3     359.4 (1.00)    30.6 (1.00)    23.2 (1.00)
 5   0.90    2     134.7 (1.00)    39.5 (1.00)    29.9 (1.00)
 5   0.90    3     329.9 (1.00)    39.9 (1.00)    30.0 (1.00)
 5   0.95    2     176.1 (1.00)    39.3 (1.00)    29.8 (1.00)
 5   0.95    3     399.8 (1.00)    39.3 (1.00)    29.6 (1.00)
 5   0.99    2     249.3 (1.00)    39.2 (1.00)    29.9 (1.00)
 5   0.99    3     598.1 (1.00)    39.2 (1.00)    30.7 (1.00)
 7   0.90    2     210.8 (1.00)    35.6 (1.00)    37.2 (1.00)
 7   0.90    3     499.3 (1.00)    35.7 (1.00)    34.5 (1.00)
 7   0.95    2     250.3 (1.00)    37.4 (1.00)    35.6 (1.00)
 7   0.95    3     608.7 (1.00)    36.0 (1.00)    35.0 (1.00)
 7   0.99    2     339.6 (1.00)    36.5 (1.00)    34.5 (1.00)
 7   0.99    3     813.7 (1.00)    37.2 (1.00)    35.3 (1.00)

Table 8: Estimated expected cost (in seconds) to rank lossless image compression approaches on Galileo image data. Achieved probability of correct ranking is shown in parentheses.
      Parameters      ELrec                        ELadj
 k    H*        Execution Cost    Loss       Execution Cost    Loss
 3    10            31.7          0.0            24.9          0.0
 3     5            32.5          0.0            24.9          0.0
 3     1            33.7          0.0            24.9          0.0
 5    10            80.6          0.0            32.7         17.4
 5     5            83.5          0.0            33.7         69.4
 5     1           101.0          0.0            32.3         49.4
 7    10            99.5          0.0            42.3         17.4
 7     5           105.7          0.0            33.3         34.7
 7     1           119.8          0.0            30.4         86.8

Table 9: Estimated expected cost (in seconds) and expected loss of an incorrect ranking of lossless image compression approaches on Galileo image data.

The test determined that none of the ten hypotheses from the DS-2 Aeroshell domain (Section 3.2.2) had normally distributed utility. Additionally, only two of the seven hypotheses from the image compression domain (Section 3.2.3) were shown to have greater than 90% likelihood of having normally distributed utility functions.¹⁷ For these reasons, evaluating the ranking strategies on these datasets provides a particularly strong test of the applicability of the techniques. We draw the reader's attention to the particularly large disparity in performance between the Turnbull algorithm and the PAC-based algorithms in the image compression domain, which is especially apparent when the number of hypotheses and the confidence level are high. This problem domain has two hypotheses with normally distributed utility and five that are non-normal. These observations suggest that the PAC-based algorithms perform better (in relative terms) when faced with a domain that violates the assumption of normality.

4. Discussion and Conclusions

There are a number of areas of related work. First, there has been considerable analysis of hypothesis selection problems. Selection problems have been formalized using a Bayesian framework (Moore & Lee, 1994; Rivest & Sloan, 1988) that does not require an initial sample but uses a rigorous encoding of prior knowledge. Howard (1970) also details a Bayesian framework for analyzing learning cost for selection problems.
If one uses a hypothesis selection framework for ranking, the allocation of pairwise errors can be performed rationally (Gratch et al., 1994). Reinforcement learning with immediate feedback (Kaelbling, 1993) can also be viewed as a hypothesis selection problem.

The framework presented invites future work in a number of directions. Currently, the stopping criteria used are relaxations of the ranking requirement. Another approach would be to bound the resources available for ranking: limiting the number of samples where sample cost is high, and limiting the computation time (yielding an anytime algorithm), are two straightforward application areas.

Another area for future work is the discovery of composite strategies or hypotheses. Thus far we have examined ranking (and, in other articles, selection) of the hypothesis with the highest expected value over an entire distribution; for example, learning a scheduling control strategy that will do well over a distribution of problems. However, it is likely that for most distributions of problems there exists a composite strategy that would outperform any single strategy. For example, a single strategy might be to apply method A to solve a problem; a composite strategy would be to test the problem for feature X and, if X is true, apply method A, else apply method B. These composite strategies correspond to algorithm portfolios, as they are called in operations research. Indeed, the results of applying methods could also be used to form strategies: one might have the composite strategy of trying method A for 10 CPU seconds and then, if that fails, trying method B. Of course, in all of these composition and portfolio approaches, the difficulty is efficiently proposing and evaluating plausible compositions; for even a small set of base strategies, the number of compositions is enormous.

17. For reference, the data in Figure 4(a) was normally distributed with 97.5% likelihood according to the Kolmogorov-Smirnov test.
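The two flavors of composition described above can be sketched as follows; the method names, the feature test, and the budget (counted in calls here rather than CPU seconds) are all illustrative:

```python
def composite_strategy(problem, has_feature_x, method_a, method_b):
    """Feature-gated composition: apply method A when feature X holds
    for the problem, otherwise fall back to method B."""
    return method_a(problem) if has_feature_x(problem) else method_b(problem)

def portfolio_strategy(problem, method_a, method_b, budget=10):
    """Budgeted composition: try method A under a resource budget and,
    if it fails (returns None), fall back to method B."""
    result = method_a(problem, budget)
    return result if result is not None else method_b(problem)
```

Each such composition is itself a hypothesis whose expected utility over the problem distribution could be estimated and ranked with the techniques in this paper; the combinatorial number of compositions is exactly the difficulty noted above.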
In summary, this paper has described the hypothesis ranking problem, an extension of the hypothesis selection problem. We defined the application of two decision criteria, probably approximately correct and expected loss, to this problem. We then defined two families of algorithms, recursive selection and adjacency, for the solution of hypothesis ranking problems. Finally, we demonstrated the effectiveness of these algorithms on both synthetic and real-world datasets, documenting improved performance over existing statistical approaches.

Acknowledgments

This work was performed by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration.

Appendix A. Applying the K-S Test to Real Datasets

The Kolmogorov-Smirnov test is a statistical means of accepting, with a certain level of confidence, the hypothesis that some sample set fits a parametric distribution with a given set of parameters. The method compares the CDF of the empirical distribution to that of the corresponding parametric distribution (i.e., with estimated parameters). The K-S test gives a confidence based on the maximum discrepancy, D, between these two CDFs:

    D = max |F1(x) − F2(x)|

For our purposes we wish to determine, for each hypothesis in a given domain, whether the values of the utility function are normally distributed. In each case, half of the utility samples taken are used to compute the mean and standard deviation of the normal; the remaining half are used to compute the empirical CDF.

A.1 DS-2 Penetrator (20,000 samples taken)

design number    max|F1(x) − F2(x)|    normally distributed?
 1               0.1415                ≪ 90% likely
 2               0.1202                ≪ 90% likely
 3               0.1020                ≪ 90% likely
 4               0.1261                ≪ 90% likely
 5               0.1207                ≪ 90% likely
 6               0.1261                ≪ 90% likely
 7               0.1020                ≪ 90% likely
 8               0.1493                ≪ 90% likely
 9               0.1461                ≪ 90% likely
10               0.1261                ≪ 90% likely

A.2 DS-2 Aeroshell Design Ranking (500 samples taken)

design number    max|F1(x) − F2(x)|    normally distributed?
 1               0.08                  < 90% likely
 2               0.08                  < 90% likely
 3               0.08                  < 90% likely
 4               0.08                  < 90% likely
 5               0.08                  < 90% likely
 6               0.08                  < 90% likely
 7               0.08                  < 90% likely
 8               0.08                  < 90% likely
 9               0.08                  < 90% likely
10               0.08                  < 90% likely

A.3 Lossless Image Compression on Galileo Image Data (200 samples taken)

compression method    max|F1(x) − F2(x)|    normally distributed?
gif                   0.10                  > 90% likely
compress              0.14                  < 90% likely
calic                 0.19                  ≪ 90% likely
gzip                  0.09                  97.5% likely
jpegls                0.18                  ≪ 90% likely
pack                  0.12                  < 90% likely
tiff                  0.11                  < 90% likely

References

Balacuit, C. P. (1997). Deep Space 2 - Mars Microprobe Home Page (mission objectives statement). Tech. rep., NASA/JPL. http://nmp.jpl.nasa.gov/ds2

Chien, S. A., Gratch, J. M., & Burl, M. C. (1995). On the Efficient Allocation of Resources for Hypothesis Evaluation: A Statistical Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), 652-665.

Chien, S. A., Stechert, A. D., & Mutz, D. H. (1998). Efficient Heuristic Ranking of Hypotheses. In Advances in Neural Information Processing Systems 10 (Jordan, Kearns, & Solla, Eds.), pp. 444-450, Denver, CO.

Ginsberg, M., & Harvey, W. (1992). Iterative Broadening. Artificial Intelligence, 55, 367-383.

Goldberg, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.

Gratch, J. (1992). COMPOSER: A Probabilistic Solution to the Utility Problem in Speed-up Learning. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 235-240, San Jose, CA.

Gratch, J. (1993). COMPOSER: A Decision-theoretic Approach to Adaptive Problem Solving. Tech. rep. UIUCDCS-R-93-1806, Department of Computer Science, University of Illinois.

Gratch, J., Chien, S., & DeJong, G. (1994). Improving Learning Performance Through Rational Resource Allocation. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 576-582, Seattle, WA.

Greiner, R., & Jurisica, I. (1992). A Statistical Approach to Solving the EBL Utility Problem. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 241-248, San Jose, CA.

Haseeb, R. M. (1985). Modern Statistical Selection. American Sciences Press, Columbus, OH.

Howard, R. A. (1970). Decision Analysis: Perspectives on Inference, Decision, and Experimentation. Proceedings of the IEEE, 58(5), 823-834.

Kaelbling, L. P. (1993). Learning in Embedded Systems. MIT Press, Cambridge, MA.

Minton, S. (1988). Learning Search Control Knowledge: An Explanation-Based Approach. Kluwer Academic Publishers, Norwell, MA.

Moore, A. W., & Lee, M. S. (1994). Efficient Algorithms for Minimizing Cross-Validation Error. In Proceedings of the International Conference on Machine Learning, New Brunswick, NJ.

Musick, R., Catlett, J., & Russell, S. (1993). Decision Theoretic Subsampling for Induction on Large Databases. In Proceedings of the International Conference on Machine Learning, pp. 212-219, Amherst, MA.

Rivest, R. L., & Sloan, R. (1988). A New Model for Inductive Inference. In Proceedings of the Second Conference on Theoretical Aspects of Reasoning about Knowledge.

Russell, S., & Wefald, E. (1992). Do the Right Thing: Studies in Limited Rationality. MIT Press, Cambridge, MA.

Tadepalli, P. (1992). A Theory of Unsupervised Speedup Learning. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 229-234, San Jose, CA.

Turnbull, B. W., & Weiss, L. I. (1984). A Class of Sequential Procedures for k-sample Problems Concerning Normal Means with Unknown Equal Variances. In Santner, T. J., & Tamhane, A. C. (Eds.), Design of Experiments: Ranking and Selection, pp. 225-240. Marcel Dekker.

Valiant, L. G. (1984). A Theory of the Learnable. Communications of the ACM, 27, 1134-1142.