T. Elomaa and M. Kääriäinen

Volume 15, 2001


Top-down induction of decision trees has been observed to suffer from the inadequate functioning of the pruning phase. In particular, it is known that the size of the resulting tree grows linearly with the sample size, even though the accuracy of the tree does not improve. Reduced Error Pruning is an algorithm that has been used as a representative technique in attempts to explain the problems of decision tree learning.

In this paper we present analyses of Reduced Error Pruning in three different settings. First we study the basic algorithmic properties of the method, properties that hold independent of the input decision tree and pruning examples. Then we examine a situation that intuitively should lead to the subtree under consideration being replaced by a leaf node, one in which the class label and attribute values of the pruning examples are independent of each other. This analysis is conducted under two different assumptions. The general analysis shows that the pruning probability of a node fitting pure noise is bounded by a function that decreases exponentially as the size of the tree grows. In a more specific analysis we assume that the examples are distributed uniformly to the tree. This assumption lets us approximate the number of subtrees that are pruned because they do not receive any pruning examples.

This paper clarifies the different variants of the Reduced Error Pruning algorithm, brings new insight to its algorithmic properties, analyses the algorithm with fewer imposed assumptions than before, and includes the previously overlooked empty subtrees in the analysis.



Journal of Artificial Intelligence Research 15 (2001) 163-187. Submitted 10/00; published 09/01.

An Analysis of Reduced Error Pruning

Tapio Elomaa (elomaa@cs.helsinki.fi)
Matti Kääriäinen (matti.kaariainen@cs.helsinki.fi)
Department of Computer Science, P.O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland

1. Introduction

Decision tree learning is usually a two-phase process (Breiman, Friedman, Olshen, & Stone, 1984; Quinlan, 1993). First a tree reflecting the given sample as faithfully as possible is constructed. If no noise prevails, the accuracy of the tree is perfect on the training examples that were used to build it. In practice, however, the data tends to be noisy, which may introduce contradicting examples into the training set. Hence, 100% accuracy cannot necessarily be obtained even on the training set. In any case, the resulting decision tree is overfitted to the sample: in addition to the general trends of the data, it encodes the peculiarities and particularities of the training data, which makes it a poor predictor of the class label of future instances. In the second phase of induction, the decision tree is pruned in order to reduce its dependency on the training data. Pruning aims at removing from the tree those parts that are likely to be due only to chance properties of the training set.

The problems of the two-phased top-down induction of decision trees are well known and have been extensively reported (Catlett, 1991; Oates & Jensen, 1997, 1998). The size of the tree grows linearly with the size of the training set, even though after a while no accuracy is gained through the increased tree complexity. Obviously, pruning is intended to fight this effect.
Another defect is observed when the data contains no relevant attributes; i.e., when the class labels of the examples are independent of their attribute values. Clearly, a single-node tree predicting the majority label of the examples should result in this case, since no help can be obtained by querying the attribute values. In practice, though, large decision trees are often built from such data.

Many alternative pruning schemes exist (Mingers, 1989a; Esposito, Malerba, & Semeraro, 1997; Frank, 2000). They differ, e.g., on whether a single pruned tree or a series of pruned trees is produced, whether a separate set of pruning examples is used, which aspects (classification error and tree complexity) are taken into account in pruning decisions, how these aspects are determined, and whether a single scan through the tree suffices or iterative processing is required. The basic pruning operation that is applied to the tree is the replacement of an internal node, together with the subtree rooted at it, with a leaf. More elaborate tree-restructuring operations are also used by some pruning techniques (Quinlan, 1987, 1993). In this paper, the only pruning operation that is considered is the replacement of a subtree by the majority leaf, i.e., a leaf labeled by the majority class of the examples reaching it. Hence, a pruning of a tree is a subtree of the original tree with zero, one, or more internal nodes changed into leaves.

Reduced Error Pruning (subsequently REP for short) was introduced by Quinlan (1987) in the context of decision tree learning. It has subsequently been adapted to rule set learning as well (Pagallo & Haussler, 1990; Cohen, 1993). REP is one of the simplest pruning strategies. In practical decision tree pruning REP is seldom used, because it has the disadvantage of requiring a separate set of examples for pruning. Moreover, it is considered too aggressive a pruning strategy, one that overprunes the decision tree and deletes relevant parts from it (Quinlan, 1987; Esposito et al., 1997). The need for a pruning set is often considered harmful because of the scarceness of the data. However, in the data mining context the examples are often abundant and setting a part of them aside for pruning purposes presents no problem. Despite its shortcomings, REP is a baseline method to which the performance of other pruning algorithms is compared (Mingers, 1989a; Esposito, Malerba, & Semeraro, 1993; Esposito et al., 1997). It presents a good starting point for understanding the strengths and weaknesses of two-phased decision tree learning and offers insight into decision tree pruning. REP has the advantage of producing the smallest pruning among those that are the most accurate with respect to the pruning set.

Recently, Oates and Jensen (1999) analyzed REP in an attempt to explain why and when decision tree pruning fails to control the growth of the tree, even though the data do not warrant the increased size. We approach the same subject, but try to avoid restricting the analysis with unnecessary assumptions. We also consider an explanation for the unwarranted growth of the size of the decision tree.

In this paper we analyze REP in three different settings. First, we explore the basic algorithmic properties of REP, which apply regardless of the distribution of examples presented to the learning algorithm. Second, we study, in a probabilistic setting, the situation in which the attribute values are independent of the classification of an example.
Even though this pure noise fitting situation is not expected to arise when the whole pruning set is considered, it is encountered at lower levels of the tree, when all relevant attributes have already been exhausted. We further assume that all subtrees receive at least one pruning example, so that none of them can be directly pruned due to not receiving any examples. The class value is also assigned at random to the pruning examples. In our third analysis it is assumed that each pruning example has an equal chance to end up in any one of the subtrees of the tree being pruned. This rather theoretical setting lets us take into account those subtrees that do not receive any examples. They have been left without attention in earlier analyses.

The rest of this paper is organized as follows. The next section discusses the different versions of the REP algorithm and fixes the one that is analyzed subsequently. In Section 3 we review earlier analyses of REP. Basic algorithmic properties of REP are examined in Section 4. Then, in Section 5, we carry out a probabilistic analysis of REP, without making any assumptions about the distribution of examples. We derive a bound for the pruning probability of a tree which depends exponentially on the relation of the number of pruning examples and the size of the tree. Section 6 presents an analysis which assumes that the pruning examples distribute uniformly to the subtrees of the tree. This assumption lets us sharpen the preceding analysis in certain respects. However, the bounds of Section 5 hold with certainty, while those of Section 6 are approximate results. Further related research is briefly reviewed in Section 7 and, finally, in Section 8 we present the concluding remarks of this study.

2. Reduced Error Pruning Algorithm

REP was never introduced algorithmically by Quinlan (1987), which is a source of much confusion. Even though REP is considered, and appears to be, a very simple, almost trivial, algorithm for pruning, there are many different algorithms that go under the same name. No consensus exists on whether REP is a bottom-up algorithm or an iterative method. Neither is it obvious whether the training set or the pruning set is used to decide the labels of the leaves that result from pruning.

2.1 High-Level Control

Quinlan's (1987, pp. 225-226) original description of REP does not clearly specify the pruning algorithm and leaves room for interpretation. It includes, e.g., the following characterizations.

"For every non-leaf subtree S of T we examine the change in misclassifications over the test set that would occur if S were replaced by the best possible leaf. If the new tree would give an equal or fewer number of errors and S contains no subtree with the same property, S is replaced by the leaf. The process continues until any further replacements would increase the number of errors over the test set. [...] the final tree is the most accurate subtree of the original tree with respect to the test set and the smallest tree with that accuracy."

Quinlan (1987, p. 227) later adds the following description of pessimistic pruning.
"This method [pessimistic pruning] has two advantages. It is much faster than either of the preceding methods [cost-complexity and reduced error pruning] since each subtree is examined at most once."

On one hand this description requires the nodes to be processed in a bottom-up manner, since subtrees must be checked for "the same property" before pruning a node; on the other hand, the last quotation would indicate REP to be an iterative method. We take REP to have the following single-scan bottom-up control strategy, as in most other studies (Oates & Jensen, 1997, 1998, 1999; Esposito et al., 1993, 1997; Kearns & Mansour, 1998).

Nodes are pruned in a single bottom-up sweep through the decision tree; pruning each node is considered as it is encountered. The nodes are processed in postorder.

By this order of node processing, any tree that is a candidate for pruning cannot itself contain a subtree that could still be pruned without increasing the tree's error.

Due to the ambiguity of REP's definition, a different version of REP also lives on (Mingers, 1989a; Mitchell, 1997). It is probably due to Mingers' (1989a) interpretation of Quinlan's ambiguous definition.

Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the pruning set. The process continues until further pruning is harmful.

However, this algorithm appears to be incorrect. Esposito et al. (1993, 1997) have shown that a tree produced by this algorithm does not meet the objective of being the most accurate subtree with respect to the pruning set. Moreover, this algorithm overlooks the explicit requirement of checking whether a subtree would lead to a reduction of the classification error.

Other iterative algorithms could be induced from Quinlan's original description. However, if the explicit requirement of checking whether a subtree could be pruned before pruning a supertree is obeyed, then these versions of REP all reduce to the more efficient bottom-up algorithm.

2.2 Leaf Labeling

Another source of confusion in Quinlan's (1987) description of REP is that it is not clearly specified how to choose the labels for the leaves that are introduced to the tree through pruning. Oates and Jensen (1999) interpreted that the intended algorithm would label the new leaves according to the majority class of the training examples, but themselves analyzed a version of the algorithm in which the new leaves obtain as their labels the majority of the pruning examples. Oates and Jensen motivated their choice by the empirical observation that in practice there is very little difference between choosing the leaf labels in either way. However, choosing the labels of pruned leaves according to the majority of pruning examples puts such leaves in a different status than the original leaves, which have as their label the majority class of the training examples.

Example. Figure 1 shows a decision tree that will be pruned into a single leaf if the training examples are used to label pruned leaves. A negative leaf replaces the root of the tree and makes two mistakes on the pruning examples, while the original tree makes three mistakes. With this tree we can illustrate an important difference between using training and pruning examples to label pruned leaves.

Figure 1: A (part of a) decision tree. The labels inside the nodes denote the majority classes of the training examples arriving at these nodes.
For leaves, the numbers of pruning examples from the two classes are also given; x/y means that x negative and y positive instances reach the leaf.

Using training examples and proceeding bottom-up, observe that neither subtree is pruned, since the left one replaced with a negative leaf would make two mistakes instead of the original one mistake. Similarly, the right subtree replaced with a positive leaf would result in an increased number of classification errors. Nevertheless, the root node, even though its subtrees have not been pruned, can still be pruned.

When pruning examples are used to label pruned leaves, a node with two non-trivial subtrees cannot be pruned unless both its subtrees are collapsed into leaves. The next section will prove this. In the tree of Figure 1 both subtrees would be collapsed into zero-error leaves. However, in this case the root node will not be pruned.

A further possibility for labeling the leaf nodes would be to take both training and pruning examples into account in deciding the label of a pruned leaf. Depending on the relation of the numbers of training and pruning examples, this strategy resembles one or the other of the above-described approaches. Usually the training examples are more numerous than the pruning examples, and will thus dominate. In practice it is impossible to distinguish this labeling strategy from that of using the majority of the training examples.

2.3 Empty Subtrees

Since REP uses different sets of examples to construct and to prune a decision tree, it is possible that some parts of the tree do not receive any examples in the pruning phase. Such parts of the decision tree can, naturally, be replaced with a single leaf without changing the number of classification errors that the tree makes on the pruning examples. In other words, subtrees that do not obtain any pruning examples are always pruned. Quinlan (1987) already noted that the parts of the original tree that correspond to rarer special cases, which are not represented in the pruning set, may be excised.

```
DecisionTree REP( DecisionTree T, ExampleArray S ) {
    for ( i = 0 to S.length-1 )
        classify( T, S[i] );
    prune( T );
    return T;
}

void classify( DecisionTree T, Example e ) {
    T.total++;                               // update node counters
    if ( e.label == 1 ) T.pos++;
    if ( !leaf(T) )
        if ( T.test(e) == 0 ) classify( T.left, e );
        else classify( T.right, e );
}

int prune( DecisionTree T ) {                // returns the classification error after pruning T
    if ( leaf(T) )
        if ( T.label == 1 ) return T.total - T.pos;
        else return T.pos;
    error = prune( T.left ) + prune( T.right );
    if ( error < min( T.pos, T.total - T.pos ) )
        return error;                        // the subtree is strictly more accurate: keep it
    replace T with a leaf;                   // otherwise collapse T into a majority leaf
    if ( T.pos > T.total - T.pos ) { T.label = 1; return T.total - T.pos; }
    else { T.label = 0; return T.pos; }
}
```

Table 1: The REP algorithm. The algorithm first classifies the pruning examples in a top-down pass using method classify and then, during a bottom-up pass, prunes the tree using method prune.

Intuitively, it is not clear which is the best-founded strategy for handling empty subtrees, those that do not receive any examples. On one hand they obtain support from the training set, which usually is more numerous than the pruning set; on the other hand, the fact that no pruning example corresponds to these parts of the tree would justify concluding that these parts of the decision tree were built by chance properties of the training data.
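Both observations above, that ties are resolved in favor of the smaller tree and that subtrees receiving no pruning examples therefore always collapse, can be read directly off the pseudocode of Table 1. The following Python sketch mirrors that pseudocode; the Node class, the example representation (an attribute vector paired with a 0/1 label), and the convention that a node test returns 0 for "go left" are illustrative choices of this sketch, not the authors' implementation.

```python
class Node:
    def __init__(self, test=None, left=None, right=None, label=0):
        self.test = test      # function mapping an attribute vector to 0 or 1; None for a leaf
        self.left = left
        self.right = right
        self.label = label    # class predicted if this node is (or becomes) a leaf
        self.total = 0        # number of pruning examples reaching this node
        self.pos = 0          # number of positive pruning examples reaching this node

    def is_leaf(self):
        return self.test is None

def classify(node, example):
    """Top-down pass: route one pruning example and update the node counters."""
    x, y = example
    node.total += 1
    if y == 1:
        node.pos += 1
    if not node.is_leaf():
        child = node.left if node.test(x) == 0 else node.right
        classify(child, example)

def prune(node):
    """Bottom-up pass: return the pruning-set error of the (possibly pruned) subtree."""
    if node.is_leaf():
        return node.total - node.pos if node.label == 1 else node.pos
    error = prune(node.left) + prune(node.right)
    leaf_error = min(node.pos, node.total - node.pos)
    if error < leaf_error:            # strictly more accurate: keep the subtree
        return error
    # ties collapse to a leaf; an empty subtree (0 errors on both sides) is such a tie
    node.test = node.left = node.right = None
    node.label = 1 if node.pos > node.total - node.pos else 0
    return node.total - node.pos if node.label == 1 else node.pos

def rep(tree, pruning_set):
    for example in pruning_set:
        classify(tree, example)
    prune(tree)
    return tree
```

A toy invocation, with made-up attribute vectors and labels, shows an empty right subtree and a tied root both collapsing:

```python
def split_on(i, threshold):
    return lambda x: 0 if x[i] < threshold else 1

leaf = lambda label: Node(label=label)
tree = Node(test=split_on(0, 0.5),
            left=Node(test=split_on(1, 0.5), left=leaf(0), right=leaf(1)),
            right=Node(test=split_on(1, 0.5), left=leaf(1), right=leaf(0)))

# every example is routed into the left subtree; the labels are mostly 0
pruning_set = [((0.1, 0.2), 0), ((0.2, 0.8), 0), ((0.3, 0.9), 1), ((0.4, 0.1), 0)]
rep(tree, pruning_set)
print(tree.is_leaf(), tree.label)   # True 0: the whole tree collapses to a single 0-leaf
```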
In REP, consistently with its preference for smaller prunings elsewhere, the latter view is adopted: subtrees that receive no pruning examples are collapsed. The problem of empty subtrees is connected to the problem of small disjuncts in machine learning algorithms (Holte, Acker, & Porter, 1989). A small disjunct covers only a small number of the training examples. Collectively the small disjuncts are responsible for a small number of classification decisions, but they accumulate most of the error of the whole concept. Nevertheless, small disjuncts cannot be eliminated altogether without adversely affecting other disjuncts in the concept.

2.4 The Analyzed Pruning Algorithm

Let us briefly reiterate the details of the REP algorithm that is analyzed subsequently. As already stated, the control strategy of the algorithm is single-sweep bottom-up processing. First, a top-down traversal drives the pruning examples through the tree to the appropriate leaves. The counters of the nodes en route are updated. Second, during a bottom-up traversal the pruning operations indicated by the classification errors are executed. The errors can be determined on the basis of the node counter values. In the bottom-up traversal each node is visited only once. The pruned leaves are labeled by the majority of the pruning set (see Table 1).

3. Previous Work

Pruning of decision trees has recently received a lot of analytical attention; existing pruning methods have been analyzed (Esposito et al., 1993, 1997; Oates & Jensen, 1997, 1998, 1999) and new analytically founded pruning techniques have been developed (Helmbold & Schapire, 1997; Pereira & Singer, 1999; Mansour, 1997; Kearns & Mansour, 1998). Also, many empirical comparisons of pruning methods have appeared (Mingers, 1989a; Malerba, Esposito, & Semeraro, 1996; Frank, 2000). In this section we review earlier work that concerns the REP algorithm. Further related research is considered in Section 7.

Esposito et al. (1993) viewed the REP algorithm, among other pruning methods, as a search process in the state space. In addition to noting that the iterative version of REP cannot produce the optimal result required by Quinlan (1987), they also observed that even though REP is a linear-time algorithm in the size of the tree, with respect to the height of the tree REP requires exponential time in the worst case. In their subsequent comparative analysis Esposito et al. (1997) sketched a proof for Quinlan's (1987) claim that the pruning produced by REP is the smallest among the most accurate prunings of the given decision tree.

The bias of REP was briefly examined by Oates and Jensen (1997, 1998). They observed that the error, r_L, of the best majority leaf that could replace a subtree T depends only on (the class distribution of) the examples that reach the root N of T. In other words, only the tree structure above T and N decides the error r_L. Let r_T denote the error of the subtree T at the moment when the pruning sweep reaches N; i.e., when some pruning may already have taken place in T. All pruning operations performed in T have either led r_T to decrease from the initial situation or left it unchanged. In any case, pruning that has taken place in T potentially decreases r_T, but does not affect r_L. Hence, the probability that r_T < r_L, i.e., that T will not be pruned, increases through pruning in T. This error propagation bias is inherent to REP.
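As a hypothetical numeric illustration of this bias (all counts below are invented, not taken from the paper), suppose 100 pruning examples reach node N, 40 of them positive. Then r_L is fixed by that class distribution alone, whereas pruning inside T can only keep r_T unchanged or lower it:

```python
# Hypothetical counts illustrating the error propagation bias described above.
r_L = min(40, 100 - 40)     # error of the best majority leaf that could replace T: 40
r_T_initial = 45            # error of the unpruned subtree T on the same pruning examples
r_T_after = 38              # error of T after REP has pruned inside it
print(r_T_initial < r_L)    # False: before pruning beneath N, the node would be collapsed
print(r_T_after < r_L)      # True: after pruning beneath N, the node is retained
```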
Oates and Jensen (1997, 1998) conjecture that the larger the original tree and the smaller the pruning set, the larger this effect, because a large tree provides more pruning opportunities and the high variance of a small pruning set offers more random chances for r_L ≤ r_T. Subsequently we study some of these effects exactly.

In a follow-up study Oates and Jensen (1999) used REP as a vehicle for explaining the problems that have been observed in the pruning phase of top-down induction of decision trees. They analyzed REP in a situation in which the decision node under consideration fits noise, i.e., when the class of the examples is independent of the value of the attribute tested in the node at hand, and built a statistical model of REP in this situation. It indicates, consistently with their earlier considerations, that even though the probability of pruning a node that fits noise is close to 1 prior to pruning beneath it, pruning that occurs beneath the node reduces its pruning probability to close to 0. In particular, this model shows that if even one descendant of node N at depth d is not pruned, then N will not be pruned (assuming there are no leaves until depth d + 1). The consequence of this result is that increasing the depth d leads to an exponential decrease of the node's pruning probability.

The first part of Oates and Jensen's (1999) analysis is easy to comprehend, but its significance is uncertain, because this situation does not arise in any bottom-up pruning strategy. The statistical model is based on the assumption that the number, n, of pruning instances that pass through the node under consideration is large, in which case, independence assumptions prevailing, the errors committed by the node can be approximated by the normal distribution. The expected error of the original tree is the mean of the distribution, while, if pruned to a leaf, the tree would misclassify a proportion of the n examples that corresponds to that of the minority class. Oates and Jensen show that the latter number is always less than the mean of the distribution of errors. Hence, the probability of pruning is over 0.5 and approaches 1 as n grows.

In the second part of the analysis, in considering the pruning probability of a node N after pruning has taken place beneath it, Oates and Jensen assume that the proportion of positive examples in any descendant of N at depth d is the same as in N. In this setting, assuming further that N has a positive majority, all its descendants at level d also have a positive majority. It directly follows that if all descendants at level d are pruned, they are all replaced by a positive leaf. Hence, the function represented by this pruning is identically positive. The majority leaf that would replace N also represents the same function and is smaller than the above pruning. Therefore, REP will choose the single-leaf pruning. On the other hand, if one or more of the descendants of N at depth d are not pruned, then the pruning of the tree rooted at N, in which these subtrees are maintained and all other nodes at level d are pruned into positive leaves, is more accurate than the majority leaf. In this case the tree will not be pruned.

Oates and Jensen (1999) also assume that, starting from any node at level d, the probability of routing an example to a positive leaf is the same.
In the following analyses we try to get rid of all unnecessary assumptions; the same results can be obtained without any knowledge of the example distribution.

4. Basic Properties of REP

Before going to the detailed probabilistic analysis of the REP algorithm, we examine some of its basic algorithmic properties. Throughout this paper we consider the binary case for simplicity. The results, however, also apply with many-valued attributes and several classes.

Now that the processing control of the REP algorithm has been settled, we can actually prove Quinlan's (1987) claim of the optimality of the pruning produced by REP. Observe that the following result holds true independent of the leaf labeling strategy.

Theorem 1. Applying REP with a set of pruning examples S to a decision tree T produces T', a pruning of T, such that it is the smallest of those prunings of T that have minimal error with respect to the example set S.

Proof. We prove the claim by induction over the size of the tree. Observe that a decision tree T is a full binary tree, which has 2L(T) - 1 nodes, where L(T) is the number of leaves in the tree.

Base case. If L(T) = 1, then the original tree T consists of a single leaf node. T is the only possible pruning of itself. Thus it is, trivially, also the smallest among the most accurate prunings of T.

Inductive hypothesis. The claim holds when L(T) < k.

Inductive step. Let L(T) = k. Let N be the root of the tree and T_0 and T_1 the left and the right subtree, respectively. Subtrees T_0 and T_1 must have strictly fewer than k leaves. When the pruning decision for N is taken, then, by the bottom-up recursive control strategy of REP, T_0 and T_1 have already been processed by the algorithm. By the inductive hypothesis, the subtrees after pruning, T'_0 and T'_1, are the smallest possible among the most accurate prunings of these trees.

(i) Accuracy. The pruning decision for the node N consists of choosing whether to collapse N and the tree rooted at it into a majority leaf, or whether to maintain the whole tree. If both alternatives make the same number of errors, then N is collapsed and the original accuracy with respect to the pruning set is retained. Otherwise, by the REP algorithm, the pruning decision is based on which of the resulting trees would make fewer errors with respect to the pruning set S. Hence, whichever choice is made, the resulting tree T' will make the smaller number of errors with respect to S.

Let us now assume that a pruning T'' of T makes even fewer errors with respect to S than T'. Then T'' must consist of the root N and two subtrees T''_0 and T''_1, because the majority leaf cannot be more accurate than T'. Since T'' is a more accurate pruning of T than T', it must be that either T''_0 is a more accurate pruning of T_0 than T'_0 or T''_1 is a more accurate pruning of T_1 than T'_1. By the inductive hypothesis both possibilities are false. Therefore, T' is the most accurate pruning of T.

(ii) Size. To see that the chosen alternative is also as small as possible, first assume that T' consists of a single leaf. Such a tree is the smallest pruning of T, and in this case the claim follows. Otherwise, T' consists of the root node N and the two pruned subtrees T'_0 and T'_1. Since this tree was not collapsed, it must be more accurate than the tree consisting of a single majority leaf.
Now assume that there exists a pruning T''' of T that is as accurate as T' but smaller. Because the majority leaf is less accurate than T', T''' must consist of the root node N and two subtrees T'''_0 and T'''_1. Then either T'''_0 is a smaller pruning of T_0 than T'_0, but as accurate, or T'''_1 is a smaller pruning of T_1 than T'_1, but as accurate. Both cases contradict the inductive hypothesis. Hence, T' is the smallest among the most accurate prunings of T.

Thus, in any case, the claim follows for T. □

We consider next the situation in an internal node of the tree, when the bottom-up pruning sweep reaches the node. From now on we are committed to leaf labeling by the majority of the pruning examples.

Theorem 2. An internal node which prior to pruning had no leaves as its children will not be pruned by REP if it has a non-trivial subtree when the bottom-up pruning sweep reaches it.

Proof. For an internal node N we have two possible cases in which it has non-trivial subtrees: either both its subtrees are non-trivial (non-leaf) or one of them is trivial. Let us review these cases. Let r_T denote the error of (sub)tree T with respect to the part of the pruning set that reaches the root of T. By r_L we denote the misclassification rate of the majority leaf L that would replace T, if T were chosen to be pruned.

Case I: Let the two subtrees of T, T_0 and T_1, be non-trivial. Hence, both of them have been retained when the pruning sweep has passed them. Thus, r_{T_0} < 0.5.

In the following we will analyze the situation in which the subtrees rooted at safe nodes have already been pruned into leaves. We bound the pruning probability of the tree starting from this initial configuration. Since the bottom-up pruning may already have come to a halt before that situation, the following results actually give too high a probability for pruning. Hence, the following upper bounds are not as tight as possible.

We consider pruning a decision tree by REP as a trial whose result is decided by the set of pruning examples. By Theorem 4 we can approximate the probability that a tree will be pruned into a majority leaf by approximating the probability that all safe nodes get a positive majority or a negative majority. The latter alternative is not very probable under the assumption p > .5. It is safe to assume that it never happens.

We can consider sampling the pruning examples in two phases. First the attribute values are assigned. This decides the leaf into which the example falls. In the second phase we independently assign the class label for the example.

Let the safe nodes of tree T be Z(T) = {z_1, ..., z_k} and let the number of examples in the pruning set S be |S| = n. The number of pruning examples falling to safe node z_i is denoted by n_i, with Σ_{i=1}^{k} n_i = n. For the time being we assume that n_i > 0 for all i. The number of positive examples falling to safe node z_i is the sum of independent Bernoulli variables and, thus, it is binomially distributed with parameters n_i and p. Respectively, the number of negative pruning examples in safe node z_i is X_i ∼ B(n_i, 1 - p). The probability that there is a majority of negative examples in safe node z_i is Pr{X_i > n_i/2}. We can bound this probability from below by using the following inequality (Slud, 1977).

Lemma 5 (Slud's inequality). Let X ∼ B(m, q) be a random variable with q ≤ 1/2.
Then for m(1 - q) ≥ h ≥ mq,

$$\Pr\{X \ge h\} \;\ge\; 1 - \Phi\!\left(\frac{h - mq}{\sqrt{mq(1-q)}}\right).$$

Since p > .5 and the random variable corresponding to the number of negative examples in safe node z_i is X_i ∼ B(n_i, 1 - p), the first condition of Slud's inequality holds. Furthermore, to see that the condition m(1 - q) ≥ h ≥ mq holds in safe node z_i, substitute h = n_i/2, m = n_i, and q = 1 - p to obtain n_i p ≥ n_i/2 ≥ n_i(1 - p). Thus,

$$\Pr\!\left\{X_i > \frac{n_i}{2}\right\} \;\ge\; 1 - \Phi\!\left(\frac{n_i/2 - n_i(1-p)}{\sqrt{n_i\,p(1-p)}}\right) \;=\; 1 - \Phi\!\left(\frac{(p - 1/2)\,n_i}{\sqrt{n_i\,p(1-p)}}\right). \qquad (1)$$

As n_i, the number of pruning instances reaching safe node z_i, grows, the standard normal distribution term in the above bound also grows. Hence, the bound on the probability that the majority of the pruning examples reaching z_i is negative is the smaller the more pruning examples reach the node. The probability of a negative majority is also reduced by a growing probability p of the positive class for an example. Both of these effects are also reflected in the pruning probabilities of the whole tree.

We can now roughly approximate the probability that T will be pruned into a single majority leaf as follows. By Theorem 4, T will be pruned into a leaf if and only if each safe node in T receives a majority of positive examples. Because T has k safe nodes and there are n pruning examples, by the pigeonhole principle at least half of the safe nodes receive at most r = 2n/k examples. Each safe node z_i with n_i ≤ r examples has, by Inequality 1, a negative majority at least with probability

$$1 - \Phi\!\left(\frac{(p - 1/2)\,r}{\sqrt{r\,p(1-p)}}\right).$$

Observe that Inequality 1 also holds, a fortiori, for safe nodes with n_i < r examples. Since the class labels of the examples falling into different safe nodes are assigned independently, the probability that all of the at least k/2 safe nodes receiving at most r examples nevertheless obtain a positive majority, and hence the probability that T is pruned into a single majority leaf, is at most

$$\Phi\!\left(\frac{(p - 1/2)\,r}{\sqrt{r\,p(1-p)}}\right)^{k/2}, \qquad (2)$$

provided that p > .5. In order to obtain tighter bounds, one has to make assumptions about the shape of the tree T and the distribution of examples.

The bound of Equation 2 depends on the size of the decision tree (reflected by k) and on the number (n) and the class distribution (p) of the pruning examples. Keeping the other parameters constant and letting k grow reduces the pruning probability exponentially. If the number of pruning examples grows in the same proportion, so that r = 2n/k stays constant, the pruning probability still falls exponentially. The class distribution of the pruning examples also affects the pruning probability, which is the smaller, the closer p is to the value .5.

5.3 Implications of the Analysis

It has been empirically observed that the size of the decision tree grows linearly with the training set size, even when the trees are pruned (Catlett, 1991; Oates & Jensen, 1997, 1998). The above analysis gives us a possibility to explain this behavior. However, let us first prove that when there is no correlation between the attribute values and the class label of an example, the size of the tree that perfectly fits the training data depends linearly on the size of the sample.

Our setting is as simple as can be. We have only one real-valued attribute x and the class attribute y, whose value is independent of that of x. As before, y has two possible values, 0 and 1. The tree is built using binary splits of a numerical value range; i.e., propositions of the type "x < r" are assigned to the internal nodes of the tree. In this analysis duplicate instances occur with probability 0.

Theorem 6. Let the training examples (x, y) be drawn from a distribution where x is uniformly distributed in the range [0, 1) and y obtains value 1, independent of x, with probability p, and value 0 with probability 1 - p.
Then the expected size of the decision tree that fits the data is linear in the size of the sample.

Proof. Let S = ⟨(x_1, y_1), ..., (x_t, y_t)⟩ be a sample from the distribution described above. We may assume that x_i ≠ x_j when i ≠ j, because the probability of the complementary event is 0. Let us further assume that the examples of S have been indexed so that x_1 < x_2 < ... < x_t. Let A_i be the indicator variable for the event that instances i and i + 1 have different class labels; i.e., y_i ≠ y_{i+1}, 1 ≤ i ≤ t - 1. Then EA_i = Pr{A_i = 1} = p(1 - p) + (1 - p)p = 2p(1 - p), because when the event y_i = 1 has probability p, at the same time the event y_{i+1} = 0 has probability 1 - p, and vice versa. Now the number of class alternations is A = Σ_{i=1}^{t-1} A_i and its expectation is

$$EA \;=\; \sum_{i=1}^{t-1} EA_i \;=\; \sum_{i=1}^{t-1} 2p(1-p) \;=\; 2(t-1)\,p(1-p). \qquad (3)$$

Let T be a decision tree that has been grown on the sample S. The growing has been continued until the training error is 0. Each leaf in T corresponds to a half-open interval [a, b) ⊆ [0, 1). If y_i ≠ y_{i+1}, then x_i and x_{i+1} must fall into different leaves of T, because otherwise one or the other example is falsely classified by T. Thus, the upper boundary b of the interval corresponding to the leaf into which x_i falls must have a value less than x_{i+1}. Repeatedly applying this observation when scanning through the examples from left to right, we see that T must have at least one leaf for x_1 and one leaf for each class alternation; i.e., at least A + 1 leaves in total. By Equation 3 the expected number of leaves in T is thus at least

$$EA + 1 \;=\; 2(t-1)\,p(1-p) + 1.$$

In particular, this is linear in the size |S| = t of the sample. □

The above theorem only concerns zero training error trees built in the first phase of decision tree induction. The empirical observations of Catlett (1991) and Oates and Jensen (1997, 1998), however, concern decision trees that have been pruned in the second phase of induction. We come back to the topic of pruned trees shortly.

Consider how REP is used in practice. There is some amount of (classified) data available from the application domain. Let there be a total of t examples available. Some part α of the data is used for tree growing and the remaining portion 1 - α of it is reserved as the separate pruning set, 0 < α < 1. Quite a common practice is to use two thirds of the data for growing and one third for pruning, or nine tenths for growing and one tenth for pruning when (ten-fold) cross-validation is used. In the decision tree construction phase the tree is fitted to the αt examples as perfectly as possible. If we hypothesize that the previous result holds for noisy real-world data sets, which by empirical evidence would appear to be the case, and that the number of safe nodes also grows linearly with the number of leaves, then the tree grown will contain γt safe nodes for some constant γ > 0. Since the pruning set size is also a linear fraction of the total data set size, the ratio r = 2n/k stays constant in this setting. Hence, by Equation 2, a growing data set size forces the pruning probability to zero, and quite fast at that, because the reduction in the probability is exponential.

5.4 Limitations of the Analysis

Empty subtrees, which do not receive any pruning examples, were left without attention above; we assumed that n_i > 0 for each i.
Empty subtrees, however, decisively affect the analysis; they are automatically pruned away. Unfortunately, one cannot derive a non-trivial upper bound for the number of empty subtrees. In the worst case all pruning examples are routed to the same safe node, which leaves k - 1 empty safe nodes in the tree. Subsequently we review the case where the examples are distributed uniformly to the safe nodes; then better approximations can be obtained.

Even though we assume that each pruning example is positive with a probability higher than .5, there are no guarantees that the majority of all examples is positive. However, the probability that the majority of all examples changes is very small, even negligible, by Chernoff's inequality (Chernoff, 1952; Hagerup & Rüb, 1990) when the number of pruning examples, n, is high and p is not extremely close to one half.

Slud's inequality bounds the probability Pr{X ≥ h}, but above we used it to bound the probability Pr{X > h}. Some continuity correction could be used to compensate for this. In practice, the inexactness does not make any difference.

Even though it would appear that the number of safe nodes increases in the same proportion as that of leaves when the size of the training set grows, we have not proved this result. Theorem 6 essentially uses leaf nodes, and does not lend itself to a modification where safe nodes could be substituted in place of leaves. The relation between the number of safe nodes and leaves in a decision tree depends on the shape of the tree. Hence, the splitting criterion that was used in tree growing decisively affects this relation. Some splitting criteria aim at keeping the produced split as balanced as possible, while others aim at separating small class-coherent subsets from the data (Quinlan, 1986; Mingers, 1989b). For example, the common entropy-based criteria have a bias that favors balanced splits (Breiman, 1996). Using a balanced splitting criterion would seem to imply that the number of safe nodes in a tree depends linearly on the number of leaves in the tree. In that case the above reasoning would explain the empirically observed linear growth of pruned decision trees.

6. Pruning Probability Under Uniform Distribution

We now assume that all n pruning examples have an equal probability to end up in each of the k safe nodes; i.e., a pruning example falls to the safe node z_i with probability 1/k. Contrary to the usual role of a uniform distribution assumption, for our analysis this is not the best case. Here the best distribution of examples into safe nodes would have one pruning example in each of the safe nodes except one, into which all remaining pruning instances would gather. Nevertheless, the uniformity lets us sharpen the general approximation by using standard techniques.

The expected number of examples falling into any safe node is n/k. Let us calculate the expected number of those safe nodes that receive at most cn/k examples, where c is an arbitrary positive constant. Let Q_i be the indicator for the event "safe node z_i receives at most cn/k examples." Then Q = Σ_{i=1}^{k} Q_i is the number of those safe nodes that receive at most cn/k examples. By the linearity of expectation EQ = Σ_{i=1}^{k} EQ_i = k EQ_1, in which the last equality follows from the fact that the Q_i are identically distributed. Let Y_1 be the number of examples reaching safe node z_1.
Because each of the n examples reaches z_1 with probability 1/k independently of the other examples, Y_1 is binomially distributed with parameters n and 1/k. Clearly EQ_1 = Pr{Y_1 ≤ cn/k}. We can approximate the last probability by the normal approximation, from which we obtain

$$\Pr\!\left\{Y_1 \le \frac{cn}{k}\right\} \;\approx\; \Phi\!\left(\frac{cn/k - n/k}{\sqrt{n \cdot (1/k)\,(1 - 1/k)}}\right) \;=\; \Phi\!\left(\frac{(c-1)\,n/k}{\sqrt{(n/k)(1 - 1/k)}}\right).$$

Hence, by the above observation,

$$EQ \;=\; k\,EQ_1 \;\approx\; k\,\Phi\!\left(\frac{(c-1)\,n/k}{\sqrt{(n/k)(1 - 1/k)}}\right). \qquad (4)$$

We now use Approximation 4 to determine the probability that the whole decision tree T will be pruned into a single leaf. Let P be a random variable that represents the number of those safe nodes in T that receive at most cn/k examples and at least one example. If we denote by R the number of empty safe nodes, we have P = Q - R. Hence, EP = E(Q - R) = EQ - ER. The following result (Kamath, Motwani, Palem, & Spirakis, 1994; Motwani & Raghavan, 1995) lets us approximate the number of empty safe nodes when n ≫ k.

Theorem 7. Let Z be the number of empty bins when m balls are thrown randomly into h bins. Then

$$\mu \;=\; EZ \;=\; h\left(1 - \frac{1}{h}\right)^{m} \;\approx\; h\,e^{-m/h}$$

and for λ > 0,

$$\Pr\{|Z - \mu| \ge \lambda\} \;\le\; 2\exp\!\left(-\frac{\lambda^{2}\,(h - 1/2)}{h^{2} - \mu^{2}}\right).$$

By this result the expected number of empty safe nodes is approximately k e^{-n/k}; this number is small when k is relatively small compared to n. Substituting the approximation obtained above for EQ (Equation 4) and using the previous result, we get

$$EP \;=\; EQ - ER \;\approx\; k\left(\Phi\!\left(\frac{(c-1)\,n/k}{\sqrt{(n/k)(1 - 1/k)}}\right) - e^{-n/k}\right).$$

Applying Slud's inequality we can, as before, bound from above the probability that the majority class does not change in a safe node that receives cn/k pruning examples. Since there are P such safe nodes and the class distributions of the examples within them are independent, the event "the majority class does not change in any safe node that receives at least one and at most cn/k examples" has the upper bound

$$\Phi\!\left(\frac{(p - .5)\,r}{\sqrt{r\,p(1-p)}}\right)^{P}, \qquad (5)$$

where r = cn/k. Replacing P with its expected value in this expression gives an approximation for the pruning probability. The approximation is valid if P does not deviate much from its expected value; we consider the deviation of P from its expected value below.

The above upper bound for the pruning probability is similar to the upper bound that was obtained without any assumptions about the distribution of the examples. However, the earlier constant 2 has been replaced by a new, controllable parameter c, and empty subtrees are now explicitly taken into account. If c is chosen suitably, this upper bound is stricter than the one obtained in the general case.

Figure 3: The effect of the parameters p and c on the upper bound of the pruning probability of a tree with 100 safe nodes when 500 pruning examples are used. The curves depicting the 0.25 and 0.5 upper bounds are also shown.

6.1 An Illustration of the Upper Bound

Figure 3 plots the upper bound of the pruning probability of a tree with 100 safe nodes when 500 pruning examples are used. The value of the parameter c varies from 0 to 2 and p varies from 0.5 to 1. We can observe that the surface corresponding to the upper bound stays very close to 0 when the class distribution is not too skewed and when the parameter c does not have a very small value; the short numerical sketch below recomputes this bound for a few parameter settings.
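This is a minimal numerical sketch of that computation. It assumes the forms of Equations 4 and 5 given above, replaces the exponent P by its expectation EP (clipped at zero), and uses the settings of Figure 3 (k = 100 safe nodes, n = 500 pruning examples); the (c, p) grid, the function names, and the clipping are illustrative choices of this sketch, not the authors' plotting code.

```python
from math import erf, exp, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def expected_P(k, n, c):
    # EQ from Equation 4 minus the expected number of empty safe nodes (Theorem 7)
    eq = k * phi((c - 1.0) * (n / k) / sqrt((n / k) * (1.0 - 1.0 / k)))
    er = k * exp(-n / k)
    return max(eq - er, 0.0)

def pruning_bound(k, n, c, p):
    # Equation 5 with the exponent P replaced by its expectation
    r = c * n / k
    return phi((p - 0.5) * r / sqrt(r * p * (1.0 - p))) ** expected_P(k, n, c)

k, n = 100, 500
for c in (0.25, 0.5, 1.0, 1.5):
    print(c, [round(pruning_bound(k, n, c, p), 4) for p in (0.55, 0.65, 0.75, 0.85)])
```

In line with the observations above, the computed bound stays near zero for moderate p and for c around 1, and it grows markedly once p gets close to 1 or c becomes very small.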
When the probability of an example having a positive class label reaches the value 0.75, or when the value of c approaches 0, the upper bound climbs very steeply. At least as regards the parameter c, this is due to the inexactness of the approximation at extreme values.

When the probability p that an example has a positive class approaches 1, the error committed by a single positive leaf falls to 0. Hence, the closer p is to 1, the better the accuracy of a non-trivial pruning has to be for it to beat the majority leaf. Intuitively, the probability that such a pruning exists, i.e., that the root node is not pruned, should drop to zero as p increases. The bound reflects this intuition.

When the value of the parameter c falls close to 0, the safe nodes that are taken into account in the upper bound receive only very few pruning examples. The number of such nodes is small. On the other hand, when c is increased, the number of nodes under consideration grows together with the upper limit on the number of examples reaching each single one of them. Thus, both small and large values of c yield loose bounds. The strictest bounds have the value of c somewhere in "the middle," in our example around the values 1.0-1.5. In the bound of Equation 5 the argument of the cumulative distribution function Φ tends towards zero when the value of c is very small, but at the same time the exponent decreases. The value of Φ approaches 1/2 when its argument goes to zero. On the other hand, when c has a large value, Φ approaches the value 1 and the exponent P also increases.

6.2 On the Exactness of the Approximation

Above we used the expected value of P in the analysis; EP = EQ - ER. We now probe into the deviation of P from its expected value. The deviation of R is directly available from Theorem 7:

$$\Pr\{|R - ER| \ge \lambda\} \;\le\; 2\exp\!\left(-\frac{\lambda^{2}\,(k - 1/2)}{k^{2} - (ER)^{2}}\right).$$

For Q we do not yet have a similar result. In this section we provide one. Let us first recapitulate the definition of the Lipschitz condition.

Definition. Let f : D_1 × ... × D_m → ℝ be a real-valued function with m arguments from possibly distinct domains. The function f is said to satisfy the Lipschitz condition if for any x_1 ∈ D_1, ..., x_m ∈ D_m, any i ∈ {1, ..., m}, and any y_i ∈ D_i,

$$|f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_m) - f(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_m)| \;\le\; 1.$$

Hence, a function satisfies the Lipschitz condition if an arbitrary change in the value of any one argument does not change the value of the function by more than 1. The following result (McDiarmid, 1989) holds for functions satisfying the Lipschitz condition. More general results of the same kind can be obtained using martingales (see, e.g., Motwani & Raghavan, 1995).

Theorem 8 (McDiarmid). Let X_1, ..., X_m be independent random variables taking values in a set V. Let f : V^m → ℝ be such that, for i = 1, ..., m,

$$\sup_{x_1, \ldots, x_m,\, y_i \in V} |f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_m) - f(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_m)| \;\le\; c_i.$$

Then for λ > 0,

$$\Pr\{|f(X_1, \ldots, X_m) - Ef(X_1, \ldots, X_m)| \ge \lambda\} \;\le\; 2\exp\!\left(-\frac{2\lambda^{2}}{\sum_{i=1}^{m} c_i^{2}}\right).$$

Let W_i, i = 1, ..., n, be a random variable such that W_i = j if the i-th example is directed to the safe node z_j. By the uniform distribution assumption the W_i are independent. They have their values in the set {1, ..., k}.
Let us define the function f so that f(w_1, ..., w_n) is the number of those safe nodes that receive at most r = cn/k examples when the i-th example is directed to the safe node z_{w_i}. That is,

$$f(w_1, \ldots, w_n) \;=\; |\{\, i \in \{1, \ldots, k\} : |S_i| \le r \,\}|,$$

where S_i is the set of those examples that are directed to safe node z_i:

$$S_i \;=\; \{\, h \in \{1, \ldots, n\} : w_h = i \,\}.$$

Hence, Q = f(W_1, ..., W_n). Moving any one example from one safe node to another (changing the value of any one argument w_i) can make one more safe node z_i fulfill the condition |S_i| ≤ r, one less safe node fulfill it, or both at the same time. Thus, the value of f changes by at most 1 and the function fulfills the Lipschitz condition. Therefore, we can apply McDiarmid's inequality to it by substituting c_i = 1 and observing that then Σ_{i=1}^{n} c_i² = n:

$$\Pr\{|f(W_1, \ldots, W_n) - Ef(W_1, \ldots, W_n)| \ge \lambda\} \;\le\; 2e^{-2\lambda^{2}/n},$$

or, equally,

$$\Pr\{|Q - EQ| \ge \lambda\} \;\le\; 2e^{-2\lambda^{2}/n}.$$

Unfortunately, this concentration bound is not very tight. Nevertheless, combining the concentration bounds for Q and R we obtain for P the following deviation from its expected value. Since |P - EP| = |Q - R - E(Q - R)| = |Q - EQ + ER - R| ≤ |Q - EQ| + |R - ER|, the event |Q - R - E(Q - R)| ≥ λ implies that |Q - EQ| ≥ λ/2 or |R - ER| ≥ λ/2. Thus,

$$\Pr\{|P - EP| \ge \lambda\} \;=\; \Pr\{|Q - R - E(Q - R)| \ge \lambda\} \;\le\; \Pr\!\left\{|Q - EQ| \ge \frac{\lambda}{2}\right\} + \Pr\!\left\{|R - ER| \ge \frac{\lambda}{2}\right\} \;\le\; 2\exp\!\left(-\frac{\lambda^{2}}{2n}\right) + 2\exp\!\left(-\frac{\lambda^{2}\,(k - 1/2)}{4\,(k^{2} - (ER)^{2})}\right).$$

7. Related Work

Traditional pruning algorithms, like cost-complexity pruning (Breiman et al., 1984), pessimistic pruning (Quinlan, 1987), minimum error pruning (Niblett & Bratko, 1986; Cestnik & Bratko, 1991), critical value pruning (Mingers, 1989a), and error-based pruning (Quinlan, 1993), have already been covered extensively in earlier work (Mingers, 1989a; Esposito et al., 1997; Frank, 2000). Thus we will not touch on these methods any further. Instead, we review some of the more recent work on pruning.

REP produces an optimal pruning of the given decision tree with respect to the pruning set. Other approaches for producing optimal prunings have also been presented (Breiman et al., 1984; Bohanec & Bratko, 1994; Oliver & Hand, 1995; Almuallim, 1996). However, optimality is often measured over the training set. Then it is only possible to maintain the initial accuracy, assuming that no noise is present. Neither is it usually possible to reduce the size of the decision tree without sacrificing classification accuracy. For example, Bohanec and Bratko (1994) studied how to efficiently find the optimal pruning in the sense that the output decision tree is the smallest pruning that satisfies a given accuracy requirement. A somewhat improved algorithm for the same problem was subsequently presented by Almuallim (1996).

The high-level control of Kearns and Mansour's (1998) pruning algorithm is the same bottom-up sweep as in REP. However, the pruning criterion in their method is a kind of cost-complexity condition (Breiman et al., 1984) that takes both the observed classification error and (sub)tree complexity into account. Moreover, their pruning scheme does not require the pruning set to be separate from the training set.
Both Mansour's (1997) and Kearns and Mansour's (1998) algorithms are pessimistic: they try to bound the true error of a (sub)tree by its training error. Since the training error is by nature optimistic, the pruning criterion has to compensate for it by being pessimistic about the error approximation.

Consider yet another variant of REP, one which is otherwise similar to the one analyzed above, with the exception that the original leaves are not given a special status, but can be relabeled by the majority of the pruning examples just like internal nodes. This version of REP produces the optimal pruning with respect to which the performance of Kearns and Mansour's (1998) algorithm is measured. Their pessimistic pruning produces a decision tree that is smaller than that produced by REP.

Kearns and Mansour (1998) are able to prove that their algorithm has a strong performance guarantee. The generalization error of the produced pruning is bounded by that of the best pruning of the given tree plus a complexity penalty. The pruning decisions are local in the same sense as those of REP, and only the basic pruning operation of replacing a subtree with a leaf is used in this pruning algorithm.

8. Conclusion

In this paper the REP algorithm has been analyzed in three different settings. First, we studied the algorithmic properties of REP alone, without assuming anything about the input decision tree or the pruning set. In this setting it is possible to prove that REP fulfills its intended task and produces an optimal pruning of the given tree. The algorithm proceeds to prune the nodes of a branch as long as both subtrees of an internal node are pruned, and it stops immediately if even one subtree is kept. Moreover, it prunes an interior node only if all of the subtrees below it have already been pruned. Furthermore, REP either halts before the safe nodes are reached or prunes the whole tree only in case all safe nodes have the same majority class.

In the second setting the tree under consideration was assumed to fit noise; i.e., it was assumed that the class label of the pruning examples is independent of their attribute values. In this setting the pruning probability of the tree could be bounded by an expression that depends exponentially on the size of the tree and on the number and class distribution of the pruning examples. Thus, our analysis corroborates the main finding of Oates and Jensen (1999) that REP fails to control the growth of a decision tree in the extreme case that the tree fits pure noise. Moreover, our analysis opened a possibility to begin explaining why the learned decision tree grows linearly with an increasing data set. Our bound on the pruning probability of a tree is based on bounding the probability that all safe nodes have the same majority class. Surprisingly, essentially the same property, whose probability we try to bound close to 0, is assumed to hold with probability 1 in the analysis of Oates and Jensen (1999).

In REP it may happen that no pruning examples are directed to a given subtree. Such subtrees have not been taken into account in earlier analyses. In our final analysis we included empty subtrees in the expression for a tree's pruning probability. Taking empty subtrees into account gives a more realistic bound for the pruning probability of a tree.
Unfortunately, one cannot draw very definite general conclusions about two-phased top-down induction of decision trees on the basis of analyses of the rep algorithm, because its bias is quite unique among pruning algorithms. The fact that rep does not penalize the size of a tree, but relies only on the classification error on the pruning examples, makes the method sensitive to small changes in the class distribution of the pruning set. Other decision tree pruning algorithms also have their individual characteristics. Therefore, a unified analysis of decision tree pruning may be impossible.

The version of rep in which one is allowed to relabel the original leaves as well is used as the performance objective in Kearns and Mansour's (1998) pruning algorithm. Thus, the performance of pruning algorithms that use both an error and a size penalty is related to that of algorithms that use only error estimation. In the version of rep used by Kearns and Mansour, our analysis based on safe nodes applies with leaves in place of safe nodes. Hence, for this algorithm the derived bounds are stricter.

We leave the detailed analysis of other important pruning algorithms as future work. Only through such investigation is it possible to disclose the differences and similarities of pruning algorithms; empirical examination has not managed to reveal clear performance differences between the methods. Also, the relationship between the number of safe nodes and the number of leaves of a tree ought to be examined analytically and empirically. In particular, one should study whether the number of safe nodes increases linearly with a growing training set, as conjectured in this paper. A deeper understanding of existing pruning algorithms may help to overcome the problems associated with the pruning phase of decision tree learning.

References

Almuallim, H. (1996). An efficient algorithm for optimal pruning of decision trees. Artificial Intelligence, 83, 347–362.

Bohanec, M., & Bratko, I. (1994). Trading accuracy for simplicity in decision trees. Machine Learning, 15(3), 223–250.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Pacific Grove, CA.

Breiman, L. (1996). Some properties of splitting criteria. Machine Learning, 24(1), 41–47.

Catlett, J. (1991). Overpruning large decision trees. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pp. 764–769, San Mateo, CA. Morgan Kaufmann.

Cestnik, B., & Bratko, I. (1991). On estimating probabilities in tree pruning. In Kodratoff, Y. (Ed.), Machine Learning – EWSL-91: Proceedings of the Fifth European Working Session, Vol. 482 of Lecture Notes in Artificial Intelligence, pp. 138–150, Berlin, Heidelberg, New York. Springer-Verlag.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23(4), 493–507.

Cohen, W. W. (1993). Efficient pruning methods for separate-and-conquer rule learning systems. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 988–994, San Mateo, CA. Morgan Kaufmann.

Esposito, F., Malerba, D., & Semeraro, G. (1993). Decision tree pruning as a search in the state space. In Brazdil, P. B. (Ed.), Machine Learning: ECML-93, Proceedings of the Sixth European Conference, Vol. 667 of Lecture Notes in Artificial Intelligence, pp. 165–184, Berlin, Heidelberg, New York. Springer-Verlag.
Esposito, F., Malerba, D., & Semeraro, G. (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 476–491.

Frank, E. (2000). Pruning Decision Trees and Lists. Ph.D. thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand.

Hagerup, T., & Rüb, C. (1990). A guided tour of Chernoff bounds. Information Processing Letters, 33(6), 305–308.

Helmbold, D. P., & Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1), 51–68.

Holte, R. C., Acker, L., & Porter, B. (1989). Concept learning and the problem of small disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 813–818, San Mateo, CA. Morgan Kaufmann.

Kamath, A., Motwani, R., Palem, K., & Spirakis, P. (1994). Tail bounds for occupancy and the satisfiability threshold conjecture. In Proceedings of the Thirty-Fifth Annual IEEE Symposium on Foundations of Computer Science, pp. 592–603, Los Alamitos, CA. IEEE Press.

Kearns, M., & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In Shavlik, J. (Ed.), Proceedings of the Fifteenth International Conference on Machine Learning, pp. 269–277, San Francisco, CA. Morgan Kaufmann.

Malerba, D., Esposito, F., & Semeraro, G. (1996). A further comparison of simplification methods for decision-tree induction. In Fisher, D., & Lenz, H.-J. (Eds.), Learning from Data: AI and Statistics V, pp. 365–374, Berlin, Heidelberg, New York. Springer-Verlag.

Mansour, Y. (1997). Pessimistic decision tree pruning based on tree size. In Fisher, D. H. (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning, pp. 195–201, San Francisco, CA. Morgan Kaufmann.

McDiarmid, C. J. H. (1989). On the method of bounded differences. In Siemons, J. (Ed.), Surveys in Combinatorics: Invited Papers of the 12th British Combinatorial Conference, pp. 148–188, Cambridge, UK. Cambridge University Press.

Mingers, J. (1989a). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4(2), 227–243.

Mingers, J. (1989b). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4), 319–342.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.

Motwani, R., & Raghavan, P. (1995). Randomized Algorithms. Cambridge University Press, New York.

Niblett, T., & Bratko, I. (1986). Learning decision rules in noisy domains. In Bramer, M. A. (Ed.), Research and Development in Expert Systems III, pp. 25–34, Cambridge, UK. Cambridge University Press.

Oates, T., & Jensen, D. (1997). The effects of training set size on decision tree complexity. In Fisher, D. H. (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning, pp. 254–261, San Francisco, CA. Morgan Kaufmann.

Oates, T., & Jensen, D. (1998). Large datasets lead to overly complex models: An explanation and a solution. In Agrawal, R., Stolorz, P., & Piatetsky-Shapiro, G. (Eds.), Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 294–298, Menlo Park, CA. AAAI Press.

Oates, T., & Jensen, D. (1999). Toward a theoretical understanding of why and when decision tree pruning algorithms fail. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 372–378, Menlo Park, CA / Cambridge, MA. AAAI Press / MIT Press.
Oliver, J. J., & Hand, D. J. (1995). On pruning and averaging decision trees. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 430–437, San Francisco, CA. Morgan Kaufmann.

Pagallo, G., & Haussler, D. (1990). Boolean feature discovery in empirical learning. Machine Learning, 5(1), 71–99.

Pereira, F., & Singer, Y. (1999). An efficient extension to mixture techniques for prediction and decision trees. Machine Learning, 36(3), 183–199.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.

Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, 27(3), 221–248.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Slud, E. V. (1977). Distribution inequalities for the binomial law. The Annals of Probability, 5(3), 404–412.