An Analysis of Reduced Error Pruning

T. Elomaa and M. Kaariainen

Volume 15, 2001

Links to Full Text:

Abstract:
Top-down induction of decision trees has been observed to suffer from the inadequate functioning of the pruning phase. In particular, it is known that the size of the resulting tree grows linearly with the sample size, even though the accuracy of the tree does not improve. Reduced Error Pruning is an algorithm that has been used as a representative technique in attempts to explain the problems of decision tree learning.

In this paper we present analyses of Reduced Error Pruning in three different settings. First we study the basic algorithmic properties of the method, properties that hold independent of the input decision tree and pruning examples. Then we examine a situation that intuitively should lead to the subtree under consideration to be replaced by a leaf node, one in which the class label and attribute values of the pruning examples are independent of each other. This analysis is conducted under two different assumptions. The general analysis shows that the pruning probability of a node fitting pure noise is bounded by a function that decreases exponentially as the size of the tree grows. In a specific analysis we assume that the examples are distributed uniformly to the tree. This assumption lets us approximate the number of subtrees that are pruned because they do not receive any pruning examples.

This paper clarifies the different variants of the Reduced Error Pruning algorithm, brings new insight to its algorithmic properties, analyses the algorithm with less imposed assumptions than before, and includes the previously overlooked empty subtrees to the analysis.



Extracted Text

Journal of Arti034cial Intelligence Research 15 {2001} 163-187 Submitted
10/00; published 09/01
 An Analysis of Reduced Error Pruning Tapio Elomaa elomaa@cs.helsinki.fi
Matti K344344ri344inen matti.kaariainen@cs.helsinki.fi
 Department of Computer Science
 P. O. Box 26 {Teollisuuskatu 23}
 FIN-00014 University of Helsinki, Finland
 Abstract
 Top-down induction of decision trees has been observed to su033er from
the inadequate functioning of the pruning phase. In particular, it is known
that the size of the resulting
 tree grows linearly with the sample size, even though the accuracy of the
tree does not
 improve. Reduced Error Pruning is an algorithm that has been used as a representative
 techniquein attempts to explain the problems of decision tree learning.
In this paper we present analyses of Reduced Error Pruning in three di033erent
settings.
 First we study the basic algorithmic properties of the method, properties
that hold inde-
 pendent of the input decision tree and pruning examples. Then we examine
a situation that
 intuitively should lead to the subtree under consideration to be replaced
by a leaf node,
 one in which the class label and attribute values of the pruning examples
are independent
 of each other. This analysis is conducted under two di033erent assumptions.
The general
 analysis shows that the pruning probability of a node 034tting pure noise
is bounded by a
 function that decreases exponentially as the size of the tree grows. In
a speci034c analysis
 we assume that the examples are distributed uniformly to the tree. This
assumption lets
 us approximate the number of subtrees that are pruned because they do not
receive any
 pruningexamples.
 This paper clari034es the di033erent variants of the Reduced Error Pruning
algorithm,
 brings new insight to its algorithmic properties, analyses the algorithm
with less imposed
 assumptions than before, and includes the previously overlooked empty subtrees
to the
 analysis. 1. Introduction
 Decision tree learning is usually a two-phase process {Breiman, Friedman,
Olshen, & Stone, 1984; Quinlan, 1993}. First a tree re035ecting the given
sample as faithfully as possible is
 constructed. If no noise prevails, the accuracy of the tree is perfect on
the training examples
 that were used to build the tree. In practice, however, the data tends to
be noisy, which
 may introduce contradicting examples to the training set. Hence, 100045
accuracy cannot
 necessarily be obtained even on the training set. In any case, the resulting
decision tree is over034tted to the sample; in addition to the general trends
of the data, itencodes the
 peculiarities and particularities of the training data, which makes it a
poor predictor of the class label of future instances. In the second phase
of induction, the decision tree is pruned
 in order to reduce its dependency on the training data. Pruning aims at
removing from the tree those parts that are likely to only be due to the
chance properties of the training set.
 The problems of the two-phased top-down induction of decision trees are
well-known
 and have been extensively reported {Catlett, 1991; Oates & Jensen, 1997,
1998}. The size
 c
 fl2001 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.Elomaa
& K344344ri344inen of the tree grows linearly with the size of the training
set, even though after a while no
 accuracy is gained through the increased tree complexity. Obviously, pruning
is intended to
 034ght this e033ect. Another defect is observed when the data contains
no relevant attributes; i.e., when the class labels of the examples are independent
of their attribute values. Clearly,
 a single-node tree predicting the ma jority label of the examples should
result in this case,
 since no help can be obtained by querying the attribute values. In practice,
though, often largedecision trees are built from such data.
 Many alternative pruning schemes exist {Mingers, 1989a; Esposito, Malerba,
& Semer- aro, 1997; Frank, 2000}. They di033er, e.g., on whether a single
pruned tree or a series of
 pruned trees is produced, whether a separate set of pruning examples is
used, which aspects
 {classi034cation error and tree complexity} are taken into account in pruning
decisions, how
 these aspects are determined, and whether a single scan through the tree
su036ces or whether
 iterative processing is required. The basic pruning operation that is applied
to the tree
 is the replacement of an internal node together with the subtree rooted
at it with a leaf.
 Also more elaborated tree restructuring operations are used by some pruning
techniques
 {Quinlan,1987, 1993}. In this paper, the only pruning operation that is
considered is the
 replacementof a subtree by the majority leaf, i.e., a leaf labeled by the
ma jority class of the
 examples reaching it. Hence, a pruning of a tre e is a subtree of the original
tree with just
 zero, one, or more internal nodes changed into leaves.
 Re duc e d Error Pruning {subsequently rep for short} was introduced by
Quinlan {1987} in
 the context of decision tree learning. It has subsequently been adapted
to rule set learning as
 well {Pagallo & Haussler, 1990; Cohen, 1993}. rep is one of the simplest
pruning strategies.
 In practical decision tree pruning rep is seldom used, because it has the
disadvantage of requiring a separate set of examples for pruning. Moreover,
it is considered too aggressive a
 pruning strategy that overprunes the decision tree, deleting relevant parts
from it {Quinlan,
 1987; Esposito et al., 1997}. The need for a pruning set is often considered
harmful because
 of the scarceness of the data. However, in the data mining context the examples
are often
 abundant and setting a part of them aside for pruning purposes presents
no problem.
 Despite its shortcomings rep is a baseline method to which the performance
of other
 pruning algorithms is compared {Mingers, 1989a; Esposito,Malerba, & Semeraro,
1993;
 Esposito et al., 1997}. It presents a good starting point for understanding
the strengths
 and weaknesses of the two-phased decision tree learning and o033ers insight
to decision tree
 pruning. rep has the advantage of producing the smallest pruning among those
that are the
 most accurate with respect to the pruning set. Recently, Oates and Jensen
{1999} analyzed
 repin an attempt to explain why and when decision tree pruning fails to
control the growth
 of the tree, even though the data do not warrant the increased size. We
approach the same sub ject, but try to avoid restricting the analysis with
unnecessary assumptions. We also
 consider an explanation for the unwarranted growth of the size of the decision
tree.
 In this paper we analyze rep in three di033erent settings. First, we explore
the basic algo- rithmic properties of rep, which apply regardless of the
distribution of examples presented
 to the learning algorithm. Second, we study, in a probabilistic setting,
the situation in which
 the attribute values are independent of the classi034cation of an example.
Even though this
 pure noise 034tting situation is not expected to arise when the whole pruning
set is considered,
 it is encountered at lower levels of the tree, when all relevant attributes
have already been exhausted. We further assume that all subtrees receive
at least one pruning example, sothat
 164An Analysis of Reduced Error Pruning
 none of them can be directly pruned due to not receiving any examples. The
class value is
 also assigned at random to the pruning examples. In our third analysis it
is assumed that
 each pruning example has an equal chance to end up in any one of the subtrees
of the tree
 being pruned. This rather theoretical setting lets us take into account
those subtrees that
 do not receive any examples. They have been left without attention in earlier
analyses.
 The rest of this paper is organized as follows. The next section discusses
the di033erent
 versions of the rep algorithm and 034xes the one that is analyzed subsequently.
In Section 3 we review earlier analyses of rep. Basic algorithmic properties
of rep are examined in
 Section 4. Then, in Section 5, we carry out a probabilistic analysis of
rep, without making any assumptions about the distribution of examples. We
derive a bound for the pruning
 probability of a tree which depends exponentially on the relation of the
number of pruning
 examples and the size of the tree. Section 6 presents an analysis, which
assumes that the
 pruning examples distribute uniformly to the subtrees of the tree. This
assumption lets us
 sharpen the preceding analysis on certain aspects. However, the bounds of
Section 5 hold with certainty, while those of Section 6 are approximate results.
Further related research is
 brie035y reviewed in Section 7 and, 034nally, in Section 8 we present
the concluding remarks of
 this study.
 2. Reduced Error Pruning Algorithm
 rep was never introduced algorithmically by Quinlan {1987}, which is a source
of much confusion. Even though rep is considered and appears to be a very
simple, almost trivial,
 algorithm for pruning, there are many di033erent algorithms that go under
the same name.
 No consensus exists whether rep is a bottom-up algorithm or an iterative
method. Neither is it obvious whether the training set or pruning set is
used to decide the labels of the leaves
 that result from pruning. 2.1 High-Level Control Quinlan's {1987, p. 225025226}
original description of rep does not clearly specify the pruning
 algorithm and leaves room for interpretation. It includes, e.g., the following
characteriza-
 tions.
 020For every non-leaf subtree S of T we examine the change in misclassi034cations
 over the test set that would occur if S werereplaced by the best possible
leaf.
 If the new tree would give an equal or fewer number of errors and S containsno
subtree with the same property, S is replaced by the leaf. The process continues
 until any further replacements would increase the number of errors over
the test
 set.
 [...] the 034nal tree is the most accurate subtree of the original tree
with respect
 to the test set and the smallest tree with that accuracy.021
 Quinlan {1987, p. 227} also later continues to give the following description.
 020This method [pessimistic pruning] has two advantages. It is much faster
than
 either of the preceding methods [cost-complexity and reduced error pruning]
 since each subtree is examined at most once.021
 165Elomaa & K344344ri344inen On one hand this description requires the
nodes to be processed in a bottom-up manner, since subtrees must be checked
for 020the same property021 before pruning a node but, on the
 other hand, the last quotation would indicate rep to be an iterative method.
We take
 rep to have the following single-scan bottom-up control strategy like in
most other studies
 {Oates & Jensen, 1997, 1998, 1999; Esposito et al., 1993, 1997; Kearns &
Mansour, 1998}.
 Nodes are pruned in a single bottom-up sweep through the decision tree,
prun-
 ing each node is considered as it is encountered. The nodes are processed
in
 postorder.
 By this order of node processing, any tree that is a candidate for pruning
itself cannot contain a subtree that could still be pruned without increasing
the tree's error. Due to the ambiguity of rep's de034nition, a di033erent
version of rep also lives on {Mingers,
 1989a; Mitchell, 1997}. It is probably due to Mingers' {1989} interpretation
ofQuinlan's ambiguous de034nition.
 Nodes are pruned iteratively, always choosing the node whose removal most
 increases the decision tree accuracy over the pruning set. The process continues
 until further pruning is harmful.
 However, this algorithm appears to be incorrect. Esposito et al. {1993,
1997} have shown that a tree produced by this algorithm does not meet the
ob jective of being the most accurate
 subtree with respect to the pruning set. Moreover, this algorithm overlooks
the explicit
 requirement of checking whether a subtree would lead to reduction of the
classi034cation
 error. Other iterative algorithms could be induced from Quinlan's original
description. How- ever, if the explicit requirement of checking whether a
subtree could be pruned before prun-
 ing a supertree is obeyed, then these versions of rep will all reduce to
the more e036cient
 bottom-up algorithm.
 2.2 Leaf Labeling Another source of confusion in Quinlan's {1987} description
of rep is that it is not clearly
 speci034ed how to choose the labels for the leaves that are introduced
to the tree through
 pruning. Oates and Jensen {1999} interpreted that the intended algorithm
would label the new leaves according to the ma jority class of the training
examples, but themselves analyzed a version of the algorithm where the new
leaves obtain as their labels the ma jority of the
 pruning examples. Oates and Jensen motivated their choice by the empirical
observation
 that in practice there is very little di033erence between choosing the
leaf labels in either way. However, choosing the labels of pruned leaves
according to the ma jority of pruning examples
 will set such leaves into a di033erent status than the original leaves,
which have as their label
 the ma jority class of training examples.
 Example Figure 1 shows a decision tree that will be pruned into a single
leaf if the training examples are used to label pruned leaves. A negative
leaf replaces the root of the
 tree and makes two mistakes on the pruning examples, while the original
tree makes three
 mistakes. With this tree we can illustrate an important di033erence in
using training and
 166An Analysis of Reduced Error Pruning
Tapio Elomaa0/10/12/02/0--+-+-+Figure 1: A {part of a} decision tree. The
labels inside the nodes denote the ma jority classes
 of training examples arriving to these nodes. For leaves the numbers of
pruning
 examples from the two classes are also given. x=y means that x negative
and y
 positive instances reach the leaf.
 pruning examples to label pruned leaves. Using training examples and proceeding
bottom- up, observe that neither subtree is pruned, since the left one replaced
with a negative leaf
 would make two mistakes instead of the original one mistake. Similarly,
the right subtree
 replaced with a positive leaf would result in an increased number of classi034cation
errors.
 Nevertheless, the root node 026 even though its subtrees have not been
pruned 026 can still be
 pruned.
 When pruning examples are used to label pruned leaves, anode with two non-trivial
subtrees cannot be pruned unless both its subtrees are collapsed into leaves.
The next
 section will prove this. In the tree of Figure 1 both subtrees would be
collapsed into zero-
 error leaves. However, in this case the root node will not be pruned.
 A further possibility for labeling the leaf nodes would be to take both
training and pruning examples into account in deciding the label of a pruned
leaf. Depending on the
 relation of the numbers of training and pruning examples this strategy resembles
one or the
 other of the above-described approaches. Usually the training examples are
more numerous
 than the pruning examples, and will thus dominate. In practice it is impossible
to discern
 this labeling strategy from that of using the ma jority of training examples.
 2.3 Empty Subtrees
 Since rep uses di033erent sets of examples to construct and to prune a
decision tree, itis
 possible that some parts of the tree do not receive any examples in the
pruning phase. Such parts of the decision tree, naturally, can be replaced
with a single leaf without changing
 the number of classi034cation errors that the tree makes on the pruning
examples. In other
 words, subtrees that do not obtain any pruning examples are always pruned.
Quinlan {1987} already noted that the parts of the original tree that correspond
to rarer special cases, which
 are not represented in the pruning set, may be excised.
 167Elomaa & K344344ri344inen DecisionTree REP{ DecisionTree T, ExampleArray
S }
 { for { i = 0 to S.length-1 } classify{ T, S[i] };
 return prune{ T }; }
 void classify{ DecisionTree T, Example e } { T.total++; if { e.label ==
1 } T.pos++; // update node counters
 if { !leaf{T} }
 if { T.test{e} == 0 } classify{ T.left, e };
 else classify{ T.right, e }; } int prune{ DecisionTree T } // Output classification
error after pruning T { if { leaf{T} }
 if { T.label == 1 } return T.total - T.pos;
 else return T.pos;
 else
 { error = prune{ T.left } + prune{ T.right };
 if { error < min{ T.pos, T.total - T.pos } }
 return error;
 else
 { replace T with a leaf;
 if { T.pos > T.total - T.pos }
 { T.label = 1; return T.total - T.pos; } else
 { T.label = 0; return T.pos; } } } }
 Table 1: The rep algorithm. The algorithm 034rst classi034es the pruning
examples in a top- down pass using method classify and then during a bottom-up
pass prunes the tree using method prune. Intuitively, it is not clear which
is the best-founded strategy for handling empty subtre es,
 those that do not receive any examples. On one hand they obtain support
from the training
 set, which usually is more numerous than the pruning set but, on the other
hand, the fact
 that no pruning example corresponds to these parts of the tree would justify
drawing the
 conclusion that these parts of the decision tree were built by chance properties
of the training
 data. In rep, consistently with preferring smaller prunings also otherwise,
the latter view is adopted.
 The problem of empty subtrees is connected to the problem of smal l disjuncts
in machine learning algorithms {Holte, Acker, & Porter, 1989}. A small disjunct
covers only a small
 number of the training examples. Collectively the small disjuncts are responsible
for a
 small number of classi034cation decisions, but they accumulate most of
the error of the whole
 concept. Nevertheless, small disjuncts cannot be eliminated altogether,
without adversely a033ecting other disjuncts in the concept. 168An Analysis
of Reduced Error Pruning
 2.4 The Analyzed Pruning Algorithm
 Let us brie035y reiterate the details of the rep algorithm that is analyzed
subsequently. As al- ready stated, the control strategy of the algorithm
is the single-sweep bottom-up processing.
 First, a top-down traversal drives the pruning examples through the tree
to the appropriate
 leaves. The counters of the nodes en route are updated. Second, during a
bottom-up traver-
 sal the pruning operations indicated by the classi034cation errors are
executed. The errors
 can be determined on the basis of the node counter values. In the bottom-up
traversal each node is visited only once. The pruned leaves are labeled by
the ma jority of the pruning set
 {see Table 1}.
 3. Previous Work
 Pruning of decision trees has recently received a lot of analytical attention;
existing pruning
 methods have been analyzed {Esposito et al., 1993, 1997; Oates & Jensen,
1997, 1998,
 1999} and new analytically-founded pruning techniques have been developed
{Helmbold & Schapire, 1997; Pereira & Singer, 1999; Mansour, 1997; Kearns&
Mansour, 1998}. Also
 many empirical comparisons of pruning have appeared {Mingers, 1989a; Malerba,
Esposito,
 & Semeraro, 1996; Frank, 2000}. In this section we review earlier work that
concerns the
 repalgorithm. Further related research is considered in Section 7.
 Esposito et al. {1993} viewed the rep algorithm, among other pruning methods,
as a
 search process in the state space. In addition to noting that the iterative
version of rep
 cannot produce the optimal result required by Quinlan {1987}, they also
observed that even though rep is a linear-time algorithm in the size of the
tree, with respect to the height of
 the tree rep requires exponential time in the worst case. In their subsequent
comparative
 analysis Esposito et al. {1997} sketched a proof for Quinlan's {1987} claim
that the pruning produced by rep is the smallest among the most accurate
prunings of the given decision
 tree.
 The bias of rep was brie035y examined by Oates and Jensen {1997, 1998}.
They observed that the error, r
 L , of the best ma jority leaf that could replace a subtree T only depends
on
 {the class distribution of } the examples that reach the root N of T . In
other words, the tree
 structure above T and N decides the error r
 L
 . Let r
 T
 denote the error of the subtree T at
 the moment when the pruning sweep reaches N ; i.e., when some pruning may
already have takenplace in T . All pruning operations performed in T have
led either r
 T
 to decrease from
 the initial situation or to stay unchanged. In any case, pruning that has
taken place in T
 potentially decreases r
 T
 , but does not a033ect r
 L
 . Hence, the probability that r T
 < r
 L 026 i.e.,
 that T will not be pruned 026 increases through pruning in T . This error
propagation bias
 is inherent to rep. Oates and Jensen {1997, 1998} conjecture that the larger
the original tree and the smaller the pruning set, the larger this e033ect,
because a large tree provides
 more pruning opportunities and the high variance of a small pruning set
o033ers more random
 chances for r L
 024 r
 T . Subsequently we study some of these e033ects exactly.
 In a follow-up study Oates and Jensen {1999} used rep as a vehicle for explaining
the problems that have been observed in the pruning phase of top-down induction
of decision trees. They analyzed rep in a situation in which the decision
node under consideration 034ts
 noise 026 i.e., when the class of the examples is independent of the value
of the attribute tested
 in the node at hand 026 and built a statistical model of rep in this situation.
It indicates,
 169Elomaa & K344344ri344inen consistently with their earlier considerations,
that even though the probability of pruning a node that 034ts noise prior
to pruning beneath it is close to 1, pruning that occurs beneath the
 node reduces its pruning probability close to 0. In particular, this model
shows that if even
 one descendant of node N at depth d is not pruned, then N will not be pruned
{assuming there are no leaves until depth d + 1}. The consequence of this
result is that increasing depth
 d leads to an exponential decrease of the node's pruning probability.
 The 034rst part of Oates and Jensen's {1999} analysis is easy to comprehend,
but its signif- icance is uncertain, because this situation does not rise
in any bottom-up pruning strategy.
 The statistical model is based on the assumption that the number, n, of
pruning instances
 that pass through the node under consideration is large, in which case 026
independence as- sumptions prevailing 026 the errors committed by the node
can be approximated by the normal
 distribution. The expected error of the original tree is the mean of the
distribution, while, if
 pruned to a leaf, the tree would misclassify a proportion of the n examples
that corresponds
 to that of the minority class. Oates and Jensen show that the latter number
is always less
 than the mean of the standard distribution of errors. Hence, the probability
of pruning is
 over0.5 and approaches 1 as n grows. In the second part of the analysis,
in considering the pruning probability of a node N
 after pruning has taken place beneath it, Oates and Jensen assume that the
proportion of
 positive examples in any descendant of N at depth d is the same as in N
. In this setting,
 assuming further that N has a positive ma jority, all its descendants at
level d also have a
 positive ma jority. It directly follows that if all descendants at level
d are pruned, they are
 all replaced by a positive leaf. Hence, the function represented by this
pruning is identically
 positive. The ma jority leaf that would replace N also represents the same
function and is
 smallerthan the above pruning. Therefore, rep will choose the single leaf
pruning. On the
 other hand, if one or more of the descendants of N at depth d are not pruned,
then the
 pruning of the tree rooted at N , in which these subtrees are maintained
and all other nodes
 at level d are pruned into positive leaves, is more accurate than the ma
jority leaf. In this
 casethe tree will not be pruned.
 Oates and Jensen {1999} also assume that starting from any node at level
d the proba-
 bility of routing an example to a positive leaf is the same. In the following
analyses we try to rid all unnecessary assumptions; the same results can
be obtained without any knowledge of the example distribution.
 4. Basic Properties of rep Beforegoing to the detailed probabilistic analysis
of the rep algorithm, we examine some
 of its basic algorithmic properties. Throughout this paper we review the
binary case for
 simplicity. The results, however, alsoapply with many-valued attributes
and several classes. Now that the processing control of the rep algorithm
has been settled, we can actually
 prove Quinlan's {1987} claim of the optimality of the pruning produced by
rep. Observe
 that the following result holds true independent of the leaf labeling strategy.
 Theorem 1 Applying rep with a set of pruning examples, S , to a decision
tre e T pro duc es T
 0
 026 a pruning of T 026 such that it is the smal lest of those prunings
of T that have minimal
 error with resp e ct to the example set S .
 170An Analysis of Reduced Error Pruning
 Proof We prove the claim by induction over the size of the tree. Observe
that a decision
 tree T is a full binary tree, which has 2L{T } 000 1 nodes, where L{T }
is the number of leaves
 in the tree.
 Base case. If L{T } = 1, then the original tree T consists of a single leaf
node. T is
 the only possible pruning of itself. Thus, it is, trivially, also the smallest
among the most accurate prunings of T .
 Inductive hypothesis. The claim holds when L{T } < k.
 Inductive step. L{T } = k. Let N be the root of the tree and T
 0
 and T
 1
 the left and the right subtree, respectively. Subtrees T
 0
 and T
 1
 must have strictly less than k leaves. When the pruning decision for N is
taken, then 026 by the bottom-up recursive control strategy of
 rep 026 T
 0
 and T
 1
 have already been processed by the algorithm. By the inductive hypothesis,
the subtrees after pruning, T 0
 0
 and T 0
 1
 , are the smallest possible among the most accurate
 prunings of these trees.
 {i}: Ac cur acy. The pruning decision for the node N consists of choosing
whether to collapse
 N and the tree rooted at it into a ma jority leaf, or whether to maintain
the whole
 tree. If both alternatives make the same number of errors, then N is collapsed
and the
 original accuracy with respect to the pruning set is retained. Otherwise,
by the rep algorithm, the pruning decision is based on which of the resulting
trees would make less errors with respect to the pruning set S . Hence, whichever
choice is made, the
 resulting tree T 0
 will make the smaller number of errors with respect to S .
 Let us now assume that a pruning T
 00
 of T makes even less errors with respect to S
 than T
 0 . Then T
 00 must consist of the root N and two subtrees T
 00
 0
 and T 00
 1
 , because the ma jority leaf cannot be more accurate than T
 0
 . Since T
 00
 is a more accurate pruning of
 T than T 0
 , it must be that either T
 00
 0
 is a more accurate pruning of T
 0 than T
 0
 0 or T
 00
 1 is
 a more accurate pruning of T
 1
 than T 0
 1
 . By the inductive hypothesis both possibilities
 are false. Therefore, T 0
 is the most accurate pruning of T .
 {ii}: Size. To see that the chosen alternative is also as small as possible,
034rst assume that T
 0
 consists of a single leaf. Such a tree is the smallest pruning of T , and
in this case the
 claim follows. Otherwise, T
 0
 consists of the root node N and the two pruned subtrees
 T
 0
 0
 and T
 0
 1
 . Since this tree was not collapsed, the tree must be more accurate than
the tree consisting of a single ma jority leaf. Now assume that there exists
a pruning T
 003
 ofT that is as accurate as T 0
 , but smaller. Because the ma jority leaf is less accurate than T
 0
 , T
 003
 must consist of the root node N and two subtrees T
 003
 0
 and T
 003
 1 . Then, either
 T 003
 0
 is a smaller pruning of T
 0
 than T
 0
 0
 , but as accurate, or T
 003
 1
 is a smaller pruning of T
 1
 than T 0
 1
 , but as accurate. Both cases contradict the inductive hypothesis. Hence,
 T
 0 is the smallest among the most accurate prunings of T .
 Thus, in any case, the claim follows for T . 2
 We consider next the situation in an internal node of the tree, when the
bottom-up
 pruning sweep reaches the node. From now on we are committed to leaf labeling
by the
 ma jority of the pruning examples.
 171Elomaa & K344344ri344inen Theorem 2 An internal node, which prior
to pruning had no leaves as its children, wil l not
 be pruned by rep if it has a non-trivial subtre e when the bottom-up pruning
sweep re aches
 it.
 Proof For an internal node N we have two possible cases in which it has
non-trivial
 subtrees; either both its subtrees are non-trivial {non-leaf } or one of
them is trivial. Let us
 review these cases.
 Let r
 T
 denote the error of {sub}tree T with respect to the part of the pruning
set that
 reaches the root of T . By r
 L we denote the misclassi034cation rate of the ma jority leaf L that
 would replace T , if T was chosen to be pruned. Case I: Let the two subtrees
of T , T
 0
 and T
 1
 , be non-trivial. Hence, both of them have been retained when the pruning
sweep has passed them. Thus, r
 T 0
  0:5.
 In the following we will analyze the situation in which the subtrees rooted
at safe nodes
 have already been pruned into leaves. We bound the pruning probability of
the tree starting
 from this initial con034guration. Since the bottom-up pruning may already
have come to
 ahalt before that situation, the following results actually give too high
a probability for
 pruning. Hence, the following upper bounds are not as tight as possible.
 175Elomaa & K344344ri344inen We consider pruning a decision tree by rep
as a trial whose result is decided by the set
 of pruning examples. By Theorem 4 we can approximate the probability that
a tree will
 be pruned into a ma jority leaf by approximating the probability that al
l safe nodes get a
 positive ma jority or a negative ma jority. The latter alternative is not
very probable under
 the assumption p > :5. It is safe to assume that it never happens.
 We can consider sampling the pruning examples in two phases. First the attribute
values
 are assigned. This decides the leaf into which the example falls. In the
second phase we independently assign the class label for the example.
 Let the safe nodes of tree T be Z {T } = fz
 1
 ; : : : ; z
 k
 g and let the number of examples in the pruning set S be jS j = n. The number
of pruning examples falling to a safe node z
 i
 is denoted by n
 i ;
 P
 k
 i=1
 n
 i
 = n. For the time being we assume that n
 i
 > 0 for all i. The number of positive examples falling to safe node z
 i
 is the sum of independent Bernoulli variables
 and, thus, it is binomially distributed with parameters n
 i
 and p. Respectively, the number of negative pruning examples in safe node
z
 i
 is X
 i
 030 B {n i
 ; 1 000 p}. The probability that there is a ma jority of negative examples
in safe node z
 i
 is Prf X
 i > n
 i
 =2 g. We can bound this probability from below by using the following inequality
{Slud, 1977}. Lemma 5 {Slud's inequality} Let X 030 B {m; q} be a random
variable with q 024 1=2. Then
 for m{1 000 q} 025 h 025 mq,
 Prf X 025 h g 025 1 000 010
  
 h 000 mqpmq{1 000 q} !
 :
 Since p > :5 and the random variable corresponding to the number of negative
examples
 in safe node z i
 is X
 i 030 B {n
 i
 ; 1000p}, the 034rst condition of Slud's inequality holds. Furthermore,
 to see that condition m{1 000 q} 025 h 025 mq holds in safe node z
 i
 substitute h = n
 i
 =2, m = n
 i ,
 and q = 1 000 p to obtain n
 i
 p 025 n
 i
 =2 025 n
 i
 {1000 p}. Thus,
 Pr
 032 X
 i
 >
 n
 i2 033
 025 1 000 010
  
 n
 i
 =2 000 n
 i
 {1 000 p}pn
 i
 p{1 000 p}
 !
 = 1000 010
  
 {p 000 1=2}n ipn
 i
 p{1 000 p}
 ! : {1}
 As n i
 , the number of pruning instances reaching safe node z i
 , grows, then the standard normal distribution term in the above bound also
grows. Hence, the bound on the probability
 that the ma jority of the pruning examples reaching z i
 is negative is the smaller the more
 pruning examples reach it. The probability of a negative ma jority also
reduces through the
 growing probability of positive class for an example, p. These both are
also re035ected in the
 pruning probabilities of the whole tree.
 We can now roughly approximate the probability that T will be pruned into
a single
 ma jority leaf as follows. By Theorem 4, T will be pruned into a leaf if
and only if each safe
 node in T receives a ma jority of positive examples. Because T has k safe
nodes and there aren pruning examples, then according to the pigeon-hole
principle at least half of the safe
 nodes receive at most r = 2n=k examples. Each safe node z
 i
 with n i
 024 r examples has, by
 Inequality 1, a negative ma jority at least with probability 1 000 010
   {p 000 1=2}rprp{1 000 p} !
 :
 176An Analysis of Reduced Error Pruning
 Observe that Inequality 1 also holds when n
 i  :5. In
 order to obtain tighter bounds, one has to make assumptions about the shape
of the tree T
 and the distribution of examples. The bound of Equation 2 depends on the
size of the decision tree {re035ected by k}, the
 number {n} and the class distribution {p} of the pruning examples. Keeping
other parameters
 constant and letting k grow reduces the pruning probability exponentially.
If the number of pruning examples grows in the same proportion so that r
= 2n=k staysconstant, the
 pruning probability still falls exponentially. Class distribution of the
pruning examples also
 a033ects the pruning probability which is the smaller, the closer p is
to value .5.
 5.3 Implications of the Analysis
 It has been empirically observed that the size of the decision tree grows
linearly with the training set size, even when the trees are pruned {Catlett,
1991; Oates & Jensen, 1997, 1998}. The above analysis gives us a possibility
to explain this behavior. However, let us
 034rst prove that when there is no correlation between the attribute values
and the class label
 of an example, the size of the tree that perfectly 034ts the training data
depends linearly on
 the size of the sample. Our setting is as simple as can be. We only have
one real-valued attribute x and the class
 attribute y, whose value is independent of that of x. As before, y has two
possible values, 0 and 1. The tree is built using binary splits of a numerical
value range; i.e., propositions
 of type 020 x < r021 are assigned to the internal leaves of the tree.
In this analysis duplicate
 instances occur with probability 0.
 Theorem 6 Let the training examples {x; y} be drawn from a distribution,
where x is uni-
 formly distributed in the range [0; 1} and y obtains value 1, independent
of x, with prob ability p, and value 0 with prob ability 1 000 p. Then the
expe cte d size of the decision tre e that 034ts the data is linear in the
size of the sample.
 Proof Let S = h{x
 1
 ; y
 1
 }; : : : ; {x
 t
 ; y t
 }i be a sample of the above described distribution. We
 may assume that x i
 6= x
 j
 , when i 6= j , because the probability of the complement event is 0.
 Let us, further, assume that the examples of S have been indexed so that
x
 1
 < x 2
 < : : : < x t
 .
 Let A i
 be the indicator variable for the event that instances i and i + 1 have
di033erent class
 labels; i.e., y
 i
 6= y
 i+1
 , 1 024 i 024 t0001. Then EA i
 = Prf A i
 = 1 g = p{1000p}+{1000p}p = 2p{1000p},
 177Elomaa & K344344ri344inen because when the event y i
 = 1 has probability p, at the same time the event y
 i+1
 =0 has probability 1 000 p, and vice versa. Now the number of class alternations
is A =
 P t0001
 i=1 A
 i
 and
 its expectation is
 EA =
 t0001
 X i=1
 EA
 i
 =
 t0001 X
 i=1
 2p{1 000 p} = 2p{1000 p}
 t0001
 X
 i=1 1 = 2{t000 1}p{1 000 p}: {3} Let T be a decision tree that has been
grown on the sample S . The growing has been
 continued until the training error is 0. Each leaf in T corresponds to a
half open interval
 [a; b} in [0; 1}. If y
 i 6= y
 y+1 , then x
 i
 and x
 i+1
 must fall into di033erent leaves of T , because
 otherwise one or the other example is falsely classi034ed by T . Thus,
the upper boundary b of
 the interval corresponding to the leaf into which x
 i
 falls in must have a value less than x
 i+1 .
 Repetitively applying this observation when scanning through the examples
from left to
 right, we see that T must at least have one leaf for x
 1
 and one leaf for each class alternation;
 i.e., A + 1 leaves in total. By using Equation 3 we see that the expected
number of leaves
 in T is
 EA + 1 = 2{t000 1}p{1 000 p} + 1:
 In particular, this is linear in the size of the sample S ; jS j = t. 2
 The above theorem only concerns zero training error trees built in the 034rst
phase of
 decision tree induction. The empirical observations of Catlett {1991} and
Oates and Jensen
 {1997, 1998}, however, concern decision trees that have been pruned in the
second phase of
 induction. We come back to the topic of pruned trees shortly.
 Consider how rep is used in practice. There is some amount of {classi034ed}
data available
 from the application domain. Let there be a total of t examples available.
Some part ff of the data is used for tree growing and the remaining portion
1 000 ff of it is reserved as the
 separate pruning set; 0< ff < 1. Quite a common practice is to use two thirds
of the data
 for growing and one third for pruning or nine tenths for growing and one
tenth for pruning
 when {ten-fold} cross-validation is used. In the decision tree construction
phase the tree is
 034tted to the fft examples as perfectly as possible. If we hypothesize
that the previous result
 holds for noisy real-world data sets, which by empirical evidence would
appear to be the
 case, and that the number of safe nodes also grows linearly with the number
of leaves, then the tree grown will contain fl t safe nodes, where fl > 0.
Since the pruning set size also is
 a linear fraction of the training set size, the ratio r = 2n=k staysconstant
in this setting.
 Hence, by Equation 2, the growing data set size forces the pruning probability
to zero, even
 quite fast, because the reduction in the probability is exponential.
 5.4 Limitations of the Analysis
 Empty subtrees, which do not receive any pruning examples, were left without
attention
 above; we assumed that n
 i
 > 0 for each i. Empty subtrees, however, decisivelya033ect the
 analysis; they are automatically pruned away. Unfortunately, one cannot
derive a non-trivial upper bound for the number of empty subtrees. In the
worst case all pruning examples are
 routed to the same safe node, which leaves k 000 1 empty safe nodes to
the tree. Subsequently
 we review the case where the examples are distributed uniformly to the safe
nodes. Then
 better approximations can be obtained.
 178An Analysis of Reduced Error Pruning
 Even though we assume that each pruning example is positive with a higher
probability
 than .5, there are no guarantees that the ma jority of all examples is positive.
However, the probability that the ma jority of all examples changes is very
small, even negligible, by
 Cherno033 's inequality {Cherno033, 1952; Hagerup & R374b, 1990} when
the number of pruning
 examples, n, is high and p is not extremely close to one half.
 Slud's inequality bounds the probability Prf X 025 h g, but above we used
it to bound
 the probability Prf X > h g. Some continuity correction could be used to
compensate this.
 In practice, the inexactness does not make any di033erence.
 Even though it would appear that the number of safe nodes increases in the
same pro- portion as that of leaves when the size of the training set grows,
we have not proved this
 result. Theorem 6 essentially uses leaf nodes, and does not lend itself
to modi034cation, where safe nodes could be substituted in place of leaves.
 The relation between the number of safe nodes and leaves in a decision tree
depends on
 the shape of the tree. Hence, the splitting criterion that was used in tree
growing decisively a033ects this relation. Some splitting criteria aim at
keeping the produced split as balanced as
 possible, while others aim at separating small class coherent subsets from
the data {Quinlan,
 1986; Mingers, 1989b}. For example, the common entropy-based criteria have
a bias that
 favors balanced splits {Breiman, 1996}. Using a balanced splitting criterion
would seem to imply that the number of safe nodes in a tree depends linearly
on the number of leaves in
 the tree. In that case the above reasoning would explain the empirically
observed linear growth of pruned decision trees. 6. Pruning Probability Under
Uniform Distribution
 We now assume that all n pruning examples have an equal probability to end
up in each
 of the k safe nodes; i.e., a pruning example falls to the safe node z
 i with probability 1=k. Contrary to the normal uniform distribution assumption
analysis, for our analysis this is not the best case. Here the best distribution
of examples into safe nodes would have one pruning
 example in each of the safe nodes except one, into which all remaining pruning
instances
 would gather. Nevertheless, the uniformity lets us sharpen the general approximation
by
 using standard techniques.
 The expected number of examples falling into any safe node is n=k. Let us
calculate the expected number of those safe nodes that receive at most cn=k
examples, where c is an
 arbitrary positive constant. Let Q
 i
 be the indicator for the event 020safe node z
 i
 receives at most cn=k examples.021 Then Q =
 P
 k i=1
 Q
 i is the number of those safe nodes that receive less
 than cn=k examples. By the linearity of expectation EQ =
 P
 k i=1
 EQ
 i
 = kEQ
 1
 , in which
 the last equality follows from the fact that the Q
 i
 -s are identically distributed.
 Let Y
 1 be the number of examples reaching safe node z
 1
 . Because each of the n exam- ples reaches z
 1 with probability 1=k independent of the other examples, Y
 1 is binomially
 distributed with parameters n and 1=k. Clearly EQ
 1
 = Prf Y
 1 024 cn=k g. We can approxi-
 mate the last probability by the normal approximation, from which we obtain
 Pr
 032
 Y
 1
 024
 cnk
 033 031 010
  
 cn=k 000n=kpn 001 1=k 001{1 000 1=k}
 !
 = 010  
 {c000 1}n=kpn=k{1 000 1=k}
 !
 :
 179Elomaa & K344344ri344inen Hence, by the above observation,
 EQ = kEQ
 1
 031 k010  
 {c000 1}n=kpn=k{1 000 1=k}
 !
 : {4} We now use Approximation 4 to determine the probability that the whole
decision tree
 T will be pruned into a single leaf. Let P be a random variable that represent
the number ofthose safe nodes in T that receive at most cn=k examplesand
at least one example. If we
 denote by R the number of empty safe nodes, we have P = Q 000 R. Hence,
EP = E{Q 000 R} = EQ 000 ER. The following result {Kamath, Motwani, Palem,
& Spirakis, 1994; Motwani & Raghavan,
 1995} lets us approximate the number of empty safe nodes when n 035 k.
 Theorem 7 Let Z be the number of empty bins when m bal ls are thrown randomly
into h
 bins. Then
 026 = EZ = h 022
 1000
 1h
 
 m
 031 he 000m=h
 and for 025 > 0,
 Prf jZ 000 026j 025 025 g 024 2 exp
  
 000
 025
 2
 {h 000 1=2}h 2
 000 026
 2
 !
 :
 By this result the expected number of empty safe nodes is approximately
ke 000n=k
 ; this
 number is small when k isrelatively small compared to n.
 Substituting the above obtained approximation for EQ {Equation 4} and using
the pre- vious result, we get
 EP = EQ 000 ER 031 k
  
 010
  
 {c000 1}n=kpn=k{1 000 1=k}
 !
 000 e
 000n=k
 ! :
 Applying Slud's inequality we can, as before, bound from above the probability
that
 the ma jority class does not change in a safe node that receives cn=k pruning
examples.
 Since there are P such safe nodes and the class distribution of examples
within them is
 independent, the event 020ma jority class does not change in any safe node
that receives at
 least one and at most cn=k examples021 has the upper bound
  
 010
  
 {p 000 :5}rprp{1 000 p}
 !! P
 ; {5}
 where r = cn=k. Replacing P with its expected value in this equation we
have an approxi- mation for the pruning probability. This approximation isvalid
if P does not deviate a lot from its expected value. We consider the deviation
of P from its expected value below. The above upper bound for the pruning
probability is similar to the upper bound that
 was obtained without any assumptions about the distribution of the examples.
However, the
 earlier constant 2 has been replaced by a new, controllable parameter c,
and empty subtrees are now explicitly taken into account. If c is chosen
suitably, this upper bound is more strict than the one obtained in the general
case. 180An Analysis of Reduced Error Pruning
atendUpper bound for the pruning probability     0.5    0.250.511.50.50.60.70.80.900.51cpFigure
3: The e033ect of parameters p and c on the upper bound of the pruning probability
 of a tree with 100 safe nodes when 500 pruning examples are used. The curves
 depictingthe 0.25 and 0.5 upper bounds are also shown.
 6.1 An Illustration of the Upper Bound Figure 3 plots the upper bound of
the pruning probability of a tree with 100 safe nodes when
 500 pruning examples are used. The value of the parameter c varies from
0 to 2 and p varies
 from 0.5 to 1. We can observe that the surface corresponding to the upper
bound stays very close to 0 when the class distribution is not too skewed
and when the parameter c does not
 have a very small value. When the probability of an example having a positive
class label hits value 0.75 or the value of c approaches 0, the upper bound
climbs very steeply. At least
 on the part of the parameter c this is due to the inexactness of the approximation
on the
 extreme values. When the probability p that an example has a positive class
approaches 1, the error
 committed by a single positive leaf falls to 0. Hence, the accuracy of a
non-trivial pruning
 has to be better, the closer p is to 1 for it to beat the ma jority leaf.
Intuitively, theprobability
 that such a pruning exists 026 i.e., that the root node is not pruned 026
should drop to zero as
 p increases. The bound re035ects this intuition.
 Whenthe value of parameter c falls close to 0, the safe nodes that are taken
into account in the upper bound only receive very few pruning examples. The
number of such nodes is
 181Elomaa & K344344ri344inen small. On the other hand, when c is increased,
the number of nodes under consideration
 grows together with the upper limit on the number of examples reaching each
single one of
 them. Thus, both small and large values of c yield loose bounds. In the
strictest bounds
 the value of c is somewhere in 020the middle,021 in our example around
values 1.00251.5. In the
 boundof Equation 5 the argument of the cumulative distribution function
010 tends towards
 zero when the value of c is very small, but at the same time the exponent
decreases. The
 value of 010 approaches 1/2, when its argument goes to zero. On the other
hand, when c has
 a large value, 010 approaches value 1 and the exponent P also increases.
 6.2 On the Exactness of the Approximation Above we used the expected value
of P in the analysis; EP = EQ 000 ER. We now probe into the deviation of
P from its expected value. The deviation of R is directly available from
Theorem 7:
 Prf jR 000 ERj 025 025 g 024 2 exp
   000
 025
 2 {k 0001=2}k
 2 000 E
 2
 R !
 :
 For Q we do not have a similar result yet. In this section we provide one.
Let us 034rst recapitulate the de034nition of the Lipschitz condition.
 De034nition Let f : D 1
 002 001 001 001 002 D
 m
 ! IR be a real-valued function with m arguments from
 possibly distinct domains. The function f is said to satisfy the Lipchitz
condition if for any
 x
 1 2 D
 1
 ; : : : ; x
 m
 2 D
 m , any i2 f1; : : : ; mg, and any y
 i
 2 D
 i
 ,
 jf {x
 1
 ; : : : ; x i0001
 ; x i
 ; x
 i+1 ; : : : ; x
 m } 000 f {x 1
 ; : : : ; x i0001
 ; y i
 ; x
 i+1 ; : : : ; x
 m }j 024 1:
 Hence, a function satis034es the Lipschitz condition if an arbitrary change
in the value of
 any one argument does not change the value of the function more than 1.
 The following result {McDiarmid, 1989} holds for functions satisfying the
Lipschitz con- dition. More general results of the same kind can be obtained
using martingales {see e.g.,
 {Motwani & Raghavan, 1995}).
 Theorem 8 {McDiarmid} Let X
 1
 ; : : : ; X
 m
 be independent random variables taking values
 in a set V . Let f : V m
 ! IR be such that, for i = 1; : : : ; m:
 sup
 x
 1
 ;:::;x
 m
 ;y
 i
 2V
 jf {x
 1
 ; : : : ; x
 i0001
 ; x
 i
 ; x
 i+1 ; : : : ; x
 m } 000 f {x 1
 ; : : : ; x i0001
 ; y i
 ; x
 i+1 ; : : : ; x
 m }j 024 c
 i :
 Then for 025 > 0,
 Prf jf {X
 1
 ; : : : ; X
 m
 } 000 Ef {X
 1
 ; : : : ; X
 m
 }j 025 025 g 024 2 exp
  
 000
 2025
 2P
 m
 i=1 c
 2
 i
 !
 :
 Let W i
 , i = 1; : : : ; n, be a random variable such that W
 i = j if the i-th example is
 directed to the safe node z
 j
 . By the uniform distribution assumption W
 i -s are independent.
 They have their values within the set f1; : : : ; kg. Let us de034ne the
function f so that
 f {w 1
 ; : : : ; w n
 } is the number of those safe nodes that receive at most r =cn=k examples,
 when the i-th example is directed to the safe node z
 w
 i
 . That is,
 f {w
 1
 ; : : : ; w n
 } = jf i 2 f 1; : : : ; k g j jS
 i
 j 024 r gj;
 182An Analysis of Reduced Error Pruning
 where S i
 is the set of those examples that are directed to safe node z
 i
 ;
 S
 i = f h 2 f 1; : : : ; n g j w
 h
 = i g:
 Hence, Q = f {W 1
 ; : : : ; W n
 }. Moving any one example from one safe node to another {chang-
 ing the value of any one argument w
 i }, can change one more safe node z
 i
 to ful034ll the con- dition jS
 i
 j 024 r, one less safe node to ful034ll it, or both at the same time.
Thus, the value
 of f changes by at most 1. Hence, the function ful034lls the Lipschitz
condition. Therefore,
 we can apply McDiarmid's inequality to it by substituting c
 i
 = 1 and observing that then
 P
 n
 i=1
 c
 2
 i
 = n:
 Prf jf {W
 1 ; : : : ; W
 n } 000 Ef {W
 1
 ; : : : ; W n
 }j 025 025 g 024 2e
 0002025
 2
 =n ;
 or equally
 Prf jQ 000 EQj 025 025 g 024 2e
 0002025
 2
 =n
 :
 Unfortunately, this concentration bound is not very tight. Nevertheless,
combining the concentration bounds for Q and R we have for P the following
deviation from its expected value.
 Since jP 000 EP j = jQ 000 R 000 E{Q 000 R}j = jQ 000 EQ + ER 000
Rj 024 jQ000 EQj + jR 000 ERj,
 jQ 000 R 000 E{Q 000 R}j 025 025 implies that jQ 000 EQj 025 025=2
or jR 000 ERj 025 025=2. Thus, Prf jP 000 EP j 025 025 g = Prf jQ 000
R 000 E{Q 000 R}j 025 025 g
 024 Pr 032
 jQ 000 EQj 025
 0252
 033 + Pr
 032
 jR 000 ERj 025 0252
 033
 024 2 exp  
 000
 025 22n !
 + 2 exp
  
 000
 025
 2
 {k 0001=2}4{k 2
 000 E
 2 R}
 !
 : 7. Related Work
 T raditional pruning algorithms 026 like cost-complexity pruning {Breiman
et al., 1984}, pes-
 simisticpruning {Quinlan, 1987}, minimum error pruning {Niblett & Bratko,
1986; Cestnik & Bratko, 1991}, critical value pruning {Mingers, 1989a}, and
error-based pruning {Quinlan,
 1993} 026 have already been covered extensively in earlier work {Mingers,
1989a; Esposito
 et al., 1997; Frank, 2000}. Thus we will not touch on these methods any
further. Instead,
 we review some of the more recent work on pruning.
 rep produces an optimal pruning of the given decision tree with respect
to the pruning
 set. Other approaches for producing optimal prunings have also been presented
{Breiman
 et al., 1984; Bohanec & Bratko, 1994; Oliver & Hand, 1995; Almuallim, 1996}.
However,
 often optimality is measured over the training set. Then it is only possible
to maintain the initial accuracy, assuming that no noise is present. Neither
is it usually possible to reduce
 the size of the decision tree without sacri034cing the classi034cation
accuracy. For example, in
 the work of Bohanec and Bratko {1994} it was studied how to e036ciently
034nd the optimal
 pruning in the sense that the output decision tree is the smallest pruning
which satis034es a given accuracy requirement. A somewhat improved algorithm
for the same problem was
 presented subsequently by Almuallim {1996}.
 183Elomaa & K344344ri344inen The high level control of Kearns and Mansour's
{1998} pruning algorithm is the same bottom-up sweep as in rep. However,
the pruning criterion in their method is a kind of a
 cost-c omplexity condition {Breiman et al., 1984} that takes both the observed
classi034cation
 error and {sub}tree complexity into account. Moreover, their pruning scheme
does not
 require the pruning set to be separate from the training set. Both Mansour's
{1997} and
 Kearnsand Mansour's {1998} algorithms are pessimistic : they try to bound
the true error
 of a {sub}tree by its training error. Since the training error is by nature
optimistic, the
 pruning criterion has to compensate it by being pessimistic about the error
approximation.
 Consider yet another variant of rep, one which is otherwise similar to the
one analyzed above, with the exception that the original leaves are not put
to a special status, but can
 berelabeled by the ma jority of the pruning examples just like internal
nodes. This version of rep produces the optimal pruning with respect to which
the performance of Kearns and
 Mansour's{1998} algorithm is measured. Their pessimistic pruning produces
a decision tree
 that is smaller than that produced by rep.
 Kearns and Mansour {1998} are able to prove that their algorithm has a strong
perfor-
 mance guarantee. The generalization error of the produced pruning is bounded
by that of
 the best pruning of the given tree plus a complexity penalty. The pruning
decisions are
 local in the same sense as those of rep and only the basic pruning operation
of replacing a
 subtree with a leaf is used in this pruning algorithm.
8. Conclusion

In this paper the rep algorithm has been analyzed in three different settings. First, we studied the algorithmic properties of rep alone, without assuming anything about the input decision tree or the pruning set. In this setting it is possible to prove that rep fulfills its intended task and produces an optimal pruning of the given tree. The algorithm proceeds to prune the nodes of a branch as long as both subtrees of an internal node are pruned, and it stops immediately if even one subtree is kept. Moreover, it prunes an interior node only if all its descendants at level d have been pruned. Furthermore, rep either halts before the safe nodes are reached or prunes the whole tree; the latter happens only in case all safe nodes have the same majority class.

In the second setting the tree under consideration was assumed to fit noise; i.e., it was assumed that the class label of the pruning examples is independent of their attribute values. In this setting the pruning probability of the tree could be bounded by an expression that depends exponentially on the size of the tree and linearly on the number and class distribution of the pruning examples. Thus, our analysis corroborates the main finding of Oates and Jensen (1999) that rep fails to control the growth of a decision tree in the extreme case that the tree fits pure noise. Moreover, our analysis opened a possibility to begin explaining why the learned decision tree grows linearly with an increasing data set. Our bound on the pruning probability of a tree is based on bounding the probability that all safe nodes have the same majority class. Surprisingly, essentially the same property, whose probability we try to bound close to 0, is assumed to hold with probability 1 in the analysis of Oates and Jensen (1999).
In rep it may happen that no pruning examples are directed to a given subtree. Such subtrees have not been taken into account in earlier analyses. In our final analysis we included empty subtrees in the equation for a tree's pruning probability. Taking empty subtrees into account gives a more realistic bound for the pruning probability of a tree.

Unfortunately, one cannot draw very definite general conclusions on the two-phased top-down induction of decision trees on the basis of analyses of the rep algorithm, because its bias is quite unique among pruning algorithms. The fact that rep does not penalize the size of a tree, but rests only on the classification error on the pruning examples, makes the method sensitive to small changes in the class distribution of the pruning set. Other decision tree pruning algorithms also have their individual characteristics. Therefore, a unified analysis of decision tree pruning may be impossible.

The version of rep in which one is allowed to relabel original leaves as well is used as the performance objective in Kearns and Mansour's (1998) pruning algorithm. Thus, the performance of pruning algorithms that use both an error and a size penalty is related to that of algorithms that use only error estimation. In the version of rep used by Kearns and Mansour, our analysis based on safe nodes applies with leaves in place of safe nodes. Hence, for this algorithm the derived bounds are stricter.

We leave the detailed analysis of other important pruning algorithms as future work. Only through such investigation is it possible to disclose the differences and similarities of pruning algorithms. Empirical examination has not managed to reveal clear performance differences between the methods. Also, the relationship between the number of safe nodes and the number of leaves of a tree ought to be examined analytically and empirically. In particular, one should study whether the number of safe nodes does increase linearly with a growing training set, as conjectured in this paper. Deeper understanding of existing pruning algorithms may help to overcome the problems associated with the pruning phase of decision tree learning.
References

Almuallim, H. (1996). An efficient algorithm for optimal pruning of decision trees. Artificial Intelligence, 83, 347–362.

Bohanec, M., & Bratko, I. (1994). Trading accuracy for simplicity in decision trees. Machine Learning, 15(3), 223–250.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Pacific Grove, CA.

Breiman, L. (1996). Some properties of splitting criteria. Machine Learning, 24(1), 41–47.

Catlett, J. (1991). Overpruning large decision trees. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pp. 764–769, San Mateo, CA. Morgan Kaufmann.

Cestnik, B., & Bratko, I. (1991). On estimating probabilities in tree pruning. In Kodratoff, Y. (Ed.), Machine Learning – EWSL-91: Proceedings of the Fifth European Working Session, Vol. 482 of Lecture Notes in Artificial Intelligence, pp. 138–150, Berlin, Heidelberg, New York. Springer-Verlag.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23(4), 493–507.

Cohen, W. W. (1993). Efficient pruning methods for separate-and-conquer rule learning systems. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 988–994, San Mateo, CA. Morgan Kaufmann.

Esposito, F., Malerba, D., & Semeraro, G. (1993). Decision tree pruning as a search in the state space. In Brazdil, P. B. (Ed.), Machine Learning: ECML-93, Proceedings of the Sixth European Conference, Vol. 667 of Lecture Notes in Artificial Intelligence, pp. 165–184, Berlin, Heidelberg, New York. Springer-Verlag.

Esposito, F., Malerba, D., & Semeraro, G. (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 476–491.

Frank, E. (2000). Pruning Decision Trees and Lists. Ph.D. thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand.

Hagerup, T., & Rüb, C. (1990). A guided tour of Chernoff bounds. Information Processing Letters, 33(6), 305–308.

Helmbold, D. P., & Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1), 51–68.

Holte, R. C., Acker, L., & Porter, B. (1989). Concept learning and the problem of small disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 813–818, San Mateo, CA. Morgan Kaufmann.

Kamath, A., Motwani, R., Palem, K., & Spirakis, P. (1994). Tail bounds for occupancy and the satisfiability threshold conjecture. In Proceedings of the Thirty-Fifth Annual IEEE Symposium on Foundations of Computer Science, pp. 592–603, Los Alamitos, CA. IEEE Press.

Kearns, M., & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In Shavlik, J. (Ed.), Proceedings of the Fifteenth International Conference on Machine Learning, pp. 269–277, San Francisco, CA. Morgan Kaufmann.

Malerba, D., Esposito, F., & Semeraro, G. (1996). A further comparison of simplification methods for decision-tree induction. In Fisher, D., & Lenz, H.-J. (Eds.), Learning from Data: AI and Statistics V, pp. 365–374, Berlin, Heidelberg, New York. Springer-Verlag.

Mansour, Y. (1997). Pessimistic decision tree pruning based on tree size. In Fisher, D. H. (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning, pp. 195–201, San Francisco, CA. Morgan Kaufmann.

McDiarmid, C. J. H. (1989). On the method of bounded differences. In Siemons, J. (Ed.), Surveys in Combinatorics: Invited Papers of the 12th British Combinatorial Conference, pp. 148–188, Cambridge, U.K. Cambridge University Press.

Mingers, J. (1989a). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4(2), 227–243.

Mingers, J. (1989b). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4), 319–342.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.

Motwani, R., & Raghavan, P. (1995). Randomized Algorithms. Cambridge University Press, New York.

Niblett, T., & Bratko, I. (1986). Learning decision rules in noisy domains. In Bramer, M. A. (Ed.), Research and Development in Expert Systems III, pp. 25–34, Cambridge, UK. Cambridge University Press.

Oates, T., & Jensen, D. (1997). The effects of training set size on decision tree complexity. In Fisher, D. H. (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning, pp. 254–261, San Francisco, CA. Morgan Kaufmann.

Oates, T., & Jensen, D. (1998). Large datasets lead to overly complex models: An explanation and a solution. In Agrawal, R., Stolorz, P., & Piatetsky-Shapiro, G. (Eds.), Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 294–298, Menlo Park, CA. AAAI Press.

Oates, T., & Jensen, D. (1999). Toward a theoretical understanding of why and when decision tree pruning algorithms fail. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 372–378, Menlo Park, CA/Cambridge, MA. AAAI Press/MIT Press.

Oliver, J. J., & Hand, D. J. (1995). On pruning and averaging decision trees. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 430–437, San Francisco, CA. Morgan Kaufmann.

Pagallo, G., & Haussler, D. (1990). Boolean feature discovery in empirical learning. Machine Learning, 5(1), 71–99.

Pereira, F., & Singer, Y. (1999). An efficient extension to mixture techniques for prediction and decision trees. Machine Learning, 36(3), 183–199.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.

Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, 27(3), 221–248.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Slud, E. V. (1977). Distribution inequalities for the binomial law. The Annals of Probability, 5(3), 404–412.