Classification and Regression Trees, Cart TM - A user manual for identifying indicators of vulnerability to famine and chronic food insecurity + disk (IFPRI, 1999, 56 p.) |
THE PREDICTIVE ACCURACY OF CART
Accuracy is the most important feature of a classification tree. All classification procedures, however, including CART, can produce errors. The CART procedure does not make any distributional assumptions on covariates; hence, hypothesis testing is not possible. Confidence in CART´s performance, therefore, has to be based primarily on an assessment of the extent of misclassification it generates from data sets with known class distributions and on knowledge of and experience with the subject matter under study.
The best way to test the predictive accuracy of a tree is to take an independent test data set with known class distributions and run it down the tree and determine the proportion of cases missclassified. In empirical studies, the possibility of getting such a data set is remote. To overcome this difficulty, Breiman et al. (1984) provide three procedures for estimating the accuracy of tree-structured classifiers.
First, let
c(X) or c |
= |
a tree-structured classifier, where X is a vector of characteristics variables that describe an observation; |
R*[c(X)] |
= |
the classifier's "true" misclassification rate; and |
L |
= |
the learning sample (the sample data from which to construct a classification tree). |
The three estimation procedures below have two objectives: constructing a classification tree, c(X), and then finding an estimate of R*[c(X)].
1. Resubstitution Estimate, R[c(X)]. This estimates the accuracy of the true misclassification rate, R*[c(X)], as follows:
1 a. Build a classification tree, c(X), from the learning sample L, and save it.1 b. Apply the tree, c(X), to the data set from which it is built. That is, let the observations in the sample run down the tree one at a time.
1 c. Compute the proportion of cases that are misclassified. This proportion is the resubstitution estimate, R[c(X)], of the true misclassification rate, R*[c(X)].
The resubstitution estimate tests the accuracy of a classifier by applying it to observations for which the classes are known. The major weakness of this estimator of the error rate is that it is derived from the same data set from which the tree is built; hence, it underestimates the true misclassification rate. The error rate is always low in such cases.
2. Test-sample estimate. If the sample is large,
2 a. Divide the observations in the learning sample, L, into two parts: L_{1} and L_{2}. L_{1} and L_{2} need not be equal. For example, two-thirds of the cases in L can be chosen randomly to constitute L_{1} and the remaining one-third can constitute cases in L_{2}.
2 b. Use L_{1} to build the classifier, c(X), and save it.2 c. Run observations in L_{2} down the tree, c(X), one at a time.
2 d. Compute the proportion of cases that are misclassified. This proportion is the test-sample estimate, R[c(X)], of the "true" misclassification rate, R*[c(X)]. In large samples, this estimate provides an unbiased estimate of the true misclassification rate.
3. K-fold cross-validation. This is the recommended procedure for small samples and it works as follows:
3 a. Divide the learning sample, L, into K subsets of an equal number of observations. Let L_{1}, L_{2},..., L_{k} be the subsamples.3b. Construct a classifier, c(X), from the k-1 subsamples by leaving out, say, the kth subsample, L_{k}.
3c. Save the resulting classifier, c(X).
3d. Apply the saved classifier, c(X), to the excluded subsample, L_{k}, and estimate R[c(X)] as the proportion of misclassified observations. Denote this estimate as R^{ts}[c^{k}(X)], where k denotes k-fold cross-validation, and ts denotes test sample.
3e. Repeat steps 3b, 3c, and 3d, using all subsamples except the subsample L_{k-1}. The subsample L_{k-1} now becomes a test sample. The process above is repeated until every subsample is used once in the construction of c(X) and once as a test sample. The result is a series of test sample resubstitution estimates,
R^{ts}[c^{k}(X)], R^{ts}[c^{k-1}(X)],..., R^{ts}[c^{1}(X)].
3f. Add the series of R^{ts}[c^{k}(X)], R^{ts}[c^{k-1}(X)],..., R^{ts}[c^{1}(X)].generated from the k-fold cross-validation and get an estimate of R[c(X)]; that is, the k-fold cross-validation estimate R^{ck}(c) of R[c(X)] is given as
R^{ck}(c)= 1/k S_{k=1} R^{ts} [c^{(k)}],
which is an average of the error rates from k cross-validation tests. For example, if k =10, then the average is over the 10 test samples. Tenfold cross-validation is recommended.
METHODOLOGY FOR BUILDING A CLASSIFICATION TREE
In constructing a classification tree, CART makes use of prior probabilities (priors). A brief review of priors and their variations as used in CART is provided.
Prior probabilities play a crucial role in the tree-building process. Three types of priors are available in CART: priors data, priors equal, and priors mixed. They are either estimated from data or supplied by the analyst.
In the following discussion, let
N |
= |
number of cases in the sample, |
N_{j} |
= |
number of class j cases in the sample, and |
p_{i}. |
= |
prior probabilities of class j cases |
· Priors data assumes that distribution of the classes of the dependent variable in the population is the same as the proportion of the classes in the sample. It is estimated as p_{i}= N_{j} /N.
· Priors equal assumes that each class of the dependent variable is equally likely to occur in the population. For example, if the dependent variable in the sample has two classes, then prob(class 1) = prob(class 2) = 1/2.
· Priors mixed is an average of priors equal and priors data for any class at a node.
COMPONENTS FOR BUILDING A CLASSIFICATION TREE
Three components are required in the construction of a classification tree: (1) a set of questions upon which to base a split; (2) splitting rules and goodness-of-split criteria for judging how good a split is; and (3) rules for assigning a class to each terminal node. Each of these components are discussed below.
Type and Format of Questions
Two question formats are defined in CART: (1) Is X £ d?, if X is a continuous variable and d is a constant within the range of X values. For example, is income £ 2,000? Or (2) is Z = b?, if Z is a categorical variable and b is one of the integer values assumed by Z. For example, is sex = 1?
The number of possible split points on each variable is limited to the number of distinct values each variable assumes in the sample. For example, if a sample size equals N, and if X is a continuous variable and assumes N distinct points in the sample, then the maximum number of split points on X is equal to N. If Z is a categorical variable with m distinct points in a sample, then the number of possible split points on Z equals 2^{m-1} -1 (Breiman et al. 1984, 30). Unless otherwise specified, CART software assumes that each split will be based on only a single variable.
Splitting Rules and Goodness-of-Split Criteria
This component requires definition of the impurity function and impurity measure. Let
j = 1,2,...,k be the number of classes of categorical dependent variables;
then define p(j | t) as class probability distribution of the dependent variable at node t, such that p(1 | t) + p(2 | t) +p(3 | t) +... +p(k | t) = 1,j = 1, 2,..., k. Let i(t) be the impurity measure at node t. Then define i(t) as a function of class probabilities p(1 | t), p(2 | t), p (3 | t),.... Mathematically, i(t) = ¨ [p(1 | t), p(2 | t),..., p(j | t). The definition of impurity measure is generic and allows for flexibility of functional forms.
Splitting Rules. There are three major splitting rules in CART: the Gini criterion, the twoing rule, and the linear combination splits. In addition to these main splitting rules, CART users can define a number of other rules for their own analytical needs. CART uses the Gini criterion (also known as Gini diversity index) as its default splitting rule. The twoing rule is discussed in detail in Breiman et al. 1984 and will not be covered here. A brief exposition of the linear combination splits is provided later in this chapter.
The Gini impurity measure at node t is defined as i (t) = 1 - S, where S (the impurity function) = Sp^{2}(j | t ), for; =1,2,...,k (Steinberg and Colla 1995; Breiman et al. 1984).
The impurity function attains its maximum if each class (vulnerable or not) in the population occurs with equal probability. That is, p(l | t) =p(2 | t) =...= p(j | t). On the other hand, the impurity function attains its minimum (= 0) if all cases at a node belong to only one class. That is, if node t is a pure node with a zero misclassification rate, then i(t) = 0.
Goodness-of-Split Criteria. Let s be a split at node t. Then, the goodness of split "s" is defined as the decrease in impurity measured by
Di(s, t) = i(t) -p_{L} [i(h)] -p_{R}[i(t_{R})].
Where
s, |
= |
a particular split |
p_{L} |
= |
the proportion of the cases at node (that go into the left child node, t_{L}, |
p_{R} |
= |
the proportion of cases at node t that go into the right child node, t_{R}, |
i(t_{L}) |
= |
impurity of the left child node, and |
i(t_{R}) |
= |
impurity of the right child node. |
Class Assignment Rule
There are two rules for assigning classes to nodes. Each rule is based on one of two types of misclassification costs.
1. The Plurality Rule: Assign terminal node t to a class for which p(j | t) is the highest. If the majority of the cases in a terminal node belong to a specific class, then that node is assigned to that class. The rule assumes equal misclassification costs for each class. It does not take into account the severity of the cost of making a mistake. This rule is a special case of rule 2.
2. Assign terminal node t to a class for which the expected misclassification cost is at a minimum. The application of this rule takes into account the severity of the costs of misclassifying cases or observations in a certain class, and incorporates cost variability into a Gini splitting rule.
When dealing with famine vulnerability, for example, misclassifying a vulnerable household as nonvulnerable has more severe consequences than misclassifying a nonvulnerable household as vulnerable. Variable costs can be accounted for by defining a matrix of variable misclassification costs that can be incorporated into the splitting rules.
Let c(i | j) = the cost of classifying a class j case as a class i case:
c(i | j) ³ 0 if i ¹ j, c(i | j ) = 0 if i = j.
Now, assume that there are two classes in a problem. Let
p_{t}(1) |
= |
prior probability of class 1 at node t, |
p_{t}(2) |
= |
prior probability of class 2 at node t, |
r_{1}(t) |
= |
the cost of assigning node t to class 1, and |
r_{2}(t) |
= |
the cost of assigning node t to class 2. |
Given priors and variable misclassification costs, r_{1}(t) and r_{2}(t) are estimated as follows:
r_{1}(t) = I>(1) ·
c(2 | 1),
and
r_{2}(t)=x(2) · c(1 |
2).
According to rule 2, if at node t, r_{1}(t) < r_{2}(t), node t is assigned to class 1. If c(2 | 1) = c(1 | 2), then rule (1) applies and a node is assigned to a class for whitch the prior probality os the highest.
STEPS IN BUILDING A CLASSIFICATION TREE
The tree-building process starts by partitioning a sample or the root node into binary nodes based upon a very simple question of the form
is X £ d ?,
where X is a variable in the data set and d is a real number. Initially, all observations are placed in the root node. This node is impure or heterogenous because it contains observations of mixed classes. The goal is to devise a rule that will break up these observations and create groups or binary nodes that are internally more homogenous than the root node. CART uses a computer-intensive algorithm that searches for the best split at all possible split points for each variable. The methodology that CART uses for growing trees is technically known as binary recursive partitioning (Steinberg and Colla 1995). Starting from the root node, and using, for example, the Gini diversity index as a splitting rule, the tree building process is as follows:
1. CART splits the first variable at all of its possible split points (at all of the values the variable assumes in the sample). At each possible split point of a variable, the sample splits into binary or two child nodes. Cases with a "yes" response to the question posed are sent to the left node and those with "no" responses are sent to the right node.
2. CART then applies its goodness-of-split criteria to each split point and evaluates the reduction in impurity that is achieved using the formula
Di(s, t)= i(t) - p_{L} [i(t_{L})] - p_{R}[i(t_{R})],
which was described earlier.
3. CART selects the best split of the variable as that split for which the reduction in impurity is highest.
4. Steps 1-3 are repeated for each of the remaining variables at the root node.
5. CART then ranks all of the best splits on each variable according to the reduction in impurity achieved by each split.
6. It selects the variable and its split point that most reduced the impurity of the root or parent node.
7. CART then assigns classes to these nodes according to the rule that minimizes misclassification costs. CART has a built-in algorithm that takes into account user-defined variable misclassification costs during the splitting process. The default is unit or equal misclassification costs.
8. Because the CART procedure is recursive, steps 1-7 are repeatedly applied to each nonterminal child node at each successive stage.
9. CART continues the splitting process and builds a large tree. The largest tree is built if the splitting process continues until every observation constitutes a terminal node. Obviously, such a tree will have a large number of terminal nodes, which will be either pure or have very few cases (less than 10; Steinberg and Colla 1995).
Linear Combination Splits
This splitting rule is an alternative to CART's use of a single variable for splitting. It is designed for situations where the class structure of the data appears to depend on linear combinations of variables. In linear combination splits, the question posed for a node split takes the following form:
Is a_{1}·X_{1}+ a_{2} · X_{2} £ 40 ?
For example,
is.55 consumption +.05 age £ 40 ?
If the response to the question is "yes," then the case is sent to the left node, and if the response is "no," then the case is sent to the right node.
This rule is valid only for cases with no missing values on predictor variables. Furthermore, if categorical variables have to be included in the model, they should be converted to sets of dummy variables. If this option is chosen as a splitting method, it should be specified on the command line. The syntax for the command line is provided in Chapter 5.
Missing Values and Splitting
Points^{2}
^{2}This section and the following one on outliers and split points come from Seyoum et al. 1995.
Incompleteness of data may be a problem for conventional statistical analysis, but not for CART. It makes use of a "surrogate" variable splitting rule. A surrogate variable in CART is that variable that mimics or predicts the split of the primary variable. If a splitting variable used for tree construction has missing values for some cases, those cases are not thrown out. Instead, CART classifies such cases on the basis of the best surrogate variable (the variable with a close resemblance to the primary split variable). The surrogate may have a different cutoff point from the primary split, but the number of cases the surrogate split sends into left and right nodes should be very close to that with the primary split. By default, CART analysis produces five surrogate variables as part of its standard output. Surrogate splits are available only for splits based on a single variable. They are not available if the linear combination splitting rule is selected.
Apart from handling the missing data points of a case, surrogate variables can also be used for detecting the masking of variables and determining the rank of variables important either in actual or potential tree construction. Appendix 1, Example 1 provides a list of surrogates produced with the Ethiopian data and a column of variable importance (the relative importance of variables).
Outliers and Splitting Points
Outliers among the independent variables rarely affect CART analysis, because splits are generally determined at non-outlier values. If outliers exist in the dependent variable, they are isolated in small nodes, where they do not affect the rest of the tree (Webb et al. 1994).
TREE PRUNING
Large trees can have two problems: (1) Although they are highly accurate, with low or zero misclassification rates, large trees provide poor results when applied to new data sets (Steinberg and Colla 1995). And (2) understanding and interpreting trees with a large number of terminal nodes is a complicated process. Hence, large trees are referred to as complex trees. The complexity of a tree is measured by the number of its terminal nodes.
Departures from the ideal situation of low or zero misclassification entails a trade-off between accuracy and tree complexity. The relationship between tree complexity and accuracy can be understood with the cost complexity measure, which is defined as
Cost Complexity = Re substitution Misclassification Cost
+ I> Number of terminal nodes,
where I> is penalty per additional terminal node. If I> = 0, then cost complexity attains its minimum for the largest possible tree. On the other hand, as I> increases and is sufficiently large (say, infinity), a tree with one terminal node (the root node) will have the lowest cost complexity. As values of I> decrease and approach zero, trees that minimize cost complexity become larger. The "right-sized" tree with "correct" complexity should lie between these two extremes. Breiman et al. 1984 discuss how to estimate I> and offer a detailed account of the pruning process.
The search for the "right-sized" tree starts by pruning or collapsing some of the branches of the largest tree (T_{max}) from the bottom up, using the cost complexity parameter and cross-validation or an independent test sample to measure the predictive accuracy of the pruned tree. Hypothetical examples of the largest possible tree (T_{max}), the pruned branch, and the pruned tree are given in Figures 2, 3, and 4, respectively. These examples illustrate only one of the many possibilities in the tree-growing and tree-pruning process.
The pruning process produces a series of sequentially nested subtrees along with two types of misclassification costs and cost-complexity-parameter values. These are the cross-validated relative-error cost from applying tenfold cross-validation and the resubstitution relative cost generated from the learning sample. The trade-off between cost complexity and tree size can be seen in the last column of Table 2. Using the resubstitution cost, CART ranks the subtrees and generates a tree sequence table ordered from the most complex tree at the top to a less complex tree with one terminal node at the bottom (Table 2). It is a real example, taken from the computer output that produced Figure 1.
In other words, the tree-sequence table provides subtrees with a decreasing complexity (a decreasing number of terminal nodes) and an increasing cost (resubstitution relative cost). CART finally identifies the minimum-cost tree, and picks an optimal tree as the tree within one standard error of the minimum-cost tree. The option of a one-standard-error rule can be changed by the data analyst. But the reason for using a one-standard-error rule is that there may be other trees with cross-validated error rates close to those of the minimum-cost tree. Breiman et al. (1984) suggest that an optimal tree should be the one with the smallest terminal nodes among those that lie within one standard error of the minimum-cost tree. The minimum-cost tree itself could become the "right-sized" or the optimal-cost tree.
Table 2 - Example of sequence of trees produced by pruning
Dependent variable: CUTDUM2
Tree |
Number of terminal nodes |
Cross-validated relative cost |
Resubstitution relative cost |
Complexity parameter |
1 |
32 |
0.704 +/- 0.060 |
0.145 |
0.000 |
8 |
16 |
0.639 +/- 0.058 |
0.244 |
0.008 |
9 |
14 |
0.635 +/- 0.058 |
0.276 |
0.008 |
10 |
12 |
0.632 +/- 0.058 |
0.310 |
0.008 |
11* |
11 |
0.608 +/- 0.057 |
0.332 |
0.011 |
12** |
8 |
0.634 +/- 0.058 |
0.430 |
0.016 |
13 |
7 |
0.668 +/- 0.059 |
0.464 |
0.017 |
14 |
5 |
0.687 +/- 0.059 |
0.640 |
0,019 |
15 |
3 |
0.700 +/- 0.058 |
0.619 |
0.020 |
16 |
2 |
0.729 +/- 0.048 |
0.696 |
0.088 |
17 |
1 |
1.000 +/- 0.000 =0.500 |
1.000 |
0.152 |
Initial misclassification cost = 0.5000 | |
| ||
Initial class assignment = 0 | | |
* indicates minimum-cost tree.
** indicates optimum-cost tree.
In Table 2, the cross-validated relative-cost column shows that cross-validation error initially decreases as complexity decreases, reaches a minimum, and then increases. CART picks the tree with the minimum cross-validated cost as the minimum-cost tree, which is marked by an asterisk. The minimal-cost tree has 11 terminal nodes and a cross-validated cost of 0.603 +/- 0.057. The optimal tree is obtained by applying the one-standard-error rule to the minimum-cost tree. Tree number 12 with 8 terminal nodes meets the criteria of an optimal-cost tree and it is identified by two asterisks. Tree number 10 with 12 terminal nodes is another candidate for an optimal tree. However, it is more complex than tree number 12.