Classification and Regression Trees, CART™ - A user manual for identifying indicators of vulnerability to famine and chronic food insecurity + disk (IFPRI, 1999, 56 p.)

6. Refining CART Analyses

The first CART session does not always give the desired results, and CART may fail to produce any tree at all. To overcome these problems, some of the alternative refinements introduced in Chapter 5 may need to be applied. The structure of the trees may differ with each alternative: the variables on which the splits are made and the number of terminal nodes may change. Even removing a single variable from the analysis can produce a tree with a different structure. For these reasons, CART reports the cross-validated relative-error cost of each tree along with its standard error (Breiman et al. 1984).

The contingent structure of the trees raises the issue of which classification tree to choose and how to choose it. For each alternative, CART produces a number of useful classification tables based on the learning sample and on cross-validation tests (see Appendix 1, Example 1). Since the goal of a classification tree is to enable the analyst to predict the class of future observations, more attention should be paid to the cross-validation classification tables and the cross-validation classification probability tables. Ultimately, the choice of tree depends on what the analyst intends to do with it.

To illustrate the issue of choice, several alternatives to the CART results shown in Figure 1 in Chapter 2 are produced. The complete CART output is provided in Examples 1, 2, and 3 of Appendix 1 on the diskette, and condensed versions are provided in Examples 1, 2, and 3 of the hard copy of Appendix 1. For comparison, the cross-validation classification probabilities are extracted from the output of the three alternative models and given in Table 6.

Example 1 in Table 6 is based on the assumption of PRIORS EQUAL, Example 2 on PRIORS DATA, and Example 3 on PRIORS MIXED. For the tree in Example 1, the cross-validated error rate is 0.634 +/- 0.058, the resubstitution estimate is 0.430, and the total correct classification is 69.2 percent (see Appendix 1, Example 1). For the tree in Example 2, the cross-validated error rate is 0.921 +/- 0.077, the resubstitution estimate is 0.663, and the total correct classification is 75.7 percent (see Appendix 1, Example 2). Finally, for the tree in Example 3, the cross-validated error rate is 0.782 +/- 0.066, the resubstitution estimate is 0.537, and the total correct classification is 73.7 percent (see Appendix 1, Example 3).
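The three PRIORS settings differ only in the class probabilities that CART assumes before growing the tree. The following is a minimal sketch of that idea, not the CART program itself. It assumes that PRIORS EQUAL gives every class the same weight, that PRIORS DATA uses the class shares in the learning sample, and that PRIORS MIXED averages the two. The function name `class_priors` and the household counts are hypothetical and are not taken from the manual's data set.

```python
def class_priors(counts, mode):
    """Return assumed class priors for a dict of class -> learning-sample count."""
    n_classes = len(counts)
    total = sum(counts.values())
    equal = {c: 1.0 / n_classes for c in counts}      # same weight for every class
    data = {c: n / total for c, n in counts.items()}  # observed sample shares
    if mode == "equal":
        return equal
    if mode == "data":
        return data
    if mode == "mixed":
        # Assumption: MIXED is the simple average of EQUAL and DATA.
        return {c: (equal[c] + data[c]) / 2 for c in counts}
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical counts: 300 nonvulnerable (class 0), 100 vulnerable (class 1).
counts = {0: 300, 1: 100}
for mode in ("equal", "data", "mixed"):
    print(mode, class_priors(counts, mode))
```

When one class is much smaller than the other, as in this hypothetical sample, equal priors give the small class more weight than its sample share would, which is one way to read the stronger performance of Example 1 on vulnerable households.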

In Table 6, a matrix of predicted class probabilities is provided for each example. Under Example 1, the classification tree predicted 70.3 percent of the nonvulnerable households as nonvulnerable and 66.3 percent of the vulnerable households as vulnerable. These are very encouraging results. But can the predictions be improved? Under Example 2, 88.4 percent of the nonvulnerable households were predicted to be nonvulnerable, but only 40.4 percent of the vulnerable households were predicted to be vulnerable. This is not a desirable outcome because of the high error rate in predicting vulnerable households. The analyst must consider which type of misclassification is more costly. The results for Example 3 fall between those of Examples 1 and 2.
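One way to make the question of which error is more costly concrete is to weight the two error rates in Table 6 by an assumed cost and an assumed share of vulnerable households. The sketch below is illustrative only. The matrices are copied from Table 6, but the function `expected_cost`, the 25 percent share of vulnerable households, and the cost values are hypothetical assumptions, not figures from the manual.

```python
# Row-normalized cross-validation matrices from Table 6,
# indexed as TABLE6[example][actual class][predicted class].
TABLE6 = {
    "Example 1 (priors equal)": [[0.703, 0.297], [0.337, 0.663]],
    "Example 2 (priors data)": [[0.884, 0.116], [0.596, 0.404]],
    "Example 3 (priors mixed)": [[0.816, 0.185], [0.483, 0.517]],
}


def expected_cost(matrix, share_vulnerable, cost_missed, cost_false_alarm):
    """Expected misclassification cost per household under assumed costs."""
    false_alarm_rate = matrix[0][1]  # nonvulnerable predicted as vulnerable
    missed_rate = matrix[1][0]       # vulnerable predicted as nonvulnerable
    return ((1 - share_vulnerable) * false_alarm_rate * cost_false_alarm
            + share_vulnerable * missed_rate * cost_missed)


def cheapest(share_vulnerable, cost_missed, cost_false_alarm):
    """Name of the example with the lowest expected cost."""
    return min(
        TABLE6,
        key=lambda name: expected_cost(
            TABLE6[name], share_vulnerable, cost_missed, cost_false_alarm
        ),
    )


# Hypothetical inputs: 25 percent of households are vulnerable.
print(cheapest(0.25, cost_missed=1, cost_false_alarm=1))
print(cheapest(0.25, cost_missed=5, cost_false_alarm=1))
```

Under these assumed inputs, equal costs favor Example 2, while treating a missed vulnerable household as five times as costly as a false alarm favors Example 1. The ranking depends on the assumed costs and shares.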

The classification tree produced under the assumption of PRIORS DATA has a better overall correct classification rate (75.7 percent) than the other trees (see Appendix 1, Examples 1, 2, and 3). But the tree in Example 1 performs best when it comes to classifying the vulnerable group: it correctly classifies 66.3 percent of the vulnerable households. Furthermore, a comparison of the cross-validated error rates of the three examples shows that the tree in Example 1 has the smallest (0.634, against 0.782 for Example 3 and 0.921 for Example 2). Thus, the classification tree in Example 1 provides the best classifiers, or indicators, of vulnerability. However, the final choice depends on the analyst.
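The comparison of cross-validated error rates can be sketched in a few lines. The error rates and standard errors below are those reported for the three examples. The one-standard-error check is borrowed from the pruning rule in Breiman et al. (1984) and applied here across prior settings purely as an illustration. It is an assumption of this sketch, not a step the manual performs, and the function names are hypothetical.

```python
# (cross-validated relative error, standard error) for each example.
CV_ERROR = {
    "Example 1": (0.634, 0.058),
    "Example 2": (0.921, 0.077),
    "Example 3": (0.782, 0.066),
}


def rank_by_cv_error(results):
    """Return example names ordered from lowest to highest cross-validated error."""
    return [name for name, _ in sorted(results.items(), key=lambda kv: kv[1][0])]


def within_one_se_of_best(results):
    """Names whose error is no more than one standard error above the best tree's."""
    best = rank_by_cv_error(results)[0]
    best_err, best_se = results[best]
    threshold = best_err + best_se
    return [name for name, (err, _) in results.items() if err <= threshold]


print(rank_by_cv_error(CV_ERROR))
print(within_one_se_of_best(CV_ERROR))
```

With the reported figures, no other tree falls within one standard error of Example 1, which is consistent with preferring that tree on cross-validated error alone.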

Many other options remain available to the analyst. The results for some of these options are given in Examples 1, 2, and 3 of Appendix 3, which appears only on the diskette. In these optional runs, alternative misclassification costs were added to the program to see whether the overall misclassification rate would improve. No improvements resulted.

Table 6 - Cross-validation classification probability comparisons

                                        Predicted class
Example   Priors         Actual class   0        1        Actual total
1         Priors equal   0              0.703    0.297    1.00
                         1              0.337    0.663    1.00
2         Priors data    0              0.884    0.116    1.00
                         1              0.596    0.404    1.00
3         Priors mixed   0              0.816    0.185    1.00
                         1              0.483    0.517    1.00