Classification and Regression Trees, Cart TM - A user manual for identifying indicators of vulnerability to famine and chronic food insecurity + disk (IFPRI, 1999, 56 p.)

### 2. Overview of CART

CART is a nonparametric statistical methodology developed for analyzing classification problems in which the dependent variable is either categorical or continuous. If the dependent variable is categorical, CART produces a classification tree; when it is continuous, it produces a regression tree. Regression trees are discussed in detail in Chapter 4. In both classification and regression trees, CART's major goal is to produce an accurate set of data classifiers by uncovering the predictive structure of the problem under consideration (Breiman et al. 1984). That is, CART helps identify the variables, or interactions of variables, that are responsible for a given phenomenon, such as famine, and that best determine one outcome rather than another (Seyoum et al. 1995). The purpose of such classifiers, or classification rules, is to enable one to predict the class (vulnerable or not vulnerable, in the case of famine-affected households) of any future observation from the profile of characteristics submitted for analysis. That is, given the characteristics of an observation, the goal is to determine whether the observation falls into the vulnerable class or not. The example in Figure 1 illustrates how CART methodology works.

In brief, the construction of a CART classification rule centers on the definition of three major elements discussed in Chapter 3. These are (1) the sample-splitting rule, (2) the goodness-of-split criteria, and (3) the criteria for choosing an optimal or final tree for analysis. CART builds trees by applying predefined splitting rules and goodness-of-split criteria at every step in the node-splitting process. In a highly condensed form, the steps in the tree-building process involve (1) growing a large tree (a tree with a large number of nodes), (2) combining some of the branches of this large tree to generate a series of subtrees of different sizes (varying numbers of nodes), and (3) selecting an optimal tree via the application of "measures of accuracy of the tree." These will be described in full in Chapter 3.
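The grow-prune-select sequence above can be sketched with scikit-learn's `DecisionTreeClassifier`, which implements a CART-style algorithm. The data below are synthetic and the fold count is the library default, so this is an illustration of the three steps, not the CART TM program itself.

```python
# Sketch of the three CART steps: (1) grow a large tree, (2) generate a
# nested sequence of pruned subtrees via cost-complexity pruning, and
# (3) pick the subtree with the best cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=338, n_features=8, random_state=0)

# Step 1: grow a large tree (no depth limit).
big = DecisionTreeClassifier(random_state=0).fit(X, y)

# Step 2: the pruning path yields the alpha values that produce the
# series of subtrees of decreasing size (clip guards against tiny
# negative alphas from floating-point error).
path = big.cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0, None)

# Step 3: score each subtree by cross-validation and keep the best.
scores = [cross_val_score(
              DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y
          ).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
final = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(final.get_n_leaves() <= big.get_n_leaves())  # pruning never adds leaves
```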

In Figure 1, the results of a CART analysis based on research on vulnerability to famine (Webb et al. 1994) are summarized graphically in the form of an inverted tree. The CART analysis has two major objectives: (1) to get a better understanding of the characteristics of households that were vulnerable to famine, and (2) to generate tree-structured classifiers or indicators of vulnerability and assess the potential of these indicators for accurately predicting the prevalence of vulnerability to famine in the future.

Figure 1 - Classification tree of a famine vulnerability study

Notes: N stands for number of households at each node. TLU is tropical livestock unit, which converts big and small animals into a common unit.

The analysis is based on a sample survey of 338 households that was conducted in 1989/90 in Ethiopia. The list of variables used in the analysis is given in Table 1. The dependent variable is CUTDUM2. It is an indicator of vulnerability defined as a 0/1 binary variable. A household is vulnerable to famine if CUTDUM2=1 and not vulnerable if it equals 0. These two categories of vulnerability are referred to as class 1 and class 0, respectively. During the Ethiopian famine in the 1980s, 89 of the sample households were classified as vulnerable to famine, while 249 were not. The top circle in Figure 1 contains this basic information (N=338, yes=89, and no=249).
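The root-node figures quoted above can be verified with a few lines of bookkeeping on the reported counts (not the survey data itself); a node's class counts and class proportions are the basic quantities CART works with at every node.

```python
# Root-node bookkeeping from the text: 338 households, of which 89 are
# vulnerable (CUTDUM2 = 1) and 249 are not.
import numpy as np

cutdum2 = np.array([1] * 89 + [0] * 249)   # the 0/1 dependent variable
n = len(cutdum2)
n_vulnerable = int(cutdum2.sum())
share = n_vulnerable / n                   # class 1 share at the root
print(n, n_vulnerable, n - n_vulnerable)   # 338 89 249
print(round(share, 3))                     # 0.263
```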

Table 1 - Household variables

| Name | Definition |
| --- | --- |
| PCAST80 | Per capita value of household assets (farm and nonfarm) |
| PCNFRAST | Per capita value of nonfarm assets (excluding livestock) |
| PCLIVINC | Household income per capita from livestock and products |
| PCFRMAST | Value of farm assets per capita |
| PCWC | Household income per capita |
| PCAGWC | Household income from crops and livestock per capita |
| PCLSU80 | Tropical livestock units owned per capita |
| PCFRMWC | Crop income per capita |
| PCNNFINC | Nonfarm income per capita |
| LVSLSU80 | Total tropical livestock units owned per household |
| FRMASRAT | Value of farm assets in total value of assets held |
| NFRMASRA | Value of nonfarm assets in total value of assets held |
| CERLAR80 | Cereal area cultivated (hectares) |
| CERYLD80 | Cereal yields (wheat equivalents in kilograms per hectare) |
| HHEADSBX | Gender of household head |
| GINI | Index of crop diversity (larger number = lower diversity) |
| OXQ80 | Number of oxen owned per household |
| NCERYL80 | Noncereal yields (wheat equivalents in kilograms per hectare) |
| NCERAR80 | Noncereal area cultivated (hectares) |
| AGINCRAT | Share of crop and livestock income in total income |
| LIVSYRAT | Share of income from livestock and livestock products in total income |
| FARMYRAT | Share of crop income in total income |
| NFRMYRAT | Share of nonfarm income in total income |
| PCDCALS | Calorie consumption per day per capita |
| HHSIZE | Household size |
| CUTDUM2 | Dummy variable (1 = vulnerable household; 0 = not vulnerable) |
| CALDUM | Per capita daily calorie consumption group |

Source: International Food Policy Research Institute/Office of National Committee for Central Planning (Ethiopia)/International Livestock Center for Africa (now the International Livestock Research Institute) survey, 1989/90, reported in Webb, von Braun, and Yohannes 1992.

Without going into technical details of the tree-building process (see Chapter 3), it should simply be noted here that CART splits a sample into binary subsamples based on the response to a very simple question requiring only a yes/no answer. The question used to create splits is given at the bottom of each circle (Figure 1). Each question is based only on a single variable chosen from the list of variables in Table 1. Depending on the response (yes/no) to the question, the sample is partitioned into left and right binary subsamples. The issue of how CART chooses a variable and its split point is discussed in Chapter 3. When a split occurs, the subsamples, also called nodes, end up either in a circle or in a rectangular box. The rectangular boxes are referred to as terminal nodes and the circles are nonterminal nodes. Terminal nodes do not split further, while nonterminal nodes do. From here on, node will be used instead of subsample.
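The yes/no questions that drive the splitting can be made concrete with a small sketch: for a single continuous variable, scan the candidate cutoffs and keep the one that best separates the two classes. The Gini impurity used here is one possible goodness-of-split criterion (Chapter 3 discusses the criteria CART actually offers); the data are a toy example, not the survey sample.

```python
# Minimal sketch of one CART-style binary split: find the cutoff c such
# that the question "is x <= c?" yields the purest left/right nodes.
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (cutoff, weighted impurity) of the best yes/no question
    'is x <= cutoff?' for a single continuous variable x."""
    best = (None, np.inf)
    for c in np.unique(x)[:-1]:             # candidate cutoffs
        left, right = y[x <= c], y[x > c]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (c, score)
    return best

# Toy data: low x is mostly class 1 ("vulnerable"), high x is class 0.
x = np.array([0.5, 1.0, 1.5, 2.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1,   1,   1,   0,   0,   0,   0,   0])
cutoff, score = best_split(x, y)
print(cutoff)  # 1.5 -- this cutoff sends the three class-1 cases left
```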

The noncereal yield variable produces the first split in the sample (Figure 1). Noncereals are composed mostly of pulses and are given in terms of wheat equivalents. Noncereals, especially pulses, constitute a major component of the diet of the poor in Ethiopia. The average noncereal yield across the sample is 247 quintals per hectare. The cutoff point is 4.7 quintals per hectare. Households with low noncereal yield go to the left node and the rest to the right node. The right node is in a rectangle and cannot be split any further. Underneath this node are the labels "H" and "class 0 node." These labels identify, respectively, the node and the class to which the node is assigned. This terminal node is classified as class 0 because it contains nonvulnerable households. The left node is nonterminal and is subject to a further split.

The second split is based on whether a household owns fewer than two oxen. Because farmers can cultivate only with a pair of oxen, households with one ox or none go to the left node and the rest to the right node. For households with no more than one ox, the next split is based on a crop diversity index. This index measures the mix of crops planted by a household. The higher the diversity index, the more mixed or diversified the planted crops. Households with a crop diversity index of less than or equal to 0.34 are sent to the left node, while those with a higher diversity index are sent to the right node.

Continuing with the split, households with a crop diversity index of at most 0.34 are further split based on the tropical livestock unit (TLU) per capita variable. TLU is an index that converts big and small animals into a common unit. Households with TLU less than or equal to 1.7 per capita are sent to the left terminal node while the others go to the right terminal node. The two terminal nodes are labeled A and B. Terminal node A is classified as class 0 (nonvulnerable households), while terminal node B is classified as class 1 (vulnerable households). The other terminal nodes, labeled C through G, are generated in a similar manner.

Each terminal node is the endpoint of a separate path or structure, and yet a group of them end up in the same class. This indicates that paths to vulnerability or nonvulnerability to famine depend on the amount of resources with which households are endowed. Households in terminal nodes A, D, F, and H, are classified as nonvulnerable to famine, while households in terminal nodes B, C, E, and G are classified as vulnerable.

The sequential structure leading to terminal node B indicates that this set of vulnerable households has extremely low noncereal yield per hectare, one ox or none, low crop diversity, and high TLU per capita. These are typically extremely poor households whose livelihoods appear to depend mostly on livestock holdings. Indeed, examination of the data set shows that 87.5 percent of the vulnerable households at this terminal node come from a survey site where 70 percent of the households reported a reduction in the number of meals consumed during the Ethiopian famine of the 1980s. Furthermore, it is a pastoral site (Beke Pond) located in a lowland area where livestock rather than farming sustain well-being. Most of the characteristics of households in this terminal node are captured by the four variables used to arrive at the node.

Households in terminal node C are identified by extremely low noncereal yield per hectare, ownership of one ox or none, at least average crop diversity, and a household size of at most 6.5. Examination of the data set shows that 71 percent of the vulnerable households at this terminal node come from the Dinki area, which was the survey site most affected by the famine of the 1980s (Webb et al. 1992). Nearly 71 percent of the households at this survey site reported reducing the number of meals consumed during the famine. Clearly, the four variables that lead to this terminal node, along with their cutoff points, form the best indicators of vulnerability to famine for households at this location.

Terminal node E characterizes vulnerable households as those with extremely low noncereal yield per hectare, less than two oxen, at least average crop diversity, large household size, and almost all income derived from agriculture. Fifty percent of the vulnerable households at this terminal node come from the Dinki survey site.

Terminal node G is a pure node. It contains only households that are vulnerable to famine. These are households with extremely low noncereal yield per hectare, at least two oxen, and a large per capita livestock holding. The vulnerable households at this terminal node come from Beke Pond (a pastoral site).

The most interesting aspect of this exercise is that the CART procedure identified the characteristics of households most affected by the famine of the 1980s by using only 6 of the 27 variables. These 6 variables along with their cutoff points carry most of the information required for establishing tree-structured classification rules that could identify vulnerable households in the future. Vulnerable households at Dinki and Beke Pond account for 67 percent of all vulnerable households in the 7 survey sites. CART has successfully untangled the complexities of a data set and identified the indicators of households vulnerable to famine.

HIGHLIGHTS OF OTHER CLASSIFICATION METHODS AND PROCEDURES

Besides CART, a number of other methods and procedures for classifying data exist. These methods fall into two groups.

| Group 1 | Group 2 |
| --- | --- |
| AID | Discriminant analysis |
| THAID | Kernel density estimation |
| CHAID | Kth nearest neighbor |
| | Logistic regression |
| | Probit models |

The methods in Group 1 generate classification trees. AID is an acronym for Automatic Interaction Detection, a classification algorithm developed by J. N. Morgan and J. A. Sonquist in 1963 at the University of Michigan. The AID algorithm led to the development of THAID (a sequential search program for analysis of nominal-scale dependent variables) by Morgan and Messenger at the University of Michigan in 1973, and of Chi-squared Automatic Interaction Detection (CHAID) by Kass in 1980. These three procedures generate multilevel splits in producing classification trees. Unlike CART, they are not distribution-free, and they all employ significance tests on predictor variables to generate node splits and determine the size of a tree. They also differ from CART in the processes of tree growing and pruning and in the estimation of prediction error.

The methods in Group 2 do not produce classification trees. They all assume functional relationships between dependent and predictor variables. Discriminant analysis, Kernel density estimation, and Kth nearest neighbor are the most widely used classification methods. Breiman et al. (1984, 15 - 17) provide details on these methods and their weaknesses. Since discriminant analysis or its variation, linear discriminant function, has been widely used as a classification method, especially in education and in psychology, business, and marketing research (for example, in product targeting and market segmentation), a brief review of the methodology follows.

In order to use the linear discriminant function method, the following distributional assumptions must hold (Maddala 1983):

1. All of the predictor variables should follow multivariate normal distribution for each class of dependent variable, and

2. The variance-covariance matrixes of each class should be equal.

The procedure first forms a linear combination of predictor variables and estimates the coefficients of the variables in that combination. This is followed by computation of a discriminant score for each case or observation, using the estimated coefficients and the corresponding values of the predictor variables. A classification rule is formed by applying Bayes' rule to the discriminant scores.
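This three-step procedure (linear combination, discriminant scores, Bayes' rule) can be sketched with scikit-learn's `LinearDiscriminantAnalysis` on synthetic data constructed to satisfy the method's own assumptions; the data and class labels are invented for illustration.

```python
# Linear discriminant sketch: two classes drawn from multivariate
# normals with a shared covariance matrix -- exactly the assumptions
# the method requires.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
cov = [[1, 0.3], [0.3, 1]]
class0 = rng.multivariate_normal([0, 0], cov, size=100)
class1 = rng.multivariate_normal([2, 2], cov, size=100)
X = np.vstack([class0, class1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)

# The fitted linear combination: coefficients plus intercept give each
# case a discriminant score; in the two-class case the sign of the
# score is the Bayes-rule classification.
scores = X @ lda.coef_.ravel() + lda.intercept_
pred = (scores > 0).astype(int)
print((pred == lda.predict(X)).all())  # score sign reproduces predict()
```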

The distributional assumption of normality is strong, and the methodology is often applied regardless of whether the assumptions hold for every variable used in the analysis. The method is designed to handle only continuous predictor variables; categorical predictor variables must be transformed into a series of dummy variables. This additional task leads to the problem of dimensionality (having too many variables). Furthermore, all variables that enter the linear combination must be complete. That is, no case with a missing value for a variable can be used in the analysis; such observations have to be dropped, which may introduce bias due to the reduced sample size. The procedure is also known to yield poor results if the predictor variables are all binary or a mixture of continuous and binary.

Logistic regression and probit models are other parametric methods used in classification studies. The final outcome of these methods yields the proportion of predicted cases that falls into different categories of the dependent variable. As in linear discriminant analysis, these methods are not distribution-free, do not have any provision for analyzing cases with missing values for a variable, and deal only with categorical dependent variables. As in all parametric models, the variables used in the analysis are entirely determined by the analyst.
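A minimal logistic-regression counterpart (scikit-learn, synthetic data): the fitted model yields a predicted probability for each case, and the proportion of cases assigned to each category of the dependent variable follows directly. Everything here is illustrative, not the IFPRI analysis.

```python
# Logistic regression as a classifier: predicted probabilities per
# case, then the share of cases falling into each predicted category.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]   # P(class 1) for each case
share_class1 = (proba > 0.5).mean()    # proportion predicted in class 1
print(0.0 <= share_class1 <= 1.0)
```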

CART methodology builds on and enhances the classification work of AID and THAID (Breiman et al. 1984), while overcoming the problems associated with those algorithms and some of the drawbacks of the classification methods in Group 2.

Breiman et al. (1984) made several comparative analyses of CART and discriminant analysis results and found that CART performed better than discriminant analysis. Marais, Patell, and Wolfson (1985) also noted similar findings in their classification study of commercial loans, as did Srinivasan and Kim (1987) in their credit-granting study. But in models where linear structure and the assumption of normality hold, Breiman et al. (1984) found that results from discriminant analysis were better than those from CART. Regardless of the problems with other procedures, Breiman et al. (1984) advise not to use CART "to the exclusion of other methods." Whenever possible, one of the other methods should be used for comparative purposes.

SUMMING UP CART'S STRENGTHS AND WEAKNESSES

Breiman et al. (1984) and Steinberg and Colla (1995) provide a number of justifications for using CART. A few of them are listed below.

1. CART makes no distributional assumptions of any kind for dependent and independent variables. No variable in CART is assumed to follow any kind of statistical distribution.

2. The explanatory variables in CART can be a mixture of categorical and continuous.

3. CART has a built-in algorithm to deal with the missing values of a variable for a case, except when a linear combination of variables is used as a splitting rule (see Chapter 3).

4. CART is not affected by outliers, collinearities, heteroskedasticity, or the distributional error structures that affect parametric procedures. Outliers are isolated into a node and thus have no effect on splitting. Contrary to the situation in parametric modeling, CART makes use of collinear variables in "surrogate" splits.

5. CART has the ability to detect and reveal variable interactions in the data set.

6. CART is invariant under monotone transformations of independent variables; that is, transforming explanatory variables to logarithms, squares, or square roots has no effect on the tree produced.

7. In the absence of a theory that could guide a researcher, in a famine vulnerability study, for example, CART can be viewed as an exploratory, analytical tool. The results can reveal many important clues about the underlying structure of famine vulnerability.

8. CART's major advantage is that it deals effectively with large data sets and the issues of higher dimensionality; that is, it can produce useful results from a large number of variables submitted for analysis by using only a few important variables.

9. The inverted-tree-structure results generated from CART analysis are easy for anyone to understand in any discipline.
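The monotone-invariance property in point 6 is easy to check empirically: refitting a scikit-learn CART-style tree after a log transformation of strictly positive explanatory variables leaves the partition of the sample, and hence the predictions, unchanged, because splits depend only on the ordering of values. Synthetic data; a demonstration, not a proof.

```python
# Fit the same tree on raw and log-transformed features; the cutoff
# values differ, but the sample partition and predictions do not.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 10.0, size=(200, 3))     # strictly positive features
y = (X[:, 0] + X[:, 1] > 10).astype(int)

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
logged = DecisionTreeClassifier(random_state=0).fit(np.log(X), y)

same = (raw.predict(X) == logged.predict(np.log(X))).all()
print(same)  # True: identical predictions for every sample
```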

CART analysis does have some limitations, however.1 CART is a blunt instrument compared to many other statistical and analytical techniques. At each stage, the subdivision of data into two groups is based on only one value of only one of the potential explanatory variables. If a statistical model that appears to fit the data exists, and if its basic assumptions appear to be satisfied, that model would be preferable, in general, to a CART tree.

1 The following is adapted from Seyoum et al. 1995.

A weakness of the CART method and, hence, of the conclusions it may yield is that it is not based on a probabilistic model. There is no probability level or confidence interval associated with predictions derived from a CART tree that could help classify a new set of data. The confidence that an analyst can have in the accuracy of the results produced by a given CART tree is based purely on that tree's historical accuracy - how well it has predicted the desired response in other, similar circumstances. This is essentially how the structure of the tree is determined in the first place, through k-fold cross-validation, which will be discussed in Chapter 3.
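The historical-accuracy idea is exactly what k-fold cross-validation estimates: hold out each fold in turn, fit the tree on the remainder, and score predictions on the held-out fold. Below is a scikit-learn sketch on synthetic data; k = 10 and the depth limit are assumptions for illustration, and Chapter 3 covers how CART uses the resulting estimate.

```python
# k-fold cross-validated accuracy of a CART-style tree: each score is
# the accuracy on one held-out fold, fitted on the other k-1 folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=338, n_features=10, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(tree, X, y, cv=10)   # 10 held-out accuracies
print(round(scores.mean(), 2))                # cross-validated accuracy
```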