Classification and Regression Trees, Cart TM - A user manual for identifying indicators of vulnerability to famine and chronic food insecurity + disk (IFPRI, 1999, 56 p.)

4. Regression Trees: An Overview

Recall from Chapter 1 that CART produces a classification tree when the dependent variable is categorical and a regression tree when the dependent variable is continuous. The process of constructing a regression tree is similar to that for a classification tree, but in building a regression tree there is no need for priors or class assignment rules. The splitting rules, goodness-of-fit criteria, and measures of tree accuracy for a regression tree differ from those for a classification tree. These issues are discussed in detail in the two subsections that follow the regression-tree example below.

As with classification, regression-tree building centers on three major components: (1) a set of questions of the form, Is X ≤ d?, where X is a variable and d is a constant; the response to such questions is yes or no; (2) goodness-of-split criteria for choosing the best split on a variable; and (3) generation of summary statistics for terminal nodes. The latter component is unique to a regression tree. In classification trees, the terminal nodes are assigned to a specific class according to the class assignment rule. In regression trees, however, there are no classes to which terminal nodes are assigned. Instead, for each of the terminal nodes produced by CART regression, summary statistics of the dependent variable are computed.

The main purpose of CART regression is to produce a tree-structured predictor or prediction rule (Breiman et al. 1984). This predictor serves two major goals: (1) to predict accurately the dependent variable from the future or new values of the predictor variables; and (2) to explain the relationships that exist between the dependent and predictor variables. The CART regression predictor is constructed by detecting the heterogeneity (in terms of variance of the dependent variable) that exists in the data set and then purifying the data set. CART does this by recursively partitioning a data set into groups or terminal nodes that are internally more homogenous than their ancestor nodes. At each terminal node, the mean value of the dependent variable is taken as the predicted value. If the objective of a regression tree is explanation, then this is achieved by tracking the paths of a tree to a specific terminal node.

An example of a regression tree is given in Figure 5, and the list of variables supplied for generating the tree is given in Table 3.

Figure 5 - CART analysis of 77 awrajas, 1982-87

Table 3 - Variables for awraja-level analysis

Variable  Definition
MZSHTTRD  Retail price of maize/producer price of sheep terms of trade
MZSHTTMN  Average of MZSHTTRD during 1981-87
MZSHTTDV  Standard deviations of __RD from __MN
MZSHTTCV  Coefficient of variation of __RD during 1981-87
CERLPROD  Gross production of all cereals in tons
CERLMN  Mean of CERLPROD during 1981-87
CERLDV  Standard deviations of CERLPROD from CERLMN
CERLCV  Coefficient of variation of CERLPROD during 1981-87
PCTBELG  Percent of annual cereal production from Belg season
PCTBLGMN  Average of PCTBELG during 1981-87
PCTBLGDV  Standard deviations of PCTBELG from PCTBLGMN
PCTBLGCV  Coefficient of variation of PCTBELG during 1981-87
CERLPP  Gross production of all cereals per capita rural population
AVGFAMSZ  Average size of rural household
DEPRATIO  Dependency ratio (population under 15 and over 60 years old/total population 15-59 years old)
LITERATE  Literacy ratio of males 15 years old/total population 15 years old
TOTFERTR  Total fertility rate
GENFERTR  General fertility rate
PAR4549R  Average parity (45-49 years)
ASDRRURL  Age-specific death rates in rural areas
IMRRURAL  Infant mortality rate in rural areas
NPERRMRU  Average number of people sharing a bedroom in rural areas
LIFEEXPR  Life expectancy in rural areas
CRDBRTHR  Crude birth rate in rural areas
GRRERRUR  Gross reproductive rate
MLUPSLRM  Soil loss rate estimates from Master Land Use Plan
POPUME  Urban male population
POPUFE  Urban female population
POPURME  Rural male population
POPRFE  Rural female population
ALLKMKM2  All-weather roads per square kilometer
AVGEP84R  Average land elevation weighted by rural population
HLTHFIND  Index of health infrastructure based on need
PRPRFHHD  Share of female heads in total number of household heads
PERENNL0  Percent farmers with no perennial crops
PERENNL1-5  Percent farmers with 1-6 perennial crops
ANNUAL0  Percent farmers with no annual crops
ANNUAL1-8  Percent farmers with 1-8 annual crops
DISTBGMK  Distance to large market (kilometers)
DISTSMMK  Distance to small market (kilometers)
AVGHHINC  Average household income
GINIHINC  Gini coefficient of average household income by awraja
PCTFRMRS  Percent rural population who are farmers
AVGPCINC  Average farm income per capita
GINIPINC  Gini coefficient of AVGPCINC by woreda weighted by population
PCTFRALW  Share of farmers that always or sometimes plant belg crop
PCTFRSOM  Share of farmers that never plant belg crop
AVGNOXEN  Average number of oxen owned
PCT0OXEN  Percent households with no oxen
ANNLPCHA  Average area cultivated with annual crops per capita
PRNLPCHA  Average area cultivated with perennial crops per capita
ANLAVG  Average area cultivated with annual crops by household
PERLAVG  Average area cultivated with perennial crops by household
FALAVGHA  Average area fallowed by household
AVGARAHA  Average arable land owned
PCTIRRIG  Percent of farmers using irrigation
IRRIGHA  Total irrigated area
GINITLU  Gini coefficient of TLU ownership (all species)
GINIPCMK  Gini coefficient of percent crop marketed
PRIM0014  Percent children 0-14 years old with any schooling
BELGMN  Average NDVI for Belg season by year
BELGMX  Maximum NDVI for the season, average for all pixels by awraja
BELGMNMN  __MN average for 1982-90
BELGMXMN  __MX average for 1982-90
BELGMNCV  __MN coefficient of variation for 1982-90
BELGMXCV  __MX coefficient of variation for 1982-90
BELGMNDV  Standard deviations of __MN from __MNMN
BELGMXDV  Standard deviations of __MX from __MXMN
BELGSDMN  Standard deviations of season average during 1982-90
BELGSXMN  Standard deviations of season maximum during 1982-90
KIREMMN  Average NDVI for Kiremt season by year
KIREMMX  Maximum NDVI for the season, average for all pixels by awraja
KIRMNMN  __MN average for 1982-90
KIRMXMN  __MX average for 1982-90
KIRMNCV  __MN coefficient of variation for 1982-90
KIRMXCV  __MX coefficient of variation for 1982-90
KIRMNDV  Standard deviations of __MN from __MNMN
KIRMXDV  Standard deviations of __MX from __MXMN
KIRMSDMN  Standard deviations of season average during 1982-90
KIRMSXMN  Standard deviations of season maximum during 1982-90
BEGAMN  Average NDVI for Bega season by year
BEGAMX  Maximum NDVI for the season, average for all pixels by awraja
BEGAMNMN  __MN average for 1982-90
BEGAMXMN  __MX average for 1982-90
BEGAMNCV  __MN coefficient of variation for 1982-90
BEGAMXCV  __MX coefficient of variation for 1982-90
BEGAMNDV  Standard deviations of __MN from __MNMN
BEGAMXDV  Standard deviations of __MX from __MXMN
BEGASDMN  Standard deviations of season average during 1982-90
BEGASXMN  Standard deviations of season maximum during 1982-90
NDVIMNMX  Maximum of mean NDVIs for 3 seasons averaged for 1982-90
NDVIMXMX  Maximum of season NDVI maxima averaged for 1982-90
URBPOPSR  Percent urban population by awraja

Note: An awraja is an administrative district in Ethiopia below the province level; a woreda is an administrative district below the awraja level.

BUILDING A REGRESSION TREE

The process of constructing a regression tree is similar to that for building a classification tree. Regression-tree building centers on three major components: (1) a set of questions of the form,

Is X ≤ d?,

where X is a variable and d is a constant. As with classification, the response to such questions is yes or no; (2) goodness-of-split criteria for choosing the best split on a variable; and (3) the generation of summary statistics for terminal nodes (unique to a regression tree).


REGRESSION TREE: EXAMPLE

The regression tree in Figure 5 is based on analysis from a regional vulnerability study in Ethiopia (Seyoum et al. 1995) that uses six years (1982-87) of time-series data collected from 77 administrative regions (awrajas) of Ethiopia. The data contain 92 variables, all listed in Table 3. This study of famine (Seyoum et al. 1995) had two specific goals: (1) to determine whether it is possible to estimate or predict the percent of sedentary population in need of food assistance, and (2) to understand the variability in percentages of people in need (PPND) across awrajas and years. The dependent variable in the study is PPND.

The top rectangle in Figure 5 contains the full sample of 462 observations (N = 462), with an average PPND of 11 percent. (During the six-year period of the study, an average of 11 percent of the population was in need of food assistance.) The regression tree produces 10 terminal nodes: homogenous groups, or awraja strata. Each group is identified by a number from 1 to 10, and the specific path leading from the root node to a terminal node characterizes that group. In Figure 5, NDVI (normalized difference vegetation index) is a crude estimate of vegetation health, used as an index of greenness. The possible range of values for NDVI is -1 to 1, but its typical range is between -0.1 (for areas with no green vegetation) and 0.6 (for very green areas). The higher the index, the greener the vegetation.

The first split of the root node is based on the long-term average NDVI variable. This split successfully separates awrajas with less green vegetation from awrajas with very green vegetation; the long-term average NDVI is indeed a powerfully discriminating variable for studying regional vulnerability. In awrajas with very green vegetation, average PPND is 3 percent, much lower than in awrajas with less green vegetation. Awrajas with greener vegetation are further separated using the variable for the long-term average maximum NDVI of the main rainy season. This split results in two terminal nodes: Group 9 and Group 10. Predicted PPND is 9 percent in Group 9 and 2 percent in Group 10. The low PPND for these two groups should not be surprising: it can be argued that these regions have better supplies of food and, hence, better food accessibility than awrajas with less green vegetation. Indeed, it turns out that these awrajas extend west, south, southwest, and northwest from central Ethiopia (Webb et al. 1994, Map 6.0) and include the country's surplus grain producers. Some awrajas in Group 9 do, however, represent pockets of vulnerability in this surplus-producing region.

Awrajas in Groups 1 through 8 have at least one characteristic in common. They all descend from awrajas with a less green vegetation index (long-term average NDVI ≤ 0.335). Group 1 awrajas are characterized by low long-term average NDVI, low sheep-to-maize terms of trade, and low coefficient of variation of dry season NDVI. There are 13 awrajas at this terminal node with a predicted PPND of 14 percent. The fact that the long-term average NDVI is low suggests that the long-term annual average rainfall in these awrajas is very low and crop production is limited. This observation is justified by the low sheep-to-maize terms of trade. A household can only buy 31.4 kilograms or less of maize with one sheep, indicating that maize is scarce in these areas. These awrajas are in south Gamgofa, northeast Shoa, northeast Bale, and west Hararge regions of Ethiopia. Generally, rainfall in these regions is far below the national average.

Awrajas in Group 2 and Group 3 are both characterized by low long-term average NDVI, low sheep-to-maize terms of trade, a high coefficient of variation of dry season NDVI, and low density of all-weather roads per square kilometer. They are distinct from each other only because of household size. Group 2 awrajas have a lower household size than those in Group 3. For the three awrajas in Group 2, predicted PPND equals 74 percent. For the 21 awrajas in Group 3, predicted PPND equals 23 percent. The awrajas in these two groups are located in southern Bale, southern Sidamo, eastern Gondar, western Wollo, northeast Wollo, and north Harerge regions of Ethiopia. The transportation network in these regions is limited due to land topography. Not surprisingly, CART characterizes these two groups as low in the density of all-weather roads per square kilometer. The regions in these two groups are also known for being among the most vulnerable to famine in Ethiopia. The remaining terminal nodes can be analyzed in a similar way.

Figure 5 displays the power of CART analysis as did Figure 1. It shows that CART has successfully identified 10 groups of awrajas by using only 9 out of the 92 variables submitted for analysis (Table 3). Each group is identified by the path that begins at the root node and ends at its terminal node. The 9 variables along with their split points carry all the information that is needed to differentiate groups of awrajas from each other.

The Steps to Building a Regression Tree

The mechanism for building a regression tree is similar to that for a classification tree, but with a regression tree there is no need to specify priors or misclassification costs. Furthermore, the dependent variable in a regression tree is numeric or continuous. The splitting criterion employed is the within-node sum of squares of the dependent variable, and the goodness of a split is measured by the decrease achieved in the weighted sum of squares. Splitting criteria are discussed in detail below. The following list highlights the key steps in constructing a regression tree.

1. Starting with the root node, CART performs all possible splits on each of the predictor variables, applies a predefined node impurity measure to each split, and determines the reduction in impurity that is achieved.

2. CART then selects the "best" split by applying the goodness-of-split criteria and partitions the data set into left- and right-child nodes.

3. Because CART is recursive, it repeats steps 1 and 2 for each of the nonterminal nodes and produces the largest possible tree.

4. Finally, CART applies its pruning algorithm to the largest tree and produces a sequence of subtrees of different sizes from which an optimal tree is selected.
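Steps 1 through 3 can be sketched in Python (an illustration of the recursive-partitioning idea, not the CART program itself; pruning in step 4 is replaced here by a simple depth limit and minimum node size, both invented for the sketch):

```python
def sum_of_squares(ys):
    """Within-node sum of squares SS(t) around the node mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(rows, ys):
    """Scan every question "Is X_j <= d?" and return the split giving the
    largest impurity reduction SS(t) - SS(tL) - SS(tR)."""
    parent_ss = sum_of_squares(ys)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for d in values[:-1]:  # splitting at the maximum sends nothing right
            left = [y for r, y in zip(rows, ys) if r[j] <= d]
            right = [y for r, y in zip(rows, ys) if r[j] > d]
            delta = parent_ss - sum_of_squares(left) - sum_of_squares(right)
            if best is None or delta > best[0]:
                best = (delta, j, d)
    return best  # (impurity reduction, variable index, split point)

def grow(rows, ys, depth=0, max_depth=3, min_node=2):
    """Recursively partition; each terminal node predicts the node mean."""
    split = best_split(rows, ys) if len(ys) >= min_node and depth < max_depth else None
    if split is None or split[0] <= 0:
        return {"predict": sum(ys) / len(ys), "n": len(ys)}
    _, j, d = split
    left = [(r, y) for r, y in zip(rows, ys) if r[j] <= d]
    right = [(r, y) for r, y in zip(rows, ys) if r[j] > d]
    return {"var": j, "point": d,
            "left": grow([r for r, _ in left], [y for _, y in left],
                         depth + 1, max_depth, min_node),
            "right": grow([r for r, _ in right], [y for _, y in right],
                          depth + 1, max_depth, min_node)}
```

For example, `grow([[0], [1], [2], [3]], [0.0, 0.0, 10.0, 10.0])` splits the root at the question "Is X ≤ 1?" and produces two terminal nodes predicting 0 and 10, the node means.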

Splitting Rules and Goodness-of-Fit Criteria

There are two splitting rules or impurity functions for a regression tree. These are (1) the Least Squares (LS) function and (2) the Least Absolute Deviation (LAD) function. Since the mechanism for both rules is the same, only the LS impurity measure will be described. Under the LS criterion, node impurity is measured by within-node sum of squares, SS(t), which is defined as

SS(t) = Σ [yi(t) - ȳ(t)]², for i = 1, 2, ..., Nt,

where yi(t) = individual values of the dependent variable at node t, and

ȳ(t) = the mean of the dependent variable at node t. Given the impurity function, SS(t), and a split s that sends cases to left (tL) and right (tR) nodes, the goodness of a split is measured by the function

Δ(s, t) = SS(t) - SS(tR) - SS(tL),

where SS(tR) is the sum of squares of the right child node, and SS(tL) is the sum of squares of the left child node.

The best split is the one for which Δ(s, t) is the highest. From the series of splits generated by a variable at a node, the rule is to choose the split that results in the maximum reduction in the impurity of the parent node.
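Plugging numbers into these definitions makes the criterion concrete. The short sketch below computes SS(t) for a parent node and the impurity reduction achieved by one candidate split (all values are invented for illustration):

```python
def ss(ys):
    # Within-node sum of squares SS(t) around the node mean
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

# Hypothetical parent node of six dependent-variable values, and one
# candidate split sending low values left and high values right
node = [2, 3, 4, 20, 22, 24]
left, right = [2, 3, 4], [20, 22, 24]

# Goodness of the split: SS(t) - SS(tL) - SS(tR)
delta = ss(node) - ss(left) - ss(right)  # 551.5 - 2 - 8 = 541.5
```

Because both child nodes are tight around their own means, almost all of the parent's sum of squares is removed; a split that mixed low and high values would yield a much smaller reduction.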

An alternative to SS(t) is to use the weighted variance of the left and right nodes, where the weights are the proportions of cases at nodes tL and tR. Let p(t) = Nt/N be the proportion of cases at node t, and let s²(t) be the variance of the dependent variable at node t. The variance is defined as

s²(t) = SS(t)/Nt = (1/Nt) Σ [yi(t) - ȳ(t)]², for i = 1, 2, ..., Nt.

The goodness of a split is now measured by

f(s, t) = s²(t) - [pL s²(tL) + pR s²(tR)].

The best split is the one for which f(s, t) is the highest or, equivalently, for which the weighted sum of variances [pL s²(tL) + pR s²(tR)] is the smallest. The procedure separates high values of the dependent variable from low values and results in left and right nodes that are internally more homogenous than the parent node. Note that as each split sends observations to the left and right nodes, the mean of the dependent variable in one of the resulting nodes is lower than the mean at the parent node (see the example in Figure 5).
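The sum-of-squares and weighted-variance criteria always rank splits identically: with pL and pR taken as the shares of the parent node's cases sent left and right, f(s, t) equals the sum-of-squares reduction divided by the number of cases at the node. A quick numeric check in Python (invented values):

```python
def ss(ys):
    # Within-node sum of squares around the node mean
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def var(ys):
    # Within-node variance: SS(t) / Nt
    return ss(ys) / len(ys)

# Hypothetical parent node and one candidate split
node = [2, 3, 4, 20, 22, 24]
left, right = [2, 3, 4], [20, 22, 24]

delta = ss(node) - ss(left) - ss(right)               # SS criterion
p_l, p_r = len(left) / len(node), len(right) / len(node)
f = var(node) - (p_l * var(left) + p_r * var(right))  # variance criterion
# f equals delta / len(node), so both criteria pick the same best split
```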

TREE PRUNING

After building the largest possible tree, CART applies its pruning algorithm, using either cross-validation or an independent test sample to measure the goodness of fit of the tree. LS uses Mean Squared Error (MSE) to measure the accuracy of the predictor in order to rank the sequence of trees generated by pruning; LAD employs Mean Absolute Deviation (MAD). Once a minimal-cost tree (the tree with the lowest MSE or MAD) is identified, an optimal tree is chosen by applying the one-standard-error rule to the minimal-cost tree. The one-standard-error rule is optional and can be changed by the analyst.
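The one-standard-error rule itself is simple enough to sketch. Given the error estimates for the pruning sequence (the numbers below are hypothetical), it selects the smallest subtree whose error is within one standard error of the minimum:

```python
def one_se_tree(subtrees):
    """Each entry is (number of terminal nodes, cross-validated error,
    standard error of that estimate). Return the optimal subtree under
    the one-standard-error rule."""
    minimal = min(subtrees, key=lambda t: t[1])  # minimal-cost tree
    threshold = minimal[1] + minimal[2]          # minimum error + 1 SE
    candidates = [t for t in subtrees if t[1] <= threshold]
    return min(candidates, key=lambda t: t[0])   # smallest such tree
```

With a hypothetical pruning sequence `[(10, 5.0, 0.5), (6, 4.2, 0.4), (3, 4.5, 0.4), (1, 9.0, 0.6)]`, the minimal-cost tree has 6 terminal nodes (error 4.2), but the 3-node tree's error of 4.5 falls within 4.2 + 0.4, so the rule selects the more parsimonious 3-node tree.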

After choosing an optimal tree or, for that matter, any subtree from the sequence of subtrees generated in the pruning process, CART computes summary statistics for each of the terminal nodes. If LS is chosen as the splitting rule, CART computes the mean and standard deviation of the dependent variable; the mean of the terminal node becomes the predicted value of the dependent variable for cases in that terminal node. If LAD is selected, CART generates the median and the average absolute deviation of the dependent variable. As with LS, the median becomes the predicted value of the dependent variable for that terminal node.
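The two sets of terminal-node summaries can be illustrated with a short sketch (the node values are hypothetical; CART's own output computes these internally):

```python
def ls_summary(ys):
    # LS summaries: the mean (the node's predicted value) and the
    # standard deviation of the dependent variable
    n = len(ys)
    mean = sum(ys) / n
    sd = (sum((y - mean) ** 2 for y in ys) / n) ** 0.5
    return mean, sd

def lad_summary(ys):
    # LAD summaries: the median (the node's predicted value) and the
    # average absolute deviation from the median
    ys = sorted(ys)
    n = len(ys)
    median = ys[n // 2] if n % 2 else (ys[n // 2 - 1] + ys[n // 2]) / 2
    mad = sum(abs(y - median) for y in ys) / n
    return median, mad
```

For a node with values `[5, 7, 9, 30]`, LS predicts the mean 12.75, while LAD predicts the median 8.0; the outlying value 30 pulls the mean upward but leaves the median unchanged, which is why the two rules can give quite different predictions.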

This form of generating predictions may sound crude to those who are familiar with predictions from parametric models. But it should be noted that CART regression predictions are arrived at by recursively splitting the sample and creating groups or clusters that are progressively more homogenous than their ancestor nodes. Breiman et al. (1984) suggest running OLS models in each group created by the regression tree and comparing the OLS predictions against each other. A considerable difference between the predicted values of OLS models for each group is an indication that CART has succeeded in uncovering the complex structure existing in the data set.
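Breiman et al.'s suggested check can be sketched with a simple one-predictor OLS fit per terminal-node group (the data are hypothetical; an actual check would fit full OLS models on the study's covariates):

```python
def ols_line(xs, ys):
    # Closed-form simple OLS fit; returns (intercept, slope)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Two hypothetical terminal-node groups with the same predictor but
# opposite dependent-variable relationships
group_a = ([1, 2, 3, 4], [2, 4, 6, 8])   # y rises with x
group_b = ([1, 2, 3, 4], [9, 7, 5, 3])   # y falls with x

_, slope_a = ols_line(*group_a)
_, slope_b = ols_line(*group_b)
# Sharply different within-group slopes suggest the tree has isolated
# structurally different subsets of the data
```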