Classification and Regression Trees, CART - A user manual for identifying indicators of vulnerability to famine and chronic food insecurity + disk (IFPRI, 1999, 56 p.)
Recall from Chapter 1 that CART produces a classification tree when the dependent variable is categorical and a regression tree when the dependent variable is continuous. The process of constructing a regression tree is similar to that for a classification tree, but in building a regression tree there is no need to use priors and class assignment rules. Splitting rules, goodness-of-fit criteria, and measures of the accuracy of a regression tree all differ from those for a classification tree. These issues are discussed in detail in the two subsections that follow the regression tree example below.
As with classification, regression-tree building centers on three major components: (1) a set of questions of the form, Is X ≤ d?, where X is a variable and d is a constant; the response to such questions is yes or no; (2) goodness-of-split criteria for choosing the best split on a variable; and (3) generation of summary statistics for terminal nodes. The latter component is unique to a regression tree. In classification trees, the terminal nodes are assigned to a specific class according to the class assignment rule. In regression trees, however, there are no classes to which terminal nodes are assigned. Instead, for each of the terminal nodes produced by CART regression, summary statistics of the dependent variable are computed.
The main purpose of CART regression is to produce a tree-structured predictor or prediction rule (Breiman et al. 1984). This predictor serves two major goals: (1) to predict accurately the dependent variable from the future or new values of the predictor variables; and (2) to explain the relationships that exist between the dependent and predictor variables. The CART regression predictor is constructed by detecting the heterogeneity (in terms of variance of the dependent variable) that exists in the data set and then purifying the data set. CART does this by recursively partitioning a data set into groups or terminal nodes that are internally more homogenous than their ancestor nodes. At each terminal node, the mean value of the dependent variable is taken as the predicted value. If the objective of a regression tree is explanation, then this is achieved by tracking the paths of a tree to a specific terminal node.
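The prediction step can be sketched in a few lines of Python. The tree below is a hypothetical two-leaf tree, not CART output; the variable name, split point, and node means are illustrative (the means loosely echo the PPND figures discussed in the example below):

```python
# Minimal sketch (not CART itself) of a tree-structured predictor: answer
# yes/no questions of the form "Is X <= d?" until a terminal node is
# reached, then return that node's mean of the dependent variable.
def predict(x, node):
    while "mean" not in node:   # nonterminal nodes hold a split question
        node = node["left"] if x[node["var"]] <= node["split"] else node["right"]
    return node["mean"]

# Hypothetical two-leaf tree; the variable name, split point, and node
# means are illustrative only.
tree = {
    "var": "ndvi", "split": 0.335,
    "left":  {"mean": 14.0},   # less green vegetation: higher predicted PPND
    "right": {"mean": 3.0},    # greener vegetation: lower predicted PPND
}

print(predict({"ndvi": 0.2}, tree), predict({"ndvi": 0.5}, tree))   # 14.0 3.0
```

Tracking which branch a case follows to its terminal node is exactly the explanatory use of the tree described above.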
An example of a regression tree is given in Figure 5, and the list of variables supplied for generating the tree is given in Table 3.
Table 3 - Variables for awraja-level analysis
Variable     Definition
MZSHTTRD     Retail price of maize/producer price of sheep terms of trade
MZSHTTMN     Average of MZSHTTRD during 1981-87
MZSHTTDV     Standard deviations of __RD from __MN
MZSHTTCV     Coefficient of variation of __RD during 1981-87
CERLPROD     Gross production of all cereals in tons
CERLMN       Mean of CERLPROD during 1981-87
CERLDV       Standard deviations of CERLPROD from CERLMN
CERLCV       Coefficient of variation of CERLPROD during 1981-87
PCTBELG      Percent of annual cereal production from Belg season
PCTBLGMN     Average of PCTBELG during 1981-87
PCTBLGDV     Standard deviations of PCTBELG from PCTBLGMN
PCTBLGCV     Coefficient of variation of PCTBELG during 1981-87
CERLPP       Gross production of all cereals per capita rural population
AVGFAMSZ     Average size of rural household
DEPRATIO     Dependency ratio (population under 15 and over 60 years old/total population 15-59 years old)
LITERATE     Literacy ratio of males 15 years old/total population 15 years old
TOTFERTR     Total fertility rate
GENFERTR     General fertility rate
PAR4549R     Average parity (45-49 years)
ASDRRURL     Age-specific death rates in rural areas
IMRRURAL     Infant mortality rate in rural areas
NPERRMRU     Average number of people sharing a bedroom in rural areas
LIFEEXPR     Life expectancy in rural areas
CRDBRTHR     Crude birth rate in rural areas
GRRERRUR     Gross reproductive rate
MLUPSLRM     Soil loss rate estimates from Master Land Use Plan
POPUME       Urban male population
POPUFE       Urban female population
POPURME      Rural male population
POPRFE       Rural female population
ALLKMKM2     All-weather road kilometers per square kilometer
AVGEP84R     Average land elevation weighted by rural population
HLTHFIND     Index of health infrastructure based on need
PRPRFHHD     Share of female heads in total number of household heads
PERENNL0     Percent farmers with no perennial crops
PERENNL1-5   Percent farmers with 1-5 perennial crops
ANNUAL0      Percent farmers with no annual crops
ANNUAL1-8    Percent farmers with 1-8 annual crops
DISTBGMK     Distance to large market (kilometers)
DISTSMMK     Distance to small market (kilometers)
AVGHHINC     Average household income
GINIHINC     Gini coefficient of average household income by awraja
PCTFRMRS     Percent rural population who are farmers
AVGPCINC     Average farm income per capita
GINIPINC     Gini coefficient of AVGPCINC by woreda weighted by population
PCTFRALW     Share of farmers that always or sometimes plant belg crop
PCTFRSOM     Share of farmers that never plant belg crop
AVGNOXEN     Average number of oxen owned
PCT0OXEN     Percent households with no oxen
ANNLPCHA     Average area cultivated with annual crops per capita
PRNLPCHA     Average area cultivated with perennial crops per capita
ANLAVG       Average area cultivated with annual crops by household
PERLAVG      Average area cultivated with perennial crops by household
FALAVGHA     Average area fallowed by household
AVGARAHA     Average arable land owned
PCTIRRIG     Percent of farmers using irrigation
IRRIGHA      Total irrigated area
GINITLU      Gini coefficient of TLU ownership (all species)
GINIPCMK     Gini coefficient of percent crop marketed
PRIM0014     Percent children 0-14 years old with any schooling
BELGMN       Average NDVI for Belg season by year
BELGMX       Maximum NDVI for the season, averaged for all pixels by awraja
BELGMNMN     __MN average for 1982-90
BELGMXMN     __MX average for 1982-90
BELGMNCV     __MN coefficient of variation for 1982-90
BELGMXCV     __MX coefficient of variation for 1982-90
BELGMNDV     Standard deviations of __MN from __MNMN
BELGMXDV     Standard deviations of __MX from __MXMN
BELGSDMN     Standard deviations of season average during 1982-90
BELGSXMN     Standard deviations of season maximum during 1982-90
KIREMMN      Average NDVI for Kiremt season by year
KIREMMX      Maximum NDVI for the season, averaged for all pixels by awraja
KIRMNMN      __MN average for 1982-90
KIRMXMN      __MX average for 1982-90
KIRMNCV      __MN coefficient of variation for 1982-90
KIRMXCV      __MX coefficient of variation for 1982-90
KIRMNDV      Standard deviations of __MN from __MNMN
KIRMXDV      Standard deviations of __MX from __MXMN
KIRMSDMN     Standard deviations of season average during 1982-90
KIRMSXMN     Standard deviations of season maximum during 1982-90
BEGAMN       Average NDVI for Bega season by year
BEGAMX       Maximum NDVI for the season, averaged for all pixels by awraja
BEGAMNMN     __MN average for 1982-90
BEGAMXMN     __MX average for 1982-90
BEGAMNCV     __MN coefficient of variation for 1982-90
BEGAMXCV     __MX coefficient of variation for 1982-90
BEGAMNDV     Standard deviations of __MN from __MNMN
BEGAMXDV     Standard deviations of __MX from __MXMN
BEGASDMN     Standard deviations of season average during 1982-90
BEGASXMN     Standard deviations of season maximum during 1982-90
NDVIMNMX     Maximum of mean NDVIs for 3 seasons averaged for 1982-90
NDVIMXMX     Maximum of season NDVI maxima averaged for 1982-90
URBPOPSR     Percent urban population by awraja
Note: An awraja is an administrative district in Ethiopia below the province level; a woreda is an administrative district below the awraja level.
BUILDING A REGRESSION TREE
REGRESSION TREE: EXAMPLE
The regression tree in Figure 5 is based on analysis from a regional vulnerability study in Ethiopia (Seyoum et al. 1995) that uses six years (1982-87) of time-series data collected from 77 administrative regions (awrajas) of Ethiopia. The data contain 92 variables, all listed in Table 3. This study of famine (Seyoum et al. 1995) had two specific goals: (1) to determine whether it is possible to estimate or predict the percentage of the sedentary population in need of food assistance, and (2) to understand the variability in the percentages of people in need (PPND) across awrajas and years. The dependent variable in the study is PPND.
The top rectangle in Figure 5 contains all 462 observations (N = 462), with an average PPND of 11 percent. (During the six-year period of the study, an average of 11 percent of the population was in need of food assistance.) The regression tree produces 10 terminal nodes, or homogeneous groups of awrajas (awraja strata). Each group is identified by a number from 1 to 10, and the specific path leading from the root node to the terminal node for each group characterizes that group. In Figure 5, NDVI (normalized difference vegetation index) is a crude estimate of vegetation health and is used as an index of greenness. The possible range of values for NDVI is between -1 and 1, but its typical range is between -0.1 (for a non-green area) and 0.6 (for a very green area). The higher the index, the greener the vegetation.
The first split of the root node is based on the long-term average NDVI variable. This split successfully separates awrajas with less green vegetation from awrajas with very green vegetation. The long-term average NDVI is indeed a powerfully discriminating variable for studying regional vulnerability. In awrajas with very green vegetation, average PPND is 3 percent, much lower than in awrajas with less green vegetation. Awrajas with greener vegetation are further separated using the variable for the long-term average maximum NDVI of the main rainy season. This split results in two terminal nodes: Group 9 and Group 10. Predicted PPND is 9 percent in Group 9 and 2 percent in Group 10. The low PPND for these two groups should not be surprising. It can be argued that these regions have better supplies of food and, hence, better food accessibility, than awrajas with less green vegetation. Indeed, it turns out that these awrajas extend west, south, southwest, and northwest from central Ethiopia (Webb et al. 1994, Map 6.0). These awrajas also produce the country's surplus grain. Some awrajas in Group 9, however, do represent pockets of vulnerability in this surplus-producing region.
Awrajas in Groups 1 through 8 have at least one characteristic in common: they all descend from awrajas with a less green vegetation index (long-term average NDVI ≤ 0.335). Group 1 awrajas are characterized by a low long-term average NDVI, low sheep-to-maize terms of trade, and a low coefficient of variation of dry season NDVI. There are 13 awrajas at this terminal node, with a predicted PPND of 14 percent. The fact that the long-term average NDVI is low suggests that long-term annual average rainfall in these awrajas is very low and crop production is limited. This observation is corroborated by the low sheep-to-maize terms of trade: a household can buy only 31.4 kilograms or less of maize with one sheep, indicating that maize is scarce in these areas. These awrajas are in the south Gamgofa, northeast Shoa, northeast Bale, and west Hararge regions of Ethiopia. Generally, rainfall in these regions is far below the national average.
Awrajas in Group 2 and Group 3 are both characterized by low long-term average NDVI, low sheep-to-maize terms of trade, a high coefficient of variation of dry season NDVI, and low density of all-weather roads per square kilometer. They are distinct from each other only because of household size. Group 2 awrajas have a lower household size than those in Group 3. For the three awrajas in Group 2, predicted PPND equals 74 percent. For the 21 awrajas in Group 3, predicted PPND equals 23 percent. The awrajas in these two groups are located in southern Bale, southern Sidamo, eastern Gondar, western Wollo, northeast Wollo, and north Harerge regions of Ethiopia. The transportation network in these regions is limited due to land topography. Not surprisingly, CART characterizes these two groups as low in the density of all-weather roads per square kilometer. The regions in these two groups are also known for being among the most vulnerable to famine in Ethiopia. The remaining terminal nodes can be analyzed in a similar way.
Figure 5 displays the power of CART analysis, as did Figure 1. It shows that CART has successfully identified 10 groups of awrajas using only 9 of the 92 variables submitted for analysis (Table 3). Each group is identified by the path that begins at the root node and ends at its terminal node. The 9 variables, along with their split points, carry all the information needed to differentiate the groups of awrajas from each other.
The Steps to Building a Regression Tree
The mechanism for building a regression tree is similar to that for a classification tree, but with a regression tree there is no need to specify priors and misclassification costs. Furthermore, the dependent variable in a regression tree is numeric or continuous. The splitting criterion employed is the within-node sum of squares of the dependent variable, and the goodness of a split is measured by the decrease achieved in the weighted sum of squares. Splitting criteria are discussed in detail below. The following list highlights the key steps in constructing a regression tree.
1. Starting with the root node, CART performs all possible splits on each of the predictor variables, applies a predefined node impurity measure to each split, and determines the reduction in impurity that is achieved.
2. CART then selects the "best" split by applying the goodness-of-split criteria and partitions the data set into left- and right-child nodes.
3. Because CART is recursive, it repeats steps 1 and 2 for each of the nonterminal nodes and produces the largest possible tree.
4. Finally, CART applies its pruning algorithm to the largest tree and produces a sequence of subtrees of different sizes from which an optimal tree is selected.
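The growing phase (steps 1 through 3) can be sketched in Python for a single predictor under the Least Squares rule. This is an illustrative toy, not the CART implementation itself, and the pruning step (step 4) is omitted:

```python
# Toy sketch of steps 1-3 for a single predictor under the Least Squares
# rule (step 4, pruning, is omitted). Illustrative only, not CART itself.
def ss(vals):
    """Within-node sum of squares SS(t)."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(xs, y):
    """Steps 1-2: try every split 'Is x <= d?' and keep the one with the
    largest impurity reduction SS(t) - SS(tL) - SS(tR)."""
    best = None
    for d in sorted(set(xs))[:-1]:   # candidate split points
        left  = [v for x, v in zip(xs, y) if x <= d]
        right = [v for x, v in zip(xs, y) if x > d]
        gain = ss(y) - ss(left) - ss(right)
        if best is None or gain > best[1]:
            best = (d, gain)
    return best

def grow(xs, y, min_size=2):
    """Step 3: recurse on each child until nodes are pure or too small."""
    if len(y) < min_size or len(set(xs)) == 1:
        return {"mean": sum(y) / len(y), "n": len(y)}
    d, _ = best_split(xs, y)
    l = [(x, v) for x, v in zip(xs, y) if x <= d]
    r = [(x, v) for x, v in zip(xs, y) if x > d]
    return {"split": d,
            "left":  grow([x for x, _ in l], [v for _, v in l], min_size),
            "right": grow([x for x, _ in r], [v for _, v in r], min_size)}

tree = grow([1, 2, 3, 10, 11, 12], [5, 6, 5, 20, 21, 20])
print(tree["split"])   # 3: low y-values are split from high ones
```

The root split lands between the low and high values of the dependent variable, which is exactly the separation the splitting rule is designed to find.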
Splitting Rules and Goodness-of-Fit Criteria
There are two splitting rules or impurity functions for a regression tree. These are (1) the Least Squares (LS) function and (2) the Least Absolute Deviation (LAD) function. Since the mechanism for both rules is the same, only the LS impurity measure will be described. Under the LS criterion, node impurity is measured by within-node sum of squares, SS(t), which is defined as
SS(t) = Σ_{i ∈ t} (y_{i} - ȳ(t))²,

where y_{i} = individual values of the dependent variable at node t, and ȳ(t) = the mean of the dependent variable at node t. The goodness of a split s at node t is then measured by the reduction in the sum of squares it achieves,

Δ(s, t) = SS(t) - SS(t_{R}) - SS(t_{L}),

where SS(t_{R}) is the sum of squares of the right child node, and SS(t_{L}) is the sum of squares of the left child node.
The best split is the split for which Δ(s, t) is the highest. From the series of splits generated by a variable at a node, the rule is to choose the split that results in the maximum reduction in the impurity of the parent node.
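A small worked example in Python, using made-up values of the dependent variable at a node and one candidate split:

```python
# Worked example of the LS goodness-of-split measure, with made-up values
# of the dependent variable at a node t and one candidate split s.
y = [2.0, 3.0, 2.5, 9.0, 10.0, 11.0]   # dependent variable at node t
left, right = y[:3], y[3:]             # split s: first three cases go left

def ss(vals):
    """Within-node sum of squares SS(t)."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

delta = ss(y) - ss(right) - ss(left)   # impurity reduction achieved by s
print(ss(y), delta)                    # 86.875 84.375
```

Because this split cleanly separates the low values from the high ones, nearly all of the parent node's impurity is removed.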
An alternative to SS(t) is to use the weighted variance of the left and right nodes, where the weights are the proportions of cases at nodes t_{L} and t_{R}. Let p(t) = N_{t}/N be the proportion of cases at node t, and let s^{2}(t) be the variance of the dependent variable at node t. The variance is defined as

s^{2}(t) = (1/N_{t}) Σ_{i ∈ t} (y_{i} - ȳ(t))².
The goodness of a split is now measured by
f(s,t)=s^{2}(t)-[p_{L}s^{2}(t_{L})+p_{R}s^{2}(t_{R})].
The best split is the one for which f(s,t) is the highest or, equivalently, for which the weighted sum of the variances [p_{L}s^{2}(t_{L})+p_{R}s^{2}(t_{R})] is the smallest. The procedure separates high values of the dependent variable from low values and results in left and right nodes that are internally more homogeneous than the parent node. It should be noted that as each split sends observations to the left and right nodes, the mean of the dependent variable in one of the resulting nodes is lower than the mean at the parent node (see the example in Figure 5).
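The two criteria rank splits identically, since the variance is just the sum of squares divided by the number of cases at the node. A quick numerical check in Python (made-up data; p_L and p_R here are taken as proportions of the cases at node t):

```python
# Numerical check (made-up data): the weighted-variance criterion f(s, t)
# agrees with the sum-of-squares criterion, since s^2(t) = SS(t)/N(t).
# p_L and p_R are taken as proportions of the cases at node t.
def mean(vals):
    return sum(vals) / len(vals)

def var(vals):
    """Variance s^2 of the dependent variable at a node."""
    m = mean(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

y = [2.0, 3.0, 2.5, 9.0, 10.0, 11.0]   # dependent variable at node t
left, right = y[:3], y[3:]             # candidate split s
p_L, p_R = len(left) / len(y), len(right) / len(y)

f = var(y) - (p_L * var(left) + p_R * var(right))
print(round(f, 6))   # 14.0625, i.e., the SS reduction 84.375 divided by 6
```

With this convention f(s, t) works out to Δ(s, t)/N_{t}, so maximizing either quantity selects the same split.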
TREE PRUNING
After building the largest possible tree, CART applies its pruning algorithm, using either cross-validation or an independent test sample to measure the goodness of fit of the tree. LS uses mean squared error (MSE) to measure the accuracy of the predictor in order to rank the sequence of trees generated by pruning; LAD employs mean absolute deviation (MAD). Once a minimal-cost tree (the tree with the lowest MSE or MAD) is identified, an optimal tree is chosen by applying the one-standard-error rule to the minimal-cost tree. The one-standard-error rule is optional and can be changed by the analyst.
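For concreteness, the two accuracy measures can be computed as follows (Python sketch; the test-sample outcomes and predictions are made up):

```python
# Accuracy measures used to rank the sequence of pruned subtrees: mean
# squared error (LS) and mean absolute deviation (LAD). The test-sample
# outcomes and tree predictions below are made up for illustration.
actual    = [10.0, 12.0, 8.0, 20.0]   # observed dependent variable
predicted = [11.0, 11.0, 9.0, 16.0]   # terminal-node predictions

n = len(actual)
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
mad = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
print(mse, mad)   # 4.75 1.75
```

Note how the one large error (20 versus 16) dominates the MSE much more than the MAD, which is why the two rules can rank subtrees differently.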
After choosing an optimal tree or, for that matter, any subtree from the sequence of subtrees generated in the pruning process, CART computes summary statistics for each of the terminal nodes. If LS is chosen as the splitting rule, CART computes the mean and standard deviation of the dependent variable; the mean of the terminal node becomes the predicted value of the dependent variable for cases in that terminal node. If LAD is selected, CART generates the median and the average absolute deviation of the dependent variable; as with LS, the median becomes the predicted value of the dependent variable for that terminal node.
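A short Python sketch of the node summary statistics under each rule, on made-up values at a single terminal node:

```python
# Terminal-node summary statistics under each rule, on made-up values at a
# single terminal node. LS reports the mean and standard deviation; LAD
# reports the median and the average absolute deviation from the median.
import statistics

node_y = [5.0, 7.0, 8.0, 9.0, 30.0]   # dependent variable at the node

ls_prediction  = statistics.mean(node_y)     # node mean -> LS prediction
ls_spread      = statistics.pstdev(node_y)   # standard deviation
lad_prediction = statistics.median(node_y)   # node median -> LAD prediction
lad_spread     = sum(abs(v - lad_prediction) for v in node_y) / len(node_y)

print(ls_prediction, lad_prediction)   # 11.8 8.0
```

Note how the LAD prediction (the median) is far less affected by the one extreme value than the LS prediction (the mean), which is the usual reason for preferring LAD with outlier-prone data.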
This form of generating predictions may sound crude to those who are familiar with predictions from parametric models. But it should be noted that CART regression predictions are arrived at by recursively splitting the sample and creating groups or clusters that are progressively more homogeneous than their ancestor nodes. Breiman et al. (1984) suggest running OLS models in each group created by the regression tree and comparing the OLS predictions against each other. A considerable difference between the predicted values of the OLS models for each group is an indication that CART has succeeded in uncovering the complex structure existing in the data set.
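This diagnostic can be sketched as follows. The groups and data are invented for illustration, and the OLS fit uses the standard one-variable closed form:

```python
# Fit a separate one-variable OLS line in each tree-defined group and
# compare the fitted relationships. Groups and data are invented for
# illustration; the closed-form slope is cov(x, y) / var(x).
def ols_slope_intercept(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Two hypothetical terminal-node groups with very different x-y relationships.
g1_x, g1_y = [1, 2, 3, 4], [1.0, 0.5, 0.0, -0.5]   # falling relationship
g2_x, g2_y = [1, 2, 3, 4], [3.0, 5.0, 7.0, 9.0]    # rising relationship

print(ols_slope_intercept(g1_x, g1_y))   # (-0.5, 1.5)
print(ols_slope_intercept(g2_x, g2_y))   # (2.0, 1.0)
```

A considerable difference between the fitted lines, as here where the slopes even differ in sign, is the kind of evidence of distinct group structure that Breiman et al. point to.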