
4. Regression Trees: An Overview

Recall from Chapter 1 that CART produces a classification tree when the dependent variable is categorical and a regression tree when the dependent variable is continuous. The process of constructing a regression tree is similar to that for a classification tree, but in building a regression tree there is no need to use priors and class assignment rules. The splitting rules, goodness-of-fit criteria, and measures of the accuracy of a regression tree differ from those for a classification tree. These issues are discussed in detail in the two subsections that follow the regression-tree example below.

As with classification, regression-tree building centers on three major components: (1) a set of questions of the form, Is X ≤ d?, where X is a variable and d is a constant; the response to such a question is yes or no; (2) goodness-of-split criteria for choosing the best split on a variable; and (3) generation of summary statistics for terminal nodes. The third component is unique to regression trees. In classification trees, each terminal node is assigned to a specific class according to the class assignment rule. In regression trees, there are no classes to which terminal nodes can be assigned; instead, for each terminal node produced by CART regression, summary statistics of the dependent variable are computed.

The main purpose of CART regression is to produce a tree-structured predictor or prediction rule (Breiman et al. 1984). This predictor serves two major goals: (1) to predict the dependent variable accurately from new values of the predictor variables, and (2) to explain the relationships that exist between the dependent and predictor variables. The CART regression predictor is constructed by detecting the heterogeneity (in terms of the variance of the dependent variable) that exists in the data set and then purifying the data set: CART recursively partitions the data into groups or terminal nodes that are internally more homogeneous than their ancestor nodes. At each terminal node, the mean value of the dependent variable is taken as the predicted value. If the objective of a regression tree is explanation, it is achieved by tracking the path of the tree to a specific terminal node.
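To make this concrete, the short Python sketch below uses scikit-learn's DecisionTreeRegressor, a CART-style implementation that is not the CART program described in this manual, to verify that the prediction for every case in a terminal node equals the node mean of the dependent variable. All data and parameter values are invented for the illustration.

# Minimal illustration (not the CART program itself): in a CART-style
# regression tree, each prediction is the mean of the dependent
# variable in the terminal node to which a case falls.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                # one predictor variable
y = np.where(X[:, 0] <= 5, 2.0, 8.0) + rng.normal(0, 0.5, 200)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

leaf = tree.apply(X)                                  # terminal-node id per case
for node in np.unique(leaf):
    mask = leaf == node
    # the predicted value equals the within-node mean of y
    assert np.allclose(tree.predict(X[mask]), y[mask].mean())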

An example of a regression tree is given in Figure 5, and the list of variables supplied for generating the tree is given in Table 3.


Figure 5 - CART analysis of 77 awrajas, 1982-87

Table 3 - Variables for awraja-level analysis

Variable      Definition
MZSHTTRD      Retail price of maize/producer price of sheep (terms of trade)
MZSHTTMN      Average of MZSHTTRD during 1981-87
MZSHTTDV      Standard deviation of MZSHTTRD from MZSHTTMN
MZSHTTCV      Coefficient of variation of MZSHTTRD during 1981-87
CERLPROD      Gross production of all cereals in tons
CERLMN        Mean of CERLPROD during 1981-87
CERLDV        Standard deviation of CERLPROD from CERLMN
CERLCV        Coefficient of variation of CERLPROD during 1981-87
PCTBELG       Percent of annual cereal production from the Belg season
PCTBLGMN      Average of PCTBELG during 1981-87
PCTBLGDV      Standard deviation of PCTBELG from PCTBLGMN
PCTBLGCV      Coefficient of variation of PCTBELG during 1981-87
CERLPP        Gross production of all cereals per capita of rural population
AVGFAMSZ      Average size of rural household
DEPRATIO      Dependency ratio (population under 15 and over 60 years old/population 15-59 years old)
LITERATE      Literacy ratio (literate males over 15 years old/total population over 15 years old)
TOTFERTR      Total fertility rate
GENFERTR      General fertility rate
PAR4549R      Average parity (45-49 years)
ASDRRURL      Age-specific death rates in rural areas
IMRRURAL      Infant mortality rate in rural areas
NPERRMRU      Average number of people sharing a bedroom in rural areas
LIFEEXPR      Life expectancy in rural areas
CRDBRTHR      Crude birth rate in rural areas
GRRERRUR      Gross reproductive rate
MLUPSLRM      Soil loss rate estimates from the Master Land Use Plan
POPUME        Urban male population
POPUFE        Urban female population
POPURME       Rural male population
POPRFE        Rural female population
ALLKMKM2      Kilometers of all-weather road per square kilometer
AVGEP84R      Average land elevation weighted by rural population
HLTHFIND      Index of health infrastructure based on need
PRPRFHHD      Share of female heads in total number of household heads
PERENNL0      Percent of farmers with no perennial crops
PERENNL1-5    Percent of farmers with 1-5 perennial crops
ANNUAL0       Percent of farmers with no annual crops
ANNUAL1-8     Percent of farmers with 1-8 annual crops
DISTBGMK      Distance to large market (kilometers)
DISTSMMK      Distance to small market (kilometers)
AVGHHINC      Average household income
GINIHINC      Gini coefficient of average household income by awraja
PCTFRMRS      Percent of rural population who are farmers
AVGPCINC      Average farm income per capita
GINIPINC      Gini coefficient of AVGPCINC by woreda, weighted by population
PCTFRALW      Share of farmers that always or sometimes plant a Belg crop
PCTFRSOM      Share of farmers that never plant a Belg crop
AVGNOXEN      Average number of oxen owned
PCT0OXEN      Percent of households with no oxen
ANNLPCHA      Average area cultivated with annual crops per capita
PRNLPCHA      Average area cultivated with perennial crops per capita
ANLAVG        Average area cultivated with annual crops by household
PERLAVG       Average area cultivated with perennial crops by household
FALAVGHA      Average area fallowed by household
AVGARAHA      Average arable land owned
PCTIRRIG      Percent of farmers using irrigation
IRRIGHA       Total irrigated area
GINITLU       Gini coefficient of tropical livestock unit (TLU) ownership (all species)
GINIPCMK      Gini coefficient of percent of crop marketed
PRIM0014      Percent of children 0-14 years old with any schooling
BELGMN        Average NDVI for the Belg season by year
BELGMX        Maximum NDVI for the season, averaged over all pixels by awraja
BELGMNMN      __MN average for 1982-90
BELGMXMN      __MX average for 1982-90
BELGMNCV      __MN coefficient of variation for 1982-90
BELGMXCV      __MX coefficient of variation for 1982-90
BELGMNDV      Standard deviation of __MN from __MNMN
BELGMXDV      Standard deviation of __MX from __MXMN
BELGSDMN      Standard deviation of season average during 1982-90
BELGSXMN      Standard deviation of season maximum during 1982-90
KIREMMN       Average NDVI for the Kiremt season by year
KIREMMX       Maximum NDVI for the season, averaged over all pixels by awraja
KIRMNMN       __MN average for 1982-90
KIRMXMN       __MX average for 1982-90
KIRMNCV       __MN coefficient of variation for 1982-90
KIRMXCV       __MX coefficient of variation for 1982-90
KIRMNDV       Standard deviation of __MN from __MNMN
KIRMXDV       Standard deviation of __MX from __MXMN
KIRMSDMN      Standard deviation of season average during 1982-90
KIRMSXMN      Standard deviation of season maximum during 1982-90
BEGAMN        Average NDVI for the Bega season by year
BEGAMX        Maximum NDVI for the season, averaged over all pixels by awraja
BEGAMNMN      __MN average for 1982-90
BEGAMXMN      __MX average for 1982-90
BEGAMNCV      __MN coefficient of variation for 1982-90
BEGAMXCV      __MX coefficient of variation for 1982-90
BEGAMNDV      Standard deviation of __MN from __MNMN
BEGAMXDV      Standard deviation of __MX from __MXMN
BEGASDMN      Standard deviation of season average during 1982-90
BEGASXMN      Standard deviation of season maximum during 1982-90
NDVIMNMX      Maximum of mean NDVIs for 3 seasons, averaged for 1982-90
NDVIMXMX      Maximum of season NDVI maxima, averaged for 1982-90
URBPOPSR      Percent urban population by awraja

Note: An awraja is an administrative district in Ethiopia below the province level; a woreda is an administrative district below the awraja level.


REGRESSION TREE: EXAMPLE

The regression tree in Figure 5 is based on an analysis from a regional vulnerability study in Ethiopia (Seyoum et al. 1995) that uses six years (1982-87) of time-series data collected from 77 administrative regions (awrajas) of Ethiopia. The data contain 92 variables, all listed in Table 3. The study had two specific goals: (1) to determine whether it is possible to estimate or predict the percentage of the sedentary population in need of food assistance, and (2) to understand the variability in the percentage of people in need (PPND) across awrajas and years. The dependent variable in the study is PPND.

The top rectangle in Figure 5 is the root node; it contains all 462 observations (N = 462), with an average PPND of 11 percent. (Over the six-year study period, an average of 11 percent of the population was in need of food assistance.) The regression tree produces 10 terminal nodes, that is, homogeneous groups or awraja strata. Each group is identified by a number from 1 to 10, and the specific path leading from the root node to a group's terminal node characterizes that group. In Figure 5, NDVI (normalized difference vegetation index) is a crude estimate of vegetation health and is used as an index of greenness. The possible range of NDVI values is -1 to 1, but its typical range is between -0.1 (a nongreen area) and 0.6 (a very green area): the higher the index, the greener the vegetation.

The first split of the root node is based on the long-term average NDVI. This split separates awrajas with less green vegetation from awrajas with very green vegetation, making the long-term average NDVI a powerfully discriminating variable for studying regional vulnerability. In awrajas with very green vegetation, average PPND is 3 percent, much lower than in awrajas with less green vegetation. The greener awrajas are further separated using the long-term average maximum NDVI of the main rainy season, a split that yields two terminal nodes: Group 9 and Group 10. Predicted PPND is 9 percent in Group 9 and 2 percent in Group 10. The low PPND for these two groups should not be surprising: it can be argued that these regions have better food supplies and, hence, better food accessibility than awrajas with less green vegetation. Indeed, these awrajas extend west, south, southwest, and northwest from central Ethiopia (Webb et al. 1994, Map 6.0) and include the country's surplus grain producers. Some awrajas in Group 9, however, represent pockets of vulnerability within this surplus-producing region.

Awrajas in Groups 1 through 8 have at least one characteristic in common: they all descend from the node of awrajas with a less green vegetation index (long-term average NDVI ≤ 0.335). Group 1 awrajas are characterized by low long-term average NDVI, low sheep-to-maize terms of trade, and a low coefficient of variation of dry-season NDVI. There are 13 awrajas at this terminal node, with a predicted PPND of 14 percent. The low long-term average NDVI suggests that long-term annual average rainfall in these awrajas is very low and crop production is limited, an inference supported by the low sheep-to-maize terms of trade: a household can buy only 31.4 kilograms or less of maize with one sheep, indicating that maize is scarce in these areas. These awrajas are in the southern Gamo Gofa, northeastern Shoa, northeastern Bale, and western Hararge regions of Ethiopia, where rainfall is generally far below the national average.

Awrajas in Group 2 and Group 3 are both characterized by low long-term average NDVI, low sheep-to-maize terms of trade, a high coefficient of variation of dry-season NDVI, and a low density of all-weather roads per square kilometer. They are distinguished from each other only by household size: Group 2 awrajas have smaller households than those in Group 3. For the three awrajas in Group 2, predicted PPND is 74 percent; for the 21 awrajas in Group 3, predicted PPND is 23 percent. The awrajas in these two groups are located in the southern Bale, southern Sidamo, eastern Gondar, western Wollo, northeastern Wollo, and northern Hararge regions of Ethiopia. The transportation network in these regions is limited by the topography, so it is not surprising that CART characterizes the two groups as low in the density of all-weather roads. These regions are also known to be among the most vulnerable to famine in Ethiopia. The remaining terminal nodes can be analyzed in the same way.

Figure 5 displays the power of CART analysis, as did Figure 1. CART has successfully identified 10 groups of awrajas using only 9 of the 92 variables submitted for analysis (Table 3). Each group is identified by the path that begins at the root node and ends at its terminal node, and the 9 variables, along with their split points, carry all the information needed to differentiate the groups of awrajas from one another.

The Steps to Building a Regression Tree

The mechanism for building a regression tree is similar to that for a classification tree, but with a regression tree there is no need to specify priors and misclassification costs, and the dependent variable is numeric or continuous. The splitting criterion employed is the within-node sum of squares of the dependent variable, and the goodness of a split is measured by the decrease achieved in the weighted sum of squares. Splitting criteria are discussed in detail below. The following list highlights the key steps in constructing a regression tree; an illustrative sketch in code follows the list.

1. Starting with the root node, CART performs all possible splits on each of the predictor variables, applies a predefined node impurity measure to each split, and determines the reduction in impurity that is achieved.

2. CART then selects the "best" split by applying the goodness-of-split criteria and partitions the data set into left- and right-child nodes.

3. Because CART is recursive, it repeats steps 1 and 2 for each of the nonterminal nodes and produces the largest possible tree.

4. Finally, CART applies its pruning algorithm to the largest tree and produces a sequence of subtrees of different sizes from which an optimal tree is selected.
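As a rough illustration of steps 1 through 3 (tree growing only; the pruning of step 4 is omitted), the following Python sketch searches all splits of the form Is X ≤ d?, scores each by the reduction in within-node sum of squares, and partitions recursively. All function and variable names are invented for the example; this is not CART's own code.

# Illustrative sketch of steps 1-3 (growing the largest tree; no pruning).
import numpy as np

def best_split(X, y):
    """Try every split 'Is X_j <= d?' and return the one with the
    largest reduction in within-node sum of squares."""
    n, p = X.shape
    parent_ss = ((y - y.mean()) ** 2).sum()
    best = None
    for j in range(p):
        for d in np.unique(X[:, j])[:-1]:        # candidate split points
            left = X[:, j] <= d
            yl, yr = y[left], y[~left]
            child_ss = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
            gain = parent_ss - child_ss          # decrease in impurity
            if best is None or gain > best[0]:
                best = (gain, j, d)
    return best                                   # (gain, variable, split point)

def grow(X, y, min_cases=5):
    """Recursively partition until nodes are too small or too pure to split."""
    found = best_split(X, y) if len(y) >= min_cases else None
    if found is None or found[0] <= 0:
        return {"predict": y.mean(), "n": len(y)}  # terminal node: node mean
    _, j, d = found
    left = X[:, j] <= d
    return {"var": j, "point": d,
            "left": grow(X[left], y[left], min_cases),
            "right": grow(X[~left], y[~left], min_cases)}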

Splitting Rules and Goodness-of-Fit Criteria

There are two splitting rules or impurity functions for a regression tree: (1) the Least Squares (LS) function and (2) the Least Absolute Deviation (LAD) function. Since the mechanism for both rules is the same, only the LS impurity measure is described here. Under the LS criterion, node impurity is measured by the within-node sum of squares, SS(t), defined as

SS(t) = \sum_{i=1}^{N_t} \left( y_i(t) - \bar{y}(t) \right)^2,

where y_i(t) denotes the individual values of the dependent variable at node t, N_t is the number of cases at node t, and

\bar{y}(t) = \frac{1}{N_t} \sum_{i=1}^{N_t} y_i(t)

is the mean of the dependent variable at node t. Given the impurity function SS(t) and a split s that sends cases to left (t_L) and right (t_R) child nodes, the goodness of the split is measured by the function

\Delta(s, t) = SS(t) - SS(t_L) - SS(t_R),

where SS(t_L) is the sum of squares of the left child node and SS(t_R) is the sum of squares of the right child node.

The best split is the one for which \Delta(s, t) is highest: from the series of splits generated by a variable at a node, the rule chooses the split that yields the maximum reduction in the impurity of the parent node.
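A small worked computation may help fix ideas. The Python snippet below (with illustrative numbers only) computes SS(t) for a parent node and the reduction Δ(s, t) achieved by one candidate split.

# Worked computation of SS(t) and the split criterion (illustrative data).
import numpy as np

y = np.array([2.0, 3.0, 4.0, 10.0, 11.0, 12.0])   # dependent variable at node t

def ss(v):
    return ((v - v.mean()) ** 2).sum()             # within-node sum of squares

# Split s sends the first three cases left and the rest right.
y_left, y_right = y[:3], y[3:]
delta = ss(y) - ss(y_left) - ss(y_right)
print(ss(y), ss(y_left), ss(y_right), delta)       # 100.0 2.0 2.0 96.0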

An alternative to SS(t) is to use the weighted variance of the left and right nodes, where the weights are the proportions of cases at nodes t_L and t_R. Let p(t) = N_t/N be the proportion of cases at node t, and let s^2(t) be the variance of the dependent variable at node t, defined as

s^2(t) = \frac{1}{N_t} \sum_{i=1}^{N_t} \left( y_i(t) - \bar{y}(t) \right)^2.

The goodness of a split is now measured by

f(s, t) = s^2(t) - \left[ p_L s^2(t_L) + p_R s^2(t_R) \right].

The best split is the one for which f(s, t) is highest or, equivalently, for which the weighted sum of the variances, p_L s^2(t_L) + p_R s^2(t_R), is smallest. (With p_L and p_R taken as the proportions of the node's cases sent left and right, f(s, t) = \Delta(s, t)/N_t, so the two criteria select the same split.) The procedure separates high values of the dependent variable from low values and results in left and right nodes that are internally more homogeneous than the parent node. Note that as each split sends observations to the left and right nodes, the mean of the dependent variable in one of the resulting nodes is lower than the mean at the parent node (see the example in Figure 5).

TREE PRUNING

After building the largest possible tree, CART applies its pruning algorithm, using either cross-validation or an independent test sample to measure the goodness of fit of each tree. Under LS, Mean Squared Error (MSE) measures the accuracy of the predictor and ranks the sequence of trees generated by pruning; under LAD, Mean Absolute Deviation (MAD) is used. Once a minimal-cost tree (the tree with the lowest MSE or MAD) is identified, an optimal tree is chosen by applying the one-standard-error rule to the minimal-cost tree. The one-standard-error rule is optional and can be changed by the analyst.
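The sketch below shows how such a pruning-and-selection step might look in code. It assumes scikit-learn, whose minimal cost-complexity pruning follows the CART approach; the cross-validated MSE and the one-standard-error rule are written out by hand, and the function name is invented for the example.

# Sketch: prune by cross-validated MSE, then apply the one-SE rule.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def prune_one_se(X, y, cv=10):
    # Sequence of subtrees is indexed by the complexity parameter alpha.
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    scores = []
    for alpha in path.ccp_alphas:
        tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
        mse = -cross_val_score(tree, X, y, cv=cv,
                               scoring="neg_mean_squared_error")
        scores.append((mse.mean(), mse.std() / np.sqrt(cv), alpha))
    best_mse, best_se, _ = min(scores)             # minimal-cost tree
    # One-standard-error rule: pick the smallest tree (largest alpha)
    # whose CV error is within one standard error of the minimum.
    alpha = max(a for m, s, a in scores if m <= best_mse + best_se)
    return DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)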

After choosing an optimal tree, or any subtree from the sequence generated in the pruning process, CART computes summary statistics for each terminal node. If LS is the splitting rule, CART computes the mean and standard deviation of the dependent variable; the terminal-node mean becomes the predicted value of the dependent variable for cases in that node. If LAD is selected, CART reports the median and average absolute deviation of the dependent variable; as with LS, the median becomes the predicted value for that terminal node.

This way of generating predictions may seem crude to those familiar with predictions from parametric models. But CART regression predictions are arrived at by recursively splitting the sample, creating groups or clusters that are progressively more homogeneous than their ancestor nodes. Breiman et al. (1984) suggest running OLS models within each group created by the regression tree and comparing the OLS predictions against each other. A considerable difference between the predicted values of the OLS models for the groups indicates that CART has succeeded in uncovering the complex structure existing in the data set.
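A sketch of that suggestion, again assuming scikit-learn (the function name and the choice of ten terminal nodes are illustrative):

# Sketch: fit an OLS model within each terminal node of a regression
# tree and compare the fitted coefficients across nodes.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

def ols_within_leaves(X, y, max_leaf_nodes=10):
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, y)
    leaves = tree.apply(X)                         # terminal-node id per case
    models = {}
    for node in np.unique(leaves):
        mask = leaves == node
        if mask.sum() > X.shape[1] + 1:            # enough cases to fit OLS
            models[node] = LinearRegression().fit(X[mask], y[mask])
    # Markedly different coefficients across nodes suggest the tree has
    # uncovered distinct structure in different parts of the data.
    return {node: m.coef_ for node, m in models.items()}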