
Classification and Regression Trees, Cart TM - A user manual for identifying indicators of vulnerability to famine and chronic food insecurity + disk (IFPRI, 1999, 56 p.)
CART software is currently available for different platforms, as shown in Table 4. Details on the current versions of CART software that are compatible with different platforms may be obtained from the vendor listed in Table 4.
The software comes with two thorough manuals that are easy to follow. The first manual (Steinberg and Colla 1995) provides a comprehensive background and conceptual basis for understanding CART. It also discusses the art of tree-structured data analysis, provides detailed listings and explanations of CART commands in SYSTAT syntax, and explains how to use CART techniques and interpret results. Even though CART commands are in SYSTAT syntax, CART software is a stand-alone application that does not need SYSTAT software. The second manual (Steinberg, Colla, and Martin 1998) is for the Windows operating systems (Windows 3.x and Windows 95/NT). Its detailed tutorial covers the use of menus, the mouse, the graphic interface, and many other features that are specific to the Windows version.
The graphic interface of the Windows version is an extremely useful tool for CART data analysts. It lets CART show the tree topology and the quality of an optimal tree at the same time, through a graphic display of the relative costs of trees versus the number of terminal nodes. CART's node navigator feature enables the analyst to immediately perform exploratory work on trees of different sizes and determine node summary information for each examined tree. Thus the analyst can inspect different trees immediately if the optimal tree proves unsatisfactory. Any tree can be inspected by clicking on a tree from the series displayed graphically in the lower panel of the node navigator. Node summary information for each tree can be generated at the level of detail desired. The results are displayed graphically in the form of an inverted tree. This is an improvement over earlier versions of CART, in which tree-structured graphs had to be produced manually. In the Windows version the analyst is not limited to using menus. The analyst can also write CART commands in batch mode and submit them for analysis while making use of all other features available in Windows.
The rest of this chapter introduces basic CART commands and batch mode programs written in SYSTAT syntax. A few basic CART commands are provided in Table 5. For greater detail about CART commands, the reader should refer to Steinberg and Colla (1995) or contact the vendor listed in Table 4.
Table 4 - Hardware and software requirements of CART for personal computers
| Hardware and software | |
| Hardware requirements: | Intel PCs, SUN, SGI, HP, Digital Alpha and VAX, IBM RS/6000 |
| Operating systems supported: | Windows 3.x, Windows 95, Windows NT, MacOS, UNIX, IBM MVS and CMS |
| Memory requirements: | May vary with the version of CART software. CART for Windows is compiled for machines with at least 32 megabytes of RAM. For optimal performance, Pentium machines with at least 32 megabytes of RAM are recommended. |
| Hard disk space: | At least 10 megabytes for software storage |
| Company name: | Salford Systems |
| Address: | 8888 Rio San Diego Dr., Suite 1045, San Diego, California 92108, U.S.A. |
| Web address: | http://www.salford-systems.com |
| Telephone: | (619) 543-8880 |
| Fax: | (619) 543-8888 |
| Technical support: | Available by telephone, fax, or letter. |
| Number of variables and observations: | Computing requires a minimum of 16 megabytes of free memory. The number of observations and variables supported depends on the available memory. |
Source: Fax message received from Salford Systems, February 1998, and http://www.salford-systems.com/technical-CART.html, July 9, 1998.
PREPARATION OF CART DATA FILES
CART can read and process only data files that are in SYSTAT format. Therefore, the data for analysis should be prepared in SYSTAT. Data in other formats should be converted to SYSTAT format using either DBMS/COPY or the translation utility that comes with CART software.
ACCESSING CART
CART can be invoked in two ways. The DOS version can be accessed by typing CART at the prompt of the operating system and pressing the enter key. In the Windows version, CART is invoked by double-clicking on the CART icon.
CART COMMANDS IN BATCH MODE
CART commands should be written in SYSTAT syntax using any available editor. The following commands produce a classification tree.
USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
BUILD
Table 5 - Basic CART software commands in SYSTAT
| Command | Syntax | Function (purpose) | Examples |
| USE | USE filename | Specifies a file to read | USE c:CART est1.sys |
| EXCLUDE | EXCLUDE variable list | Excludes from the file the variables not needed in the analysis | EXCLUDE hhid code |
| KEEP | KEEP variable list | Reads from the file only the variables needed in the analysis | KEEP age sex income |
| CATEGORY | CATEGORY variable list | Specifies the list of categorical variables in the data set, including the dependent variable; this is compulsory in a classification tree | CATEGORY sex |
| MODEL | MODEL variable name | Specifies the dependent variable | MODEL vulner10 |
| BUILD | BUILD | Tells CART to produce a tree | BUILD |
| QUIT | QUIT | If submitted while in BUILD, it tells CART to quit the session; if submitted after a CART session, it tells CART to go to DOS. | |
| SELECT | SELECT variable name relational operator constant/character | Selects a subset of the data set for analysis | SELECT age>15 |
| SELECT | SELECT variable name relational operator constant/character, variable name relational operator constant/character | Selects a subset of the data set for analysis | SELECT age > 15, |
| PRIORS | PRIORS option (choose 1 option only) | Specifies which priors to use | PRIORS data |
| MISCLASS COST | MISCLASS COST=n | Assigns nonunit misclassification costs | Misclass cost=2 |
| METHOD | METHOD=option | Specifies the splitting rule | Method=gini (default) |
| OUTPUT | OUTPUT filename | Sends output to a named file | OUTPUT=LMS |
| TREE | TREE tree file name | Specifies a file name of a tree to save | TREE vulner1 |
| SAVE | SAVE filename options | Specifies the file name of a data set with predicted class(es); select options to save | SAVE predct1 |
| CASE | CASE options | Runs data one by one down a tree; select option(s) to use | CASE |
The four lines of the basic classification program given before Table 5 (USE, CATEGORY, MODEL, and BUILD) are mandatory. They are the only commands needed to produce a classification tree. For a regression tree, the CATEGORY command line is not needed at all, and the dependent variable that follows the MODEL command should be a continuous variable. To produce a regression tree, the only three commands needed are USE, MODEL, and BUILD. Examples of regression-tree command lines are provided toward the end of this chapter.
The data analyst has many options to modify this program. All optional command lines are additions to this basic program. Any optional command line(s) should be entered before the BUILD command. For example, if the analyst wants to save the output to a file, the OUTPUT command should be inserted as follows:
Syntax: OUTPUT 'd:cart 1989any name'
With the addition of the OUTPUT command, the entire program would read:
USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
OUTPUT 'D:CART1989VPDAT.DAT'
BUILD
The OUTPUT command sends the output results to a file named VPDAT.DAT.
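The ordering rule described above (optional command lines go before BUILD) can be sketched with a small script that assembles the text of a batch program. This is only an illustration in Python: the helper name `build_cart_program` is hypothetical, the file names are shortened placeholders, and the script only arranges text. It does not call CART.

```python
def build_cart_program(data_file, target, categorical=True, optional_lines=()):
    """Assemble the lines of a CART batch program (hypothetical helper).

    It only arranges text in the order the manual describes;
    it does not run CART.
    """
    lines = [f"USE '{data_file}'"]
    if categorical:
        # CATEGORY is compulsory for a classification tree.
        lines.append(f"CATEGORY {target}")
    lines.append(f"MODEL {target}")
    # Optional commands (OUTPUT, PRIORS, ...) are placed before BUILD.
    lines.extend(optional_lines)
    lines.append("BUILD")
    return lines


# Placeholder file names, used for illustration only.
program = build_cart_program(
    "POOLSUB5.SYS", "CUTDUM2", optional_lines=["OUTPUT 'VPDAT.DAT'"]
)
print("\n".join(program))
```

Passing `categorical=False` leaves out the CATEGORY line, which gives the three-command skeleton of a regression-tree program.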
PROGRAM REFINEMENTS
Sometimes the initial program may not produce a satisfactory tree. In such cases, the program can be modified in a number of ways. The easiest way is to change either priors or misclassification costs or both. If priors are not specified by the analyst, the default is priors equal. The analyst can also change the default splitting rules, the one-standard-error rule, the complexity parameter, and so on. This manual covers only the simplest options.
Refinement 1
The default priors can be changed by choosing either PRIORS DATA or PRIORS MIXED and adding it into the batch program. For example, if PRIORS DATA is chosen, the modified program will look like this:
USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
PRIORS DATA
OUTPUT 'D:CART1989VPDAT.DAT'
BUILD
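To make the three named priors settings concrete, the following Python sketch computes class priors from a list of class labels. It is an illustration, not CART's implementation. In particular, it assumes that "mixed" means the average of the "equal" and "data" priors, which should be checked against Steinberg and Colla (1995). The function name `class_priors` and the sample labels are hypothetical.

```python
from collections import Counter


def class_priors(labels, option="equal"):
    """Return a dict of class priors under one of three settings.

    Assumption: "mixed" is taken to be the average of "equal" and "data".
    """
    counts = Counter(labels)
    classes = sorted(counts)
    n = len(labels)
    equal = {c: 1.0 / len(classes) for c in classes}
    data = {c: counts[c] / n for c in classes}  # sample proportions
    if option == "equal":
        return equal
    if option == "data":
        return data
    if option == "mixed":
        return {c: (equal[c] + data[c]) / 2 for c in classes}
    raise ValueError(f"unknown priors option: {option}")


# Hypothetical sample: 80 cases of Class 0 and 20 cases of Class 1.
labels = [0] * 80 + [1] * 20
for option in ("equal", "data", "mixed"):
    print(option, class_priors(labels, option))
```

With an unbalanced sample like this one, priors data weights the classes by how often they occur, while priors equal treats them as equally likely in the population.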
Refinement 2
In addition to changing priors to "data" or "mixed," the analyst can also incorporate external information into the program by assigning explicit values to priors. In such cases, the underlying assumption is that the distribution of observations into classes of the dependent variable may occur in proportions other than priors equal, priors data, or priors mixed.
For example, in a two-class problem, the analyst may assign
PRIORS =.2,.8, or
PRIORS = 1, 5, or
PRIORS = 1.2, 1, and so on.
The last of these settings says that the proportion of Class 0 cases in the population from which the sample is drawn is 20 percent higher than the proportion of Class 1 cases.
With these changes, the program looks like this:
USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
PRIORS = 1, 5
OUTPUT 'D:CART1989VPDAT.DAT'
BUILD
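The explicit priors in the examples above are written on different scales (.2, .8 or 1, 5 or 1.2, 1). The Python sketch below assumes that such values act as relative weights and rescales them to sum to 1. Whether CART performs exactly this rescaling internally is an assumption here, and `normalize_priors` is a hypothetical name.

```python
def normalize_priors(weights):
    """Rescale nonnegative relative weights so that they sum to 1."""
    if any(w < 0 for w in weights):
        raise ValueError("priors must be nonnegative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one prior must be positive")
    return [w / total for w in weights]


for spec in ([0.2, 0.8], [1, 5], [1.2, 1]):
    print(spec, "->", normalize_priors(spec))
```

Under this reading, PRIORS = 1.2, 1 gives Class 0 a prior of 1.2/2.2 and Class 1 a prior of 1/2.2, so the Class 0 prior is 20 percent higher than the Class 1 prior.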
Refinement 3
So far, the analysis is based on equal or unit misclassification costs, which is the default setting. This setting can be changed by imposing severe costs for misclassifying certain serious cases. If a heart-attack patient is misclassified as a healthy individual during medical diagnosis, the cost is far more serious than the cost of classifying a healthy individual as a heart-attack patient. In vulnerability studies, classifying food-insecure households as food-secure is more costly than classifying food-secure households as food-insecure. Two options are available for reducing the misclassification of such serious cases.
1. Change the misclassification costs via altered priors. For example, suppose classifying Class 1 cases as Class 0 is three times more costly than classifying Class 0 cases as Class 1. This situation can be treated as if the distribution of Class 1 cases in the population is three times as large as that of Class 0. This information is entered in the PRIORS command line, and the entire batch program now reads as follows:
USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
PRIORS = 1, 3
OUTPUT 'D:CART1989VPDAT.DAT'
BUILD
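The altered-priors idea can be sketched numerically: each class prior is multiplied by the cost of misclassifying that class, and the results are rescaled to sum to 1. This Python sketch follows the standard two-class treatment and is not taken from CART's code. The function name `cost_adjusted_priors` is hypothetical, and the starting priors are assumed to be equal.

```python
def cost_adjusted_priors(priors, costs):
    """Fold per-class misclassification costs into the priors.

    priors[j] is the prior of class j; costs[j] is the cost of
    misclassifying a class j case. Returns rescaled priors that sum to 1.
    """
    weighted = [p * c for p, c in zip(priors, costs)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Equal priors; misclassifying Class 1 is three times as costly.
altered = cost_adjusted_priors([0.5, 0.5], [1, 3])
print(altered)
```

Starting from equal priors, a 1-to-3 cost ratio yields the same relative weights as PRIORS = 1, 3 in the program above.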
2. Introduce misclassification costs explicitly into the command line.
Example:
MISCLASS COST = 5 CLASSIFY 0 AS 1,
COST = 2 CLASSIFY 1 AS 0
This means that the cost of classifying a Class 0 case as Class 1 is 5, while the cost of classifying a Class 1 case as Class 0 is 2. The example associates a different penalty, or cost, with each type of misclassification error.
With these additions, the program looks like the following:
USE 'D:CARATPOOLSUB5.SYS'
EXCLUDE SITE HHID
CATEGORY CUTDUM2
MODEL CUTDUM2
PRIORS DATA
MISCLASS COST = 5 classify 0 as 1,
COST = 2 classify 1 as 0
OUTPUT 'D:CARATVPDAT.DAT'
BUILD
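The effect of explicit misclassification costs can be illustrated with the usual minimum-expected-cost rule: a node is assigned the class whose expected misclassification cost is lowest. The Python sketch below uses the costs from the example above (5 for classifying a Class 0 case as Class 1, 2 for classifying a Class 1 case as Class 0). The node probabilities are made up, and `assign_class` is a hypothetical name rather than a CART function.

```python
def assign_class(class_probs, cost):
    """Pick the class with the lowest expected misclassification cost.

    class_probs[i] is the probability of true class i in the node.
    cost[(i, j)] is the cost of classifying a class i case as class j;
    correct classifications cost 0.
    """
    classes = sorted(class_probs)
    expected = {
        j: sum(class_probs[i] * cost.get((i, j), 0.0) for i in classes if i != j)
        for j in classes
    }
    best = min(classes, key=lambda j: expected[j])
    return best, expected


cost = {(0, 1): 5.0, (1, 0): 2.0}
# Hypothetical node in which 70 percent of cases belong to Class 1.
label, expected = assign_class({0: 0.3, 1: 0.7}, cost)
print(label, expected)
```

In this made-up node the majority class is Class 1, yet the node is labeled Class 0, because calling the node Class 1 would misclassify the Class 0 cases at the higher cost of 5.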
Refinement 4
This refinement involves the MODEL command. The analyst may limit the number of variables in the analysis by explicitly specifying the model as in a parametric model. This option is helpful especially in cases where it may not be possible to access a computer with a large memory.
Example: MODEL CUTDUM2 = NCERYL80 + PCLSU80 + GINI + PCDCALS + FARMING + HHSIZE.
One can also use the EXCLUDE command to exclude variables that are not needed in the analysis.
Refinement 5
The data analyst may change the default splitting rule (the Gini criterion) by using the METHOD command. For example, METHOD = LINEAR changes the default splitting criterion to linear combination splits. In this case, the METHOD command should follow the MODEL command. Under this splitting criterion, CART assumes that all of the variables in the linear combination are numeric. Therefore, unless categorical variables are transformed into sets of dummy variables, they will be treated as numeric variables.
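For reference, the default Gini rule mentioned above can be sketched as follows: the impurity of a node is one minus the sum of squared class proportions, and a split is scored by how much it reduces the weighted impurity. This Python sketch weights nodes by their sample sizes, which corresponds to using sample proportions as priors. It is an illustration of the general criterion, not CART's internal code, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_improvement(parent, left, right):
    """Decrease in impurity when parent is split into left and right."""
    n = len(parent)
    return (
        gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right)
    )


parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                                     # impurity before the split
print(split_improvement(parent, [0, 0, 0], [1, 1, 1]))  # a perfect split
print(split_improvement(parent, [0, 0, 1], [0, 1, 1]))  # a weak split
```

A splitting rule of this kind examines candidate splits and keeps the one with the largest improvement.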
REGRESSION TREE PROGRAM CODES
The commands needed for producing a regression tree are basically the same as those for a classification tree. There is no need to specify the CATEGORY and MISCLASS COST commands in regression tree programs. As pointed out earlier, the three basic commands needed for producing a regression tree are USE, MODEL, and BUILD.
Consider the following typical regression-tree programs:
(A)
USE 'D:CARATYEAR8187.SYS'
MODEL PPND = NDVIMNMX KRMTMNMN NDVIMXMX KRMTMXMN BEGAMNMN BEGAMXMN MZSHTTRD MZSHTTDV BEGAMNDV BELGMNDV KRMTMXDV
OUTPUT 'D:CARATYEAR8187.OUT'
BUILD
(B)
USE 'D:CARATYEAR8187.SYS'
MODEL PPND
OUTPUT 'D:CARATYRO1.OUT'
BUILD
As with classification trees, the OUTPUT command is optional. The analyst can modify this basic program by adding any of the available optional command lines. In example (A), the dependent and independent variables are specified in the MODEL command. This option is useful in situations where access to a computer with large memory is limited. Example (B) uses all of the available variables in the data set to produce a regression tree. This option is especially useful if the analyst does not have any prior information about which predictors or potential predictors to use in the model.
SAVING CART TREES FOR FUTURE USE
It may be useful to recall that the main objective of running either classification or regression trees is to use the resulting tree for classifying data or predicting the class of a new observation. CART does this by dropping the data down the tree case by case, beginning from the root node. At each stage the splitting criteria are applied until each observation ends up in one of the terminal nodes. This task is accomplished by using only the USE, TREE, SAVE, and CASE commands. It should be noted that the extension of the filename created by the TREE command is always TR1 and cannot be changed. The complete program for building and saving a tree is as follows:
USE 'D:CARATPOOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
TREE SECUR1
BUILD
The TREE command produces a file called SECUR1.TR1.
Suppose the analyst has a new data set called DATANEW.SYS, which contains the characteristics of new cases with an unknown class distribution. The analyst now wants to run these data down the saved tree (SECUR1.TR1) to find out the classes into which the new cases fall, and to save the case-by-case results in a data file called PREDCT.SYS. Using the CASE command line, this is written as follows:
USE 'D:DATANEW.SYS'
TREE SECUR1
SAVE PREDCT/SINGLE
IDVAR HHID
CASE
The IDVAR command line adds the identification variable (HHID) to the file PREDCT.SYS, which is created by the CASE command. The contents of the PREDCT.SYS file include the original variables used in the model and a few new variables created by CART. The RESPONSE and CORRECT variables are the most useful of the new variables. The RESPONSE variable contains the class assigned to an observation by CART. The CORRECT variable is an indicator variable. It equals 1 for correct prediction and 0 for incorrect prediction.
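The case-by-case procedure and the RESPONSE and CORRECT variables described above can be mimicked in a few lines of Python. Everything in this sketch is hypothetical: the tiny hand-written tree, its cut points, and the sample cases are invented for illustration, and the real contents of a .TR1 file are not represented. The variable names HHID, HHSIZE, FARMING, and CUTDUM2 are borrowed from the examples in this chapter.

```python
# Hypothetical tree: internal nodes test "variable <= cut"; leaves hold a class.
TREE = {
    "var": "HHSIZE", "cut": 6,
    "left": {"class": 0},
    "right": {
        "var": "FARMING", "cut": 0.5,
        "left": {"class": 1},
        "right": {"class": 0},
    },
}


def drop_down(case, node):
    """Send one case from the root to a terminal node; return its class."""
    while "class" not in node:
        node = node["left"] if case[node["var"]] <= node["cut"] else node["right"]
    return node["class"]


def score_cases(cases, tree, id_var="HHID", actual_var="CUTDUM2"):
    """Build one output row per case with RESPONSE and, if known, CORRECT."""
    rows = []
    for case in cases:
        row = {id_var: case[id_var], "RESPONSE": drop_down(case, tree)}
        if actual_var in case:
            # CORRECT can only be computed when the true class is known.
            row["CORRECT"] = int(row["RESPONSE"] == case[actual_var])
        rows.append(row)
    return rows


cases = [
    {"HHID": 1, "HHSIZE": 4, "FARMING": 1, "CUTDUM2": 0},
    {"HHID": 2, "HHSIZE": 8, "FARMING": 0, "CUTDUM2": 0},
    {"HHID": 3, "HHSIZE": 9, "FARMING": 1},  # true class unknown
]
for row in score_cases(cases, TREE):
    print(row)
```

In this sketch a CORRECT value is produced only when the true class is present in the data, which is why the third case carries a RESPONSE but no CORRECT.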