Classification and Regression Trees, Cart TM - A user manual for identifying indicators of vulnerability to famine and chronic food insecurity + disk (IFPRI, 1999, 56 p.)

5. CART Software and Program Codes

CART software is currently available for different platforms, as shown in Table 4. Details on the current versions of CART software that are compatible with different platforms may be obtained from the vendor listed in Table 4.

The software comes with two completely documented manuals that are easy to follow. The first manual (Steinberg and Colla 1995) provides a comprehensive background and conceptual basis for understanding CART. It also discusses the art of tree-structured data analysis, provides detailed listings and explanations of CART commands in SYSTAT syntax, and explains how to use CART techniques and interpret results. Even though CART commands are in SYSTAT syntax, CART software is a stand-alone application that does not need SYSTAT software. The second manual (Steinberg, Colla, and Martin 1998) is for the Windows operating systems (Windows 3.x and Windows 95/NT). A detailed tutorial covers the use of menus, the mouse, the graphic interface, and many other features that are specific to the Windows version.

The graphic interface feature of Windows is an extremely useful tool for CART data analysts. Windows enables CART simultaneously to show tree topology and the quality of an optimal tree through a graphic display of relative costs of trees versus the number of terminal nodes. CART's node navigator feature enables the analyst to immediately perform exploratory work on trees of different sizes and determine node summary information for each examined tree. Thus the analyst can inspect different trees immediately in case the optimal tree becomes unsatisfactory. Any tree can be inspected by clicking on a tree from the series displayed graphically at the lower panel of the node navigator. Node summary information for each tree can be generated for the level of detail desired. The results are displayed graphically in the form of an inverted tree. This is an improvement over earlier versions of CART, in which tree-structured graphs had to be produced manually. In the Windows version the analyst is not limited to using only menus. He/she can write CART commands in batch mode and submit them for analysis while making use of all other features available in Windows.

The rest of this chapter introduces basic CART commands and batch mode programs written in SYSTAT syntax. A few basic CART commands are provided in Table 5. For greater detail about CART commands, the reader should refer to Steinberg and Colla (1995) or contact the vendor listed in Table 4.

Table 4 - Hardware and software requirements of CART for personal computers

| Hardware and software | Requirement or detail |
|---|---|
| Hardware requirements | Intel PCs, SUN, SGI, HP, Digital Alpha and VAX, IBM RS/6000 |
| Operating systems supported | Windows 3.x, Windows 95, Windows NT, MacOS, UNIX, IBM MVS and CMS |
| Memory requirements | May vary with versions of CART software. CART for Windows is compiled for machines with at least 32 megabytes of RAM. For optimal performance, Pentium machines with at least 32 megabytes of RAM are recommended. |
| Hard disk space | At least 10 megabytes for software storage |
| Company name | Salford Systems |
| Address | 8888 Rio San Diego Dr., Suite 1045, San Diego, California 92108, U.S.A. |
| Web address | http://www.salford-systems.com |
| Telephone | (619) 543-8880 |
| Fax | (619) 543-8888 |
| Technical support | Available by telephone, fax, or letter |
| Number of variables and observations | Computing requires a minimum of 16 megabytes of free memory. The number of observations and variables supported depends on the available memory. |

Source: Fax message received from Salford Systems, February 1998, and http://www.salford-systems.com/technical-CART.html, July 9, 1998.

PREPARATION OF CART DATA FILES

CART can only read and process data files that are in SYSTAT format. Therefore, the data for analysis should be prepared in SYSTAT. If data are in other formats, they should be converted to a SYSTAT format using either DBMSCOPY or the translation utility that comes with CART software.

ACCESSING CART

CART can be invoked in two ways. The DOS version can be accessed by typing CART at the prompt of the operating system and pressing the enter key. In the Windows version, CART is invoked by double-clicking on the CART icon.

CART COMMANDS IN BATCH MODE

CART commands should be written in SYSTAT syntax using any available editor. The following commands produce a classification tree.

USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
BUILD

Table 5 - Basic CART software commands in SYSTAT

| Command | Syntax | Function (purpose) | Examples |
|---|---|---|---|
| USE | USE filename | Specifies a file to read | USE c:CART est1.sys |
| EXCLUDE | EXCLUDE variable list | Excludes from file the variables not needed in the analysis | EXCLUDE hhid code |
| KEEP | KEEP variable list | Reads from the file only the variables needed in the analysis | KEEP age sex income |
| CATEGORY | CATEGORY variable list | Specifies list of categorical variables in the data set, including the dependent variable; this is compulsory in a classification tree | CATEGORY sex |
| MODEL | MODEL variable name | Specifies dependent variable | MODEL vulner10 |
| BUILD | BUILD | Tells CART to produce a tree | BUILD |
| QUIT | QUIT | If submitted while in BUILD, it tells CART to quit the session; if submitted after CART session, it tells CART to go to DOS. | |
| SELECT | SELECT variable relational-operator constant/character | Selects a subset of the data set for analysis | SELECT age>15; SELECT sex=1; SELECT X>=20; SELECT x1='M' |
| SELECT | SELECT variable relational-operator constant/character, variable relational-operator constant/character | Selects a subset of the data set for analysis | SELECT age > 15, Wage > 300 |
| PRIORS | PRIORS option (choose 1 option only) | Specifies which PRIORS to use | PRIORS data; PRIORS equal; PRIORS mixed; PRIORS=n1,n2,...,na (n's are real numbers) |
| MISCLASS COST | MISCLASS COST=n classify i as k1,k2,k3/, Cost=m classify i as k1/, Cost=1 classify k1,k2,...,kn as x | Assigns nonunit misclassification costs | Misclass cost=2 classify 1 as 2,3,4/, Cost=5 classify 3 as 1 Cost=3 classify 1,2,3 as 4 |
| METHOD | METHOD=option (choose 1 option only) | Specifies splitting rule | Method=gini (default) or Method=twoing or Method=LS or LAD or Method=LINEAR |
| OUTPUT | OUTPUT filename | Sends output to a named file | OUTPUT=LMS |
| TREE | TREE tree file name | Specifies a file name of a tree to save | TREE vulner1 |
| SAVE | SAVE filename options | Specifies file name of a data set with predicted class(es), select options to save | SAVE predct1 |
| CASE | CASE options | Runs data one by one down a tree, select option(s) to use | CASE |

The four lines of the basic program (USE, CATEGORY, MODEL, and BUILD) are mandatory. They are the only commands needed to produce a classification tree. For a regression tree, the CATEGORY command line is not needed at all, and the dependent variable that follows the MODEL command should be a continuous variable. To produce a regression tree, the only three commands needed are USE, MODEL, and BUILD. Examples of regression-tree command lines are provided toward the end of this chapter.

The data analyst has many options to modify this program. All optional command lines are additions to this basic program. Any optional command line(s) should be entered before the BUILD command. For example, if the analyst wants to save the output to a file, the OUTPUT command should be inserted as follows:

Syntax: OUTPUT 'd:cart 1989any name'

With the addition of the OUTPUT command, the entire program would read:

USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
OUTPUT 'D:CART1989VPDAT.DAT'
BUILD

The OUTPUT command sends the output results to a file named VPDAT.DAT.
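The ordering rule described above (optional lines are added to the basic program and must come before BUILD) can be sketched with a small helper that assembles a batch file as text. This is only an illustration: the function name `build_batch` and its parameters are hypothetical, and the file paths are copied from the examples in this chapter exactly as printed.

```python
def build_batch(use_path, model_var, category_vars=(), options=()):
    """Assemble a CART batch program as a list of command lines.

    Hypothetical helper: it only arranges text and does not run CART.
    """
    lines = [f"USE '{use_path}'"]
    if category_vars:
        # CATEGORY is needed for classification trees, omitted for regression trees.
        lines.append("CATEGORY " + " ".join(category_vars))
    lines.append(f"MODEL {model_var}")
    # Every optional command line is placed before BUILD.
    lines.extend(options)
    lines.append("BUILD")
    return lines


program = build_batch(
    "D:CART1989POOLSUB5.SYS",
    "CUTDUM2",
    category_vars=["CUTDUM2"],
    options=["OUTPUT 'D:CART1989VPDAT.DAT'"],
)
print("\n".join(program))
```

Calling the helper without `category_vars` and `options` yields the three-line USE, MODEL, BUILD program that a regression tree requires.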

PROGRAM REFINEMENTS

Sometimes the initial program may not produce a satisfactory tree. In such cases, the program can be modified in a number of ways. The easiest way is to change either priors or misclassification costs or both. If priors are not specified by the analyst, the default is priors equal. The analyst can also change the default splitting rules, the one-standard-error rule, the complexity parameter, and so on. This manual covers only the simplest options.

Refinement 1

The default priors can be changed by choosing either PRIORS DATA or PRIORS MIXED and adding it into the batch program. For example, if PRIORS DATA is chosen, the modified program will look like this:

USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
PRIORS DATA
OUTPUT 'D:CART1989VPDAT.DAT'
BUILD
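To make the three named priors options concrete, the following sketch computes class priors from a list of class labels. It assumes that "equal" gives every class the same prior, that "data" uses the class shares observed in the sample, and that "mixed" is the average of the two; the last point is an assumption about the option and should be checked against the CART manual. The function name `class_priors` and the sample data are hypothetical.

```python
from collections import Counter


def class_priors(labels, option="equal"):
    """Return a {class: prior} mapping for one of three illustrative options."""
    counts = Counter(labels)
    classes = sorted(counts)
    n = len(labels)
    equal = {c: 1.0 / len(classes) for c in classes}
    data = {c: counts[c] / n for c in classes}
    if option == "equal":
        return equal
    if option == "data":
        return data
    if option == "mixed":
        # Assumption: mixed priors are the average of equal and data priors.
        return {c: (equal[c] + data[c]) / 2 for c in classes}
    raise ValueError(f"unknown priors option: {option}")


# Hypothetical sample: 80 cases in Class 0 and 20 cases in Class 1.
sample = [0] * 80 + [1] * 20
for name in ("equal", "data", "mixed"):
    print(name, class_priors(sample, name))
```

With an unbalanced sample such as this one, the choice of option changes how much weight the rarer class receives when the tree is grown.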

Refinement 2

In addition to changing priors to "data" or "mixed," the analyst can also incorporate external information into the program by assigning explicit values to priors. In such cases, the underlying assumption is that the distribution of observations into classes of the dependent variable may occur in proportions other than priors equal, priors data, or priors mixed.

For example, in a two-class problem, the analyst may assign

PRIORS =.2,.8, or
PRIORS = 1, 5, or
PRIORS = 1.2, 1, and so on.

The last of these specifications says that the proportion of Class 0 cases in the population from which the sample is drawn is 20 percent higher than the proportion of Class 1 cases.

With these changes, the program looks like this:

USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
PRIORS = 1, 5
OUTPUT 'D:CART1989VPDAT.DAT'
BUILD
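The explicit priors in this refinement read most naturally as relative weights. As a rough sketch, assuming that only the ratios between the values matter, the three example specifications can be rescaled to proportions that sum to one. The function name `normalize_priors` is hypothetical, and this is not a description of CART's internal arithmetic.

```python
def normalize_priors(weights):
    """Rescale nonnegative relative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("priors must be nonnegative with a positive sum")
    return [w / total for w in weights]


# The three specifications given as examples in the text.
for spec in ([0.2, 0.8], [1, 5], [1.2, 1]):
    print(spec, [round(p, 3) for p in normalize_priors(spec)])
```

Under this reading, PRIORS = 1.2, 1 corresponds to roughly 54.5 percent for Class 0 and 45.5 percent for Class 1, which preserves the 20 percent difference in relative terms.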

Refinement 3

So far, the analysis is based on equal or unit misclassification costs, which is the default setting. This setting can be changed by imposing severe costs for misclassifying certain serious cases. If a heart-attack patient is misclassified as a healthy individual during medical diagnosis, the cost is far more serious than the cost of classifying a healthy individual as a heart-attack patient. In vulnerability studies, classifying food-insecure households as food-secure is more costly than classifying food-secure households as food-insecure. Two options are available for reducing the misclassification of such serious cases.

1. Change the misclassification costs via altered priors. For example, suppose classifying Class 1 cases as Class 0 is three times more costly than classifying Class 0 cases as Class 1. This situation can be treated as if the distribution of Class 1 cases in the population is three times as large as that of Class 0. This information is entered in the PRIORS command line, and the entire batch program now reads as follows:

USE 'D:CART1989POOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
PRIORS = 1, 3
OUTPUT 'D:CART1989VPDAT.DAT'
BUILD

2. Introduce misclassification costs explicitly into the command line.

Example:

MISCLASS COST = 5 CLASSIFY 0 AS 1,
COST = 2 CLASSIFY 1 AS 0

This means that the cost of classifying a Class 0 case as Class 1 is 5, while the cost of classifying a Class 1 case as Class 0 is 2. The example associates a different penalty or cost with each misclassification error.

With these additions, the program looks like the following:

USE 'D:CARATPOOLSUB5.SYS'
EXCLUDE SITE HHID
CATEGORY CUTDUM2
MODEL CUTDUM2
PRIORS DATA
MISCLASS COST = 5 classify 0 as 1,
COST = 2 classify 1 as 0
OUTPUT 'D:CARATVPDAT.DAT'
BUILD
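The effect of nonunit misclassification costs can be illustrated by how a terminal node's class might be chosen: the node is assigned the class with the lowest expected cost instead of simply the most probable class. The sketch below uses the costs from the example above (5 for classifying Class 0 as Class 1, 2 for classifying Class 1 as Class 0). The function name `assign_class` and the node probabilities are hypothetical, and CART's own computation may differ in detail.

```python
def assign_class(class_probs, cost):
    """Pick the class with the lowest expected misclassification cost.

    class_probs[i] is the estimated probability of true class i in the node.
    cost[i][j] is the cost of classifying a true class i case as class j.
    """
    n = len(class_probs)
    expected = [
        sum(class_probs[i] * cost[i][j] for i in range(n)) for j in range(n)
    ]
    best = min(range(n), key=lambda j: expected[j])
    return best, expected


unit_cost = [[0, 1], [1, 0]]      # default: every error costs 1
example_cost = [[0, 5], [2, 0]]   # costs taken from the example in the text
node = [0.4, 0.6]                 # hypothetical node: 40% Class 0, 60% Class 1

print(assign_class(node, unit_cost))
print(assign_class(node, example_cost))
```

In this hypothetical node, unit costs favor Class 1 (the majority), while the example costs favor Class 0, because wrongly labeling a Class 0 case as Class 1 is the more expensive error.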

Refinement 4

This refinement involves the MODEL command. The analyst may limit the number of variables in the analysis by explicitly specifying the model as in a parametric model. This option is helpful especially in cases where it may not be possible to access a computer with a large memory.

Example: MODEL CUTDUM2 = NCERYL80 + PCLSU80 + GINI + PCDCALS + FARMING + HHSIZE.

One can also use the EXCLUDE command to exclude variables that are not needed in the analysis.

Refinement 5

The data analyst may change the default splitting rule (the Gini criterion) by using the METHOD command. For example, METHOD = LINEAR changes the default splitting criterion to linear combination splits. In this case, the METHOD command should follow the MODEL command. Under this splitting criterion, CART assumes that all of the variables in the linear combination are numeric. Therefore, unless categorical variables are transformed into sets of dummy variables, they will be treated as numeric variables.
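For readers unfamiliar with the default rule, the following is a minimal sketch of a Gini-based split on a single numeric variable: it computes the impurity of a node and searches for the threshold that reduces impurity the most. It is a textbook simplification, not Salford's implementation; it ignores priors and costs, and the function names and data values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_improvement(values, labels, threshold):
    """Decrease in Gini impurity from splitting at value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (gini(labels)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def best_threshold(values, labels):
    """Try midpoints between distinct sorted values and keep the best one."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: split_improvement(values, labels, t))


# Hypothetical data: household size and a two-class dependent variable.
hhsize = [2, 3, 4, 6, 7, 8]
cutdum2 = [0, 0, 0, 1, 1, 1]
t = best_threshold(hhsize, cutdum2)
print(t, split_improvement(hhsize, cutdum2, t))
```

A linear combination split would replace the single variable with a weighted sum of several numeric variables, which is why categorical variables must be recoded as dummy variables before they can enter such a split meaningfully.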

REGRESSION TREE PROGRAM CODES

The commands needed for producing a regression tree are basically the same as those for a classification tree, except that there is no need to specify the CATEGORY and MISCLASS COST commands. As pointed out earlier, the three basic commands needed for producing a regression tree are USE, MODEL, and BUILD.

Consider the following typical regression-tree programs:

(A)

USE 'D:CARATYEAR8187.SYS'
MODEL PPND = NDVIMNMX KRMTMNMN NDVIMXMX KRMTMXMN BEGAMNMN BEGAMXMN MZSHTTRD MZSHTTDV BEGAMNDV BELGMNDV KRMTMXDV
OUTPUT 'D:CARATYEAR8187.OUT'
BUILD

(B)

USE 'D:CARATYEAR8187.SYS'
MODEL PPND
OUTPUT 'D:CARATYRO1.OUT'
BUILD

As with classification trees, the OUTPUT command is optional. The analyst can modify this basic program by adding any of the available optional command lines into the program. In example (A), the dependent and independent variables are specified in the MODEL command. This option is useful in situations where access to a computer with large memory is limited. Option (B) uses all of the available variables in the data set and produces a regression tree. This option is especially useful if the analyst does not have any prior information about which predictors or potential predictors to use in the model.
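What a regression tree does with a continuous dependent variable can be sketched with a least-squares split, which corresponds in spirit to the LS option listed for the METHOD command in Table 5. The sketch finds the threshold on one predictor that minimizes the summed squared deviations from the mean in the two child nodes. It is an illustrative simplification with hypothetical function names and data, not the CART algorithm as implemented.

```python
def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_ls_split(xs, ys):
    """Return (threshold, total_sse) for the best split of the form x <= t."""
    uniq = sorted(set(xs))
    best = None
    for a, b in zip(uniq, uniq[1:]):
        t = (a + b) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (t, total)
    return best


# Hypothetical predictor and continuous dependent variable.
predictor = [1, 2, 3, 10, 11, 12]
response = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
threshold, total = best_ls_split(predictor, response)
print(threshold, round(total, 3))
```

Each terminal node of a regression tree then predicts the mean of the dependent variable for the cases that fall into it.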

SAVING CART TREES FOR FUTURE USE

It may be useful to recall that the main objective of running either classification or regression trees is to use the resulting tree for classifying data or predicting the class of a new observation. CART does this by dropping the data down the tree case by case, beginning from the root node. At each stage the splitting criteria are applied until the observations end up in any one of the terminal nodes. This task is accomplished by using only the USE, TREE, SAVE, and CASE commands. It should be noted that the extension of the filename created by the TREE command is always TR1 and cannot be changed. The complete program for building and saving a tree is as follows:

USE 'D:CARATPOOLSUB5.SYS'
CATEGORY CUTDUM2
MODEL CUTDUM2
TREE SECUR1
BUILD

The TREE command produces a file called SECUR1.TR1.

Suppose the analyst has a new data set called DATANEW.SYS, which contains the characteristics of new cases with an unknown class distribution. The analyst now wants to run these data down the saved tree (SECUR1.TR1) to find out the classes into which the new cases fall, and to save the case-by-case results in a data file called PREDCT.SYS. Using the CASE command line, this is written as follows:

USE 'D:DATANEW.SYS'
TREE SECUR1
SAVE PREDCT/SINGLE
IDVAR HHID
CASE

The IDVAR command line adds the identification variable (HHID) to the file PREDCT.SYS, which is created by the CASE command. The contents of the PREDCT.SYS file include the original variables used in the model and a few new variables created by CART. The RESPONSE and CORRECT variables are the most useful of the new variables. The RESPONSE variable contains the class assigned to an observation by CART. The CORRECT variable is an indicator variable. It equals 1 for correct prediction and 0 for incorrect prediction.
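The case-by-case procedure can be pictured with a small sketch that drops each record down a tree and records a RESPONSE value, plus a CORRECT indicator when the true class is known. The tree below is a hypothetical two-split example stored as nested dictionaries; it is not the format of a TR1 file, and the thresholds, function names, and data are invented for illustration.

```python
# Hypothetical saved tree: internal nodes test "variable <= threshold".
tree = {
    "var": "HHSIZE", "threshold": 5.0,
    "left": {"class": 0},
    "right": {
        "var": "INCOME", "threshold": 300.0,
        "left": {"class": 1},
        "right": {"class": 0},
    },
}


def drop_case(node, case):
    """Follow the splits from the root until a terminal node is reached."""
    while "class" not in node:
        go_left = case[node["var"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["class"]


def run_cases(tree, cases, id_var, actual_var=None):
    """Return one result row per case, keyed by the identification variable."""
    rows = []
    for case in cases:
        row = {id_var: case[id_var], "RESPONSE": drop_case(tree, case)}
        if actual_var is not None and actual_var in case:
            # CORRECT is 1 when the predicted class matches the true class.
            row["CORRECT"] = int(row["RESPONSE"] == case[actual_var])
        rows.append(row)
    return rows


new_cases = [
    {"HHID": 1, "HHSIZE": 4, "INCOME": 100.0, "CUTDUM2": 0},
    {"HHID": 2, "HHSIZE": 7, "INCOME": 200.0, "CUTDUM2": 0},
    {"HHID": 3, "HHSIZE": 8, "INCOME": 500.0, "CUTDUM2": 0},
]
results = run_cases(tree, new_cases, id_var="HHID", actual_var="CUTDUM2")
for row in results:
    print(row)
```

When the new data set has no known class, `actual_var` can be left out and only the RESPONSE value is produced for each case.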