Strengthening Policy Analysis - Econometric tests using microcomputer software + disk (IFPRI, 1995, 166 p.)

4. Specification Tests

TESTS FOR HETEROSKEDASTICITY

The major consequence of heteroskedasticity (nonconstant variance of the stochastic disturbance term) is that it causes the OLS estimate of the stochastic error variance ($\sigma^2$) to be biased, rendering hypothesis tests on coefficients invalid. Most tests for heteroskedasticity involve examining the regression residuals; the White test involves comparison of the OLS coefficient covariance matrix with a heteroskedasticity-consistent covariance matrix. The Goldfeld-Quandt, Breusch-Pagan, and White tests are described below.

These tests are quite general. The White test is the most general in the sense that it requires no specification of a model of the heteroskedastic error-generating process. The Goldfeld-Quandt test requires only that the heteroskedasticity be related to one of the regressors; the Breusch-Pagan test requires that it be related to some set of regressors.

If heteroskedasticity is detected, the usual practice is to specify a model by which the standard deviation of the stochastic disturbance can be estimated at each observation; these estimates are then used as weights in a “weighted least-squares” procedure. Alternatively, White’s method produces an estimate of the variance-covariance matrix of coefficients that is consistent in the presence of heteroskedasticity, so that tests on the OLS coefficients may be conducted. See the references for details.

The model is the usual one:

y = Xb + e.

The hypotheses to be tested are as follows:

$H_0: \sigma_i^2 = \sigma^2$ for all $i$ (constant variance, no heteroskedasticity);

$H_1: \sigma_i^2 \neq \sigma^2$ for at least one $i$ (heteroskedasticity).

Goldfeld-Quandt Test

This older test is only applicable when there is a strong a priori reason to believe that the variance of the error term is explicitly related to one of the explanatory variables, say Xk. This test comprises the following steps:

Step 1

Reorder the data by magnitude of the observations on Xk, from smallest to largest.

Step 2

Partition the ordered data set into three subsets, each of size C = N/3. Delete the middle subset, then denote the subset with small values of Xk as set 1 and the subset with large values of Xk as set 2.

Step 3

Perform OLS (using all of the regressors in X) on set 1 and set 2 separately and get the residual sum of squares (RSS) from each set.

Step 4

If set 2 has the higher RSS, the estimated variance of the residuals is positively correlated with the size of Xk. Calculate $F = RSS_2/RSS_1$. If set 1 has the higher RSS (negative correlation between Xk and the estimated variance of the residuals), then calculate $F = RSS_1/RSS_2$. Under the null hypothesis, the test statistic is distributed as

$F \sim F[(N-C-2K)/2,\ (N-C-2K)/2]$.

Compare to a standard F-table; if $F > F_{critical}$ at the desired level of significance, then reject $H_0$ of homoskedasticity.

The GAUSS-386 program (Figure 10) produces a Goldfeld-Quandt statistic of 1.5164, with 541 numerator degrees of freedom and 542 denominator degrees of freedom. The P-value is 0.0000, indicating a strong rejection of the hypothesis of no heteroskedasticity. The SAS PC (Figure 11) and SPSS/PC+ (Figure 12) F-statistics differ slightly (F = 1.4972, not enough to alter the conclusions) because the programs select slightly different numbers of observations for the lower and upper thirds of the data set. The sample programs for this section use the same basic model that will be used in subsequent sections. It is assumed that the error variance is monotonically related to variable X10.
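The four steps of the Goldfeld-Quandt procedure can also be sketched outside the three packages covered here. The following is a minimal illustration in plain Python (standard library only), not a translation of the programs in Figures 10 through 12. The function names `ols` and `goldfeld_quandt` and the simulated data are hypothetical. The split points follow the lower-third and upper-third rule used in the GAUSS-386 program, and no P-value is computed because the standard library has no F distribution.

```python
import random


def ols(X, y):
    """Least squares via the normal equations; returns (coefficients, residuals)."""
    n, k = len(X), len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(k)) for r in range(n)]
    return beta, resid


def goldfeld_quandt(X, y, col):
    """X: rows beginning with 1.0 (intercept); col: column index of Xk.

    Returns (F statistic, numerator df, denominator df).
    """
    n, k = len(X), len(X[0])
    order = sorted(range(n), key=lambda r: X[r][col])    # Step 1: sort by Xk
    low = order[: n // 3]                                # Step 2: lower third
    up = order[(2 * n) // 3:]                            #         upper third
    rss = []
    for subset in (low, up):                             # Step 3: separate OLS fits
        _, e = ols([X[r] for r in subset], [y[r] for r in subset])
        rss.append(sum(v * v for v in e))
    rss1, rss2 = rss
    if rss2 >= rss1:                                     # Step 4: larger RSS on top
        return rss2 / rss1, len(up) - k, len(low) - k
    return rss1 / rss2, len(low) - k, len(up) - k


# Simulated data (an assumption for illustration): error spread grows with x.
rng = random.Random(0)
x = [float(i + 1) for i in range(60)]
X = [[1.0, xi] for xi in x]
y = [2.0 + 0.5 * xi + xi * rng.gauss(0.0, 1.0) for xi in x]

stat, ndf, ddf = goldfeld_quandt(X, y, col=1)
print("F =", stat, "df =", ndf, ddf)
```

The statistic would then be compared with the tabulated F critical value for the reported degrees of freedom, as in Step 4.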

NOTE: Some authors recommend using relatively large significance levels (say, 25 percent to 50 percent) for tests of heteroskedasticity such as the Goldfeld-Quandt test, since the consequences of heteroskedasticity are severe and consistent estimators are readily available.

Recommended References: Fomby, Hill, and Johnson (1984, 193-194); Goldfeld and Quandt (1965, 539-547); Greene (1990, 420); Griffiths, Hill, and Judge (1993, 498-499); Judge et al. (1984, 449); Kennedy (1985, 97; 1992, 118); Kmenta (1986, 292-294); Maddala (1988, 164).

Figure 10 - Sample program for Goldfeld-Quandt test, in GAUSS-386

/*****************************************************************
* PROGRAM: GQTEST.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: PERFORM THE GOLDFELD-QUANDT TEST.
*****************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = GQTEST.OUT RESET;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

@------- ASSUME THAT HETEROSKEDASTICITY IS RELATED TO X10 --------@
@------- AND SORT ENTIRE DATA SET ACCORDINGLY --------@

DATA = SORTC(DATA,IX10);
Y = DATA[.,IY1];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- CHOOSE LOWER-THIRD AND UPPER-THIRD DATA SUBSETS --------@

NL = FLOOR(NCASE/3);
YL = Y[1:NL,.];
XL = X[1:NL,.];
NL = ROWS(XL);

NU = FLOOR(2*NCASE/3) + 1;
YU = Y[NU:NCASE,.];
XU = X[NU:NCASE,.];
NU = ROWS(XU);

@-------- OLS REGRESSIONS ON DATA SUBSETS --------@

K = COLS(XL);
BL = INV(XL'XL)*XL'YL; @ BETAS @
E = YL- XL*BL; @ RESIDUALS @
RSSL = E'E; @ RESIDUAL SUM OF SQUARES @
SER = SQRT(INV(NL-K)*RSSL); @ STD ERR OF REGRESSION @
RSQ = 1 - RSSL/((NL-1)*(STDC(YL))^2); @ R-SQUARED @
COV = INV(NL-K)*RSSL*INV(XL'XL); @ COV MATRIX OF BETAS @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = BL ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NL-K)); @ P-VALUES @
PRN = BL ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS FOR LOWER DATA SUBSET ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NL;
" " ;
" STANDARD ERROR OF REGRESSION = ";; SER;
" ";
" RESIDUAL SUM OF SQUARES = ";; RSSL;
" ";
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];
I = 1;
DO WHILE I <= K-1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;

ENDO;
" ";
"\f";

K = COLS(XU);
BU = INV(XU'XU)*XU'YU; @ BETAS @
E = YU - XU*BU; @ RESIDUALS @
RSSU = E'E; @ RESIDUAL SUM OF SQUARES @
SER = SQRT(INV(NU-K)*RSSU); @ STD ERR OF REGRESSION @
RSQ = 1 - RSSU/((NU-1)*(STDC(YU))^2); @ R-SQUARED @
COV = INV(NU-K)*RSSU*INV(XU'XU); @ COV MATRIX OF BETAS @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = BU ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NU-K)); @ P-VALUE @
PRN = BU ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS FOR UPPER DATA SUBSET ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NU;
" " ;
" STANDARD ERROR OF REGRESSION = ";; SER;
" ";
" RESIDUAL SUM OF SQUARES = ";; RSSU;
" ";
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K-1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
" ";

@-------- CALCULATION OF G/Q TEST STATISTIC --------@

IF RSSL <= RSSU;

F = RSSU/RSSL;
NDF = NU-K;
DDF = NL-K;

ELSE;
F = RSSL/RSSU;
NDF = NL-K;
DDF = NU-K;

ENDIF;
PROB = CDFFC(F,NDF,DDF);

" GOLDFELD/QUANDT RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS IN LOWER DATA SET =";; NL;
" NUMBER OF OBSERVATIONS IN UPPER DATA SET =";; NU;
" ";
" RESIDUAL SUM OF SQUARES FOR LOWER REGRESSION =";; RSSL;
" RESIDUAL SUM OF SQUARES FOR UPPER REGRESSION =";; RSSU;
" ";
" G/Q F-STATISTIC = ";; F;; " P-VALUE =";; PROB;
" ";

"\f";

OUTPUT FILE = GQTEST.OUT OFF;
SYSTEM;

Figure 11 - Sample program for Goldfeld-Quandt test, in SAS PC

*************************************************************
* PROGRAM: GQTEST.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: PERFORM GOLDFELD-QUANDT TEST.
************************************************************;

LIBNAME CDRV 'C:DATA';

* WE SUSPECT THAT THE VARIANCE OF THE DISTURBANCE TERM IS RELATED TO X10;

* PROC RANK CREATES A NEW VARIABLE (RX10) WITH VALUES OF 0, 1, OR 2
* CORRESPONDING TO THREE EQUAL GROUPS;

PROC RANK DATA=CDRV.DATA OUT=DRANK GROUP=3;

VAR X10;
RANKS RX10;

RUN;

PROC REG DATA=DRANK;

WHERE RX10 = 0;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

RUN;

PROC REG DATA=DRANK;

WHERE RX10 = 2;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

RUN;

* TEST STATISTIC CALCULATION FROM OUTPUT;
* RSS1 = RESIDUAL SUM OF SQUARES FROM THE FIRST REGRESSION;
* RSS2 = RESIDUAL SUM OF SQUARES FROM THE SECOND REGRESSION;
* CONSTRUCT F = RSS1/RSS2 IF RSS1>RSS2, OR F = RSS2/RSS1 IF RSS2>RSS1;
* DEGREES OF FREEDOM = ((N-C-2*K)/2), ((N-C-2*K)/2);
* N=NUMBER OF OBSERVATIONS (1624), C = MIDDLE THIRD OF OBSERVATIONS
* DROPPED (538);
* K=NUMBER OF PARAMETERS IN MODEL (19). FOR THIS EXAMPLE, F=1.4972, AND THE
* NULL HYPOTHESIS OF HOMOSCEDASTICITY (WITH RESPECT TO X10) IS REJECTED;

Figure 12 - Sample program for Goldfeld-Quandt test, in SPSS/PC+

SET MORE OFF.
SET LIS = 'GQTEST.LIS'.
SET LOG = 'GQTEST.LOG'.
****************************************************************
* PROGRAM: GQTEST.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: PERFORM GOLDFELD-QUANDT TEST.
***************************************************************.

GET FILE = 'DATA.SYS' .

* WE SUSPECT THAT THE VARIANCE OF THE DISTURBANCE TERM IS RELATED TO X10.

* RANK CREATES A NEW VARIABLE (RX10) WITH VALUES OF 1, 2, OR 3
* CORRESPONDING TO THREE EQUAL GROUPS.

RANK X10/NTILE (3) INTO RX10.

PROCESS IF ( RX10 = 1 ).
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER.

PROCESS IF ( RX10 = 3 ).
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* RSS1 = RESIDUAL SUM OF SQUARES FROM THE FIRST REGRESSION.
* RSS2 = RESIDUAL SUM OF SQUARES FROM THE SECOND REGRESSION.
* CONSTRUCT F = RSS1/RSS2 IF RSS1>RSS2, OR F = RSS2/RSS1 IF RSS2>RSS1.
* DEGREES OF FREEDOM = ((N-C-2*K)/2), ((N-C-2*K)/2).
* N=NUMBER OF OBSERVATIONS (1624), C = MIDDLE THIRD OF OBSERVATIONS
* DROPPED (538).
* K=NUMBER OF PARAMETERS IN MODEL (19); FOR THIS EXAMPLE, F=1.4972, AND THE
* NULL HYPOTHESIS OF HOMOSCEDASTICITY (WITH RESPECT TO X10) IS REJECTED.
FINISH.

Breusch-Pagan Test

This test assumes that the disturbance terms, $e_i$, are normally and independently distributed. Moreover, the variances of the $e_i$ are assumed to be of the form $\sigma_i^2 = f(Z_i a)$, where Z is a set of p variables (these may be a subset of the X variables) thought to influence the heteroskedasticity (Z also includes a constant term), $Z_i$ is the ith row of Z, and a is a conformable vector of coefficients. This test does not depend on the functional form of f. The test evaluates whether the variables in Z have explanatory power for the variation in squared standardized residuals from the original model.

The model is the usual one:

y=Xb + e

The Breusch-Pagan test proceeds by the following steps:

Step 1

Estimate the model by OLS and save the vector of residuals e.

Step 2

Compute $\hat{\sigma}^2 = e'e/N$ and the N × 1 vector v, where $v_i = e_i^2/\hat{\sigma}^2$.

Step 3

Specify the variables in Z, regress v on Z, and compute the explained sum of squares (ESS, sometimes called the regression or model sum of squares).

Step 4

Calculate the statistic Q = ESS/2. Q is asymptotically chi-squared ($\chi^2$) with (p - 1) degrees of freedom.

Step 5

Compare Q to the critical $\chi^2_{p-1}$ value at the desired level of significance. If $Q > \chi^2_{p-1}$, then reject $H_0$.
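The five steps above can be sketched in a few lines of plain Python (standard library only). This is an illustrative sketch rather than a translation of the programs in Figures 13 through 15. The names `ols` and `breusch_pagan` and the simulated data are hypothetical, and Z is set equal to X, as in the sample programs.

```python
import random


def ols(X, y):
    """Least squares via the normal equations; returns (coefficients, residuals)."""
    n, k = len(X), len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(k)) for r in range(n)]
    return beta, resid


def breusch_pagan(X, y, Z):
    """Returns (Q, degrees of freedom); X and Z rows begin with 1.0 (constant)."""
    n = len(y)
    _, e = ols(X, y)                                  # Step 1: OLS residuals
    sigma2 = sum(ei * ei for ei in e) / n             # Step 2: e'e / N
    v = [ei * ei / sigma2 for ei in e]                #         standardized squares
    d, _ = ols(Z, v)                                  # Step 3: regress v on Z
    fitted = [sum(zj * dj for zj, dj in zip(row, d)) for row in Z]
    mean_fit = sum(fitted) / n
    ess = sum((f - mean_fit) ** 2 for f in fitted)    #         explained sum of squares
    return ess / 2.0, len(Z[0]) - 1                   # Step 4: Q = ESS/2, df = p - 1


# Simulated data (an assumption for illustration): error spread grows with x.
rng = random.Random(0)
x = [float(i + 1) for i in range(60)]
X = [[1.0, xi] for xi in x]
y = [2.0 + 0.5 * xi + xi * rng.gauss(0.0, 1.0) for xi in x]

Q, df = breusch_pagan(X, y, X)
print("Q =", Q, "df =", df)
```

Step 5 then compares Q with the tabulated chi-squared critical value for the returned degrees of freedom.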

In the sample programs (Figures 13 through 15), the same model is used as before and the variables in Z are selected to be identical with those in X. The Q value is 68.1561 (P-value = 0.0000) and again, the hypothesis of no heteroskedasticity is strongly rejected.

NOTE: As with the Goldfeld-Quandt test, some writers recommend using relatively large significance levels for the Breusch-Pagan test.

Recommended references: Breusch and Pagan (1979, 1287-1294); Fomby, Hill, and Johnson (1984, 195-196); Greene (1990, 421-422); Griffiths, Hill, and Judge (1993, 498-500); Judge et al. (1984, 446-447); Kennedy (1985, 97-98, 108; 1992, 118, 130-131); Kmenta (1986, 294-295); Maddala (1988, 164).

Figure 13 - Sample program for Breusch-Pagan test, in GAUSS-386

/*****************************************************************
* PROGRAM: BPTEST.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT TEST DATA SET
* PURPOSE: PERFORM BREUSCH-PAGAN TEST
*****************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = BPTEST.OUT RESET;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);
Y = DATA[.,IY1];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

K = COLS(X);

@-------- OLS REGRESSION --------@

B = INV(X'X)*X'Y; @ OLS BETAS @
E = Y - X*B; @ OLS RESIDUALS @
RSS = E'E; @ RESIDUAL SUM SQUARES @
SER = SQRT(INV(NCASE - K)*RSS); @ S.E. OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ VAR-COV MATRIX OF B @
SE = SQRT(DIAG(COV)); @ S.E. OF B ELEMENTS @
T = B ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

@-------- PRINT OLS RESULTS --------@

" ";
" ";
" ";
" OLS RESULTS";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K-1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;
ENDO;

"\f";

@-------- CONSTRUCTION OF STANDARDIZED SQUARED RESIDUALS --------@

G = (E .^ 2)/(INV(NCASE)*E'E);

@-------- CHOOSE REGRESSORS THAT EXPLAIN HETEROSKEDASTICITY --------@
@-------- A COMMON CHOICE IS Z = X --------@

Z = X;
K = COLS(Z);
D = INV(Z'Z)*Z'G; @ B-P COEFFICIENTS @
E = G - Z*D; @ RESIDUALS FROM AUX REG @
RSS = E'E; @ B-P REG RSS @
COV = INV(NCASE-K)*RSS*INV(Z'Z); @ COV MATRIX FOR D COEFFS @
SE = SQRT(DIAG(COV)); @ S.E. OF D ELEMENTS @
T = D ./ SE; @ T-STATISTICS FOR D @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = D ~ SE ~ T ~ PT; @ FOR PRINTING @

GHAT = Z*D; @ FITTED STANDARDIZED @
@ SQUARED RESIDUALS @

ESS = SUMC( (GHAT - MEANC(GHAT))^2 ); @ ESS FROM B-P REGRESSION @
Q = ESS/2; @ B-P TEST STATISTIC @
DF = K - 1; @ P-1 DEGREES OF FREEDOM @
PCHI = CDFCHIC(Q,DF); @ P-VALUE FOR Q @

@-------- PRINT B-P REGRESSION AND B-P TEST STATISTIC --------@
" ";
" ";
" ";
" AUXILIARY B-P REGRESSION RESULTS";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" EXPLAINED SUM-OF-SQUARES = ";; ESS;
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K-1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

" ";
" ";
" BREUSCH-PAGAN TEST STATISTIC: Q = ";; Q;
" ";
" DEGREES OF FREEDOM = ";; DF;
" ";
" P-VALUE =";; PCHI;

"\f";

OUTPUT FILE = BPTEST.OUT OFF;
SYSTEM;

Figure 14 - Sample program for Breusch-Pagan test, in SAS PC

*************************************************************
* PROGRAM: BPTEST.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: PERFORM BREUSCH-PAGAN TEST.
************************************************************;

LIBNAME CDRV 'C:DATA';

* VARIANCE OF DISTURBANCE TERM THOUGHT TO BE RELATED TO ALL
* EXPLANATORY VARIABLES;

PROC REG DATA=CDRV.DATA;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
OUTPUT OUT=PRED R=RES;

RUN;

DATA E2;

SET PRED;
E2 = RES**2;
CONSTANT = 1;

RUN;

PROC SUMMARY DATA=E2;

VAR E2;
ID CONSTANT;
OUTPUT OUT=MEANE2 MEAN=MEANE2;

RUN;

DATA G;

MERGE E2 MEANE2;
BY CONSTANT;
G = E2/MEANE2;

RUN;

PROC REG DATA=G;

MODEL G=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

RUN;

* TEST STATISTIC CALCULATION FROM OUTPUT;
* FROM THIS REGRESSION GET THE EXPLAINED SUM OF SQUARES (ESS);
* (SOMETIMES CALLED THE REGRESSION OR MODEL SUM OF SQUARES) ;
* THEN ESS/2 IS DISTRIBUTED CHI-SQUARED WITH P-1 DEGREES;
* OF FREEDOM, SO COMPARE TO CHI-SQUARED CRITICAL TABLE.;
* FOR THIS EXAMPLE, P=19 AND THE CRITICAL VALUE =28.869.;
* FOR THIS EXAMPLE, CHI-SQUARED TEST STATISTIC = 68.156.;
* REJECT NULL HYPOTHESIS OF NO HETEROSKEDASTICITY.;

Figure 15 - Sample program for Breusch-Pagan test, in SPSS/PC+

SET MORE = OFF.
SET LIS = 'BPTEST.LIS'.
SET LOG = 'BPTEST.LOG'.
******************************************************************
* PROGRAM: BPTEST.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: PERFORM BREUSCH-PAGAN TEST.
*****************************************************************.

GET FILE = 'DATA.SYS' .

* VARIANCE OF DISTURBANCE TERM THOUGHT TO BE RELATED TO ALL
* EXPLANATORY VARIABLES.

REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER
/SAVE = RESID(RES).

COMPUTE E2 = RES**2.
COMPUTE CONSTANT = 1.
SAVE OUT='E2.SYS'.

AGGREGATE OUTFILE='MEANE2.SYS'

/BREAK=CONSTANT
/MEANE2=MEAN(E2).

JOIN MATCH FILE='E2.SYS'

/TABLE='MEANE2.SYS'
/BY CONSTANT.

COMPUTE G=(E2/MEANE2).

REGRESSION VARIABLES = G X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3
/DEPENDENT=G
/METHOD=ENTER.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* FROM THIS REGRESSION GET THE EXPLAINED SUM OF SQUARES (ESS) (SOMETIMES
* CALLED THE REGRESSION OR MODEL SUM OF SQUARES).
* THEN ESS/2 IS DISTRIBUTED CHI-SQUARED WITH P-1 DEGREES
* OF FREEDOM, SO COMPARE TO CHI-SQUARED CRITICAL TABLE.
* FOR THIS EXAMPLE, P=19 AND THE CHI-SQUARED CRITICAL VALUE = 28.869.
* FOR THIS EXAMPLE, CHI-SQUARED TEST = 68.156.
* REJECT NULL HYPOTHESIS OF HOMOSCEDASTICITY.
FINISH.

The White Test

The presence of heteroskedasticity makes the OLS variance-covariance matrix of coefficients inconsistent. White (1980) introduced an estimated variance-covariance matrix for the OLS coefficients that is consistent under heteroskedasticity. White also introduced a test statistic for heteroskedasticity based on the extent to which the OLS variance-covariance matrix departs from White’s heteroskedasticity-consistent covariance matrix.

One great advantage of White’s procedure is that it produces an estimator for the variance-covariance matrix of coefficients that is consistent in the presence of heteroskedasticity, so that tests regarding the coefficients can be conducted without having to first correct for the heteroskedasticity. However, the White test may not be as powerful as some alternative tests that use more specific information about the form of the heteroskedasticity.

White’s heteroskedasticity-consistent covariance matrix and the original form of his test are clearly laid out in several references, including those listed below. The test using the full set of explanatory variables is only presented in GAUSS-386 (Figure 16). This is because it is quite time-consuming to compute White’s test manually in SPSS/PC+ and SAS PC for anything but a small set of explanatory variables (see Figures 17 and 18). While SAS PC has options for computing the heteroskedasticity-consistent covariance matrix and White’s test automatically (ACOV SPEC), the SPEC algorithm appears to have a bug that a patch could not completely correct.

The GAUSS-386 program has two parts: first, the heteroskedasticity-consistent covariance matrix is computed, then the test for heteroskedasticity is conducted. Note that the investigator must be vigilant to avoid introducing redundancies among the constructed regressors for this test, especially if dummy variables are present.
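For reference, the two covariance estimators compared in the first part of the program can be written compactly. The notation below is a sketch inferred from the GAUSS-386 code in Figure 16 (the matrices OLSC and HETC), with $x_i'$ denoting the ith row of X, $e_i$ the ith OLS residual, and K the number of columns of X. The labels $\hat{V}_{OLS}$ and $\hat{V}_{W}$ are introduced here for illustration and do not appear in the source.

```latex
\hat{V}_{OLS} = \frac{e'e}{N-K}\,(X'X)^{-1},
\qquad
\hat{V}_{W} = (X'X)^{-1}\left(\sum_{i=1}^{N} e_i^{2}\, x_i x_i'\right)(X'X)^{-1}
```

The heteroskedasticity-consistent standard errors printed by the program are the square roots of the diagonal elements of $\hat{V}_{W}$.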

The procedure for computing the test is as follows:

Step 1

Perform ordinary least squares on the model, save the residual vector e, and construct an N × 1 vector of squared residuals, $e^2$.

Step 2

Compute the squares and cross-products of all regressors, deleting all redundancies. The obvious redundancies are those produced by the constant term and dummy variables. For example, the square of a 0/1 dummy variable reproduces the dummy itself, and products with the constant reproduce the original regressors. Your final set of regressors should include the original variables and all nonredundant squares and cross-products.

Step 3

Regress the squared residuals, $e^2$, on the regressors from step 2, using OLS. Retain the $R^2$ from this auxiliary regression.

Step 4

Compute the test statistic, $W = N \times R^2$.

Step 5

W will be asymptotically distributed $\chi^2$, with degrees of freedom equal to the number of regressors in step 3 (excluding the constant). If W exceeds the critical $\chi^2$ value at the desired level of significance, the null hypothesis of no heteroskedasticity is rejected.
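As an illustration of the five steps, the auxiliary regression for a small model with two continuous regressors can be sketched in plain Python (standard library only). This is not a translation of the programs in Figures 16 through 18. The names `ols` and `white_test` and the simulated data are hypothetical, and the sketch makes no attempt to remove the redundancies that dummy variables would introduce.

```python
import random


def ols(X, y):
    """Least squares via the normal equations; returns (coefficients, residuals)."""
    n, k = len(X), len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(k)) for r in range(n)]
    return beta, resid


def white_test(X, y):
    """X rows begin with 1.0 (constant) followed by continuous regressors.

    Returns (W, degrees of freedom).
    """
    n, k = len(X), len(X[0])
    _, e = ols(X, y)                                   # Step 1: OLS residuals
    s = [ei * ei for ei in e]                          #         squared residuals
    aug = []                                           # Step 2: add squares and
    for row in X:                                      #         cross-products
        extra = [row[i] * row[j] for i in range(1, k) for j in range(i, k)]
        aug.append(list(row) + extra)
    _, u = ols(aug, s)                                 # Step 3: auxiliary regression
    s_bar = sum(s) / n
    tss = sum((si - s_bar) ** 2 for si in s)
    r2 = 1.0 - sum(ui * ui for ui in u) / tss
    return n * r2, len(aug[0]) - 1                     # Steps 4-5: W = N * R^2


# Simulated data (an assumption for illustration): error spread grows with x1.
rng = random.Random(1)
rows = [(float(i % 7), float((3 * i) % 5)) for i in range(70)]
X = [[1.0, a, b] for a, b in rows]
y = [1.0 + 2.0 * a - b + (1.0 + a) * rng.uniform(-1.0, 1.0) for a, b in rows]

W, df = white_test(X, y)
print("W =", W, "df =", df)
```

With two regressors the auxiliary regression contains five regressors besides the constant (two levels, two squares, one cross-product), so the statistic is compared with a chi-squared critical value on five degrees of freedom.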

The sample GAUSS-386 program produces the White heteroskedasticity-consistent covariance matrix. Notice that the square roots of its diagonal elements are quite different from the OLS standard errors; it is expected that a formal test of the differences will find them significant. The test statistic, W, is 192.384 (df = 183). The null hypothesis of no heteroskedasticity is rejected.

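The SAS PC and SPSS/PC+ programs for the reduced explanatory variable set produce a test statistic, W = 24.3031. The auxiliary regression in these programs contains 14 regressors besides the constant, so df = 14. The 5 percent critical value is 23.685, and again the null hypothesis of no heteroskedasticity is rejected.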

Recommended references: Fomby, Hill, and Johnson (1984, 196); Greene (1990, 403-404); Kennedy (1985, 98, 108; 1992, 90, 118, 130-131); Kmenta (1986, 295-296); Maddala (1988, 162); Messer and White (1984, 181-184); White (1980, 817-838).

Figure 16 - Sample program for White test, in GAUSS-386

/*********************************************************************************
* PROGRAM: WHITE.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* CONSISTENT DATA ERRORS
* INPUTS : DATA.DAT GAUSS-386 DATA SET
* PURPOSE: CONSTRUCT ESTIMATES OF VARIANCE-COVARIANCE
* MATRIX THAT ARE CONSISTENT IN PRESENCE OF
* HETEROSKEDASTICITY AND DO WHITE TEST.
* THIS PROGRAM RUNS ABOUT AN HOUR ON A 386-
* 25 MHZ MACHINE WITH 4 MB RAM. IT USES
* EXTENDED MEMORY EXTENSIVELY, HENCE ITS
* LONG RUN TIME.
*********************************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = WHITE.OUT RESET;
NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

Y = DATA[.,IY1];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- OLS ESTIMATION --------@

K = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
OLSC = INV(NCASE-K)*RSS*INV(X'X); @ OLS COV MATRIX @
SE = SQRT(DIAG(OLSC)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" OLS RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" ";
" STANDARD ERROR OF REGRESSION = ";; SER;
" ";
" RESIDUAL SUM OF SQUARES = ";; RSS;
" ";
" R-SQUARED = ";; RSQ;
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K - 1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;
ENDO;

" ";

@-------- SQUARE THE RESIDUAL FOR EACH OBSERVATION. --------@

S = E.^2;

@-------- HETEROSKEDASTICITY-CONSISTENT COVARIANCE MATRIX --------@

HETCM = ZEROS(K,K);

I = 1;

DO WHILE I <= NCASE;

HETCT = S[I,.].*((X[I,.])'(X[I,.]));

HETCM = HETCM + HETCT;

I = I + 1;

ENDO;

HETC = INV(X'X)*HETCM*INV(X'X);

HSE = SQRT(DIAG(HETC));

" ";
" ";
" HETEROSKEDASTICITY-CONSISTENT STD ERRS"; HSE;

" ";

CLEAR HSE HETC HETCM HETCT PRN PT T SE OLSC RSQ SER RSS E B Y;

@-------- CONSTRUCT VARIABLES FOR WHITE-AUGMENTED --------@
@-------- REGRESSION. NOTE THAT REDUNDANCIES --------@
@-------- ARE AVOIDED BY FIRST CONSTRUCTING THE --------@
@-------- SQUARES AND CROSS-PRODUCTS OF ALL NON-DUMMY --------@
@-------- VARIABLES, THEN CONCATENATING THE DUMMIES AND --------@
@-------- THEIR INTERACTIONS WITH THE REGULAR REGRESSORS. --------@

X = X[.,1:(K-3)];

K = COLS(X);

AUGX = X;

I = 2; OUTPUT FILE = WHITE.OUT OFF;

DO WHILE I <= K;

AUGX = AUGX ~ (X[.,I] .* X[.,I:K]);

"LOOP =";; I;

I = I + 1;

ENDO; OUTPUT FILE = WHITE.OUT ON;

W = AUGX ~ (DATA[.,IRD1] .* X)

~ (DATA[.,IRD2] .* X)
~ (DATA[.,IRD3] .* X);

CLEAR AUGX X DATA;

K = COLS(W);

D = INV(W'W)*W'S;

ES = S - W*D;

CLEAR W;

RSSW = ES'ES;

RSQW = 1 - RSSW/((NCASE-1)*(STDC(S))^2);

DF = K - 1;

WTEST = NCASE*RSQW;

PW = CDFCHIC(WTEST,DF);
" ";
"WTEST =";; WTEST;; " DF =";; DF;; " P-VALUE =";; PW;

"\f";

OUTPUT FILE = WHITE.OUT OFF;
SYSTEM;

Figure 17 - Sample program for White test, in SAS PC

******************************************************************
* PROGRAM: WHITE.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: CONSTRUCT ESTIMATES OF VARIANCE-
* COVARIANCE MATRIX THAT ARE CONSISTENT
* IN PRESENCE OF HETEROSKEDASTICITY, WITH
* A REDUCED SET OF EXPLANATORY VARIABLES
*****************************************************************;

LIBNAME CDRV 'C:DATA';

PROC REG DATA=CDRV.DATA;

MODEL Y1=X1 X2 X9 X10;
OUTPUT OUT=RDATA R=RES;

RUN;

DATA XRDATA;

SET RDATA;
RESSQ = RES**2;

* EACH VARIABLE SQUARED;

ZX1 = X1**2;
ZX2 = X2**2;
ZX9 = X9**2;
ZX10 = X10**2;

* INTERACTION WITH X1;

X1X2 = X1*X2;
X1X9 = X1*X9;
X1X10 = X1*X10;

* INTERACTION WITH X2;

X2X9 = X2*X9;
X2X10 = X2*X10;

* INTERACTION WITH X9;

X9X10 = X9*X10;

RUN;

PROC REG DATA=XRDATA;

MODEL RESSQ=X1 X2 X9 X10

ZX1 ZX2 ZX9 ZX10
X1X2 X1X9 X1X10
X2X9 X2X10
X9X10;

RUN;

* TEST STATISTIC CALCULATION FROM OUTPUT;
* THE WHITE TEST STATISTIC, W, EQUALS R-SQUARED FROM THE SECOND REGRESSION
* (WHICH CONTAINS THE TRANSFORMATIONS OF X1, X2, X9, AND X10)
* MULTIPLIED BY THE NUMBER OF OBSERVATIONS USED IN THE REGRESSION. W IS
* DISTRIBUTED AS CHI-SQUARED WITH DEGREES OF FREEDOM EQUAL TO THE NUMBER OF
* REGRESSORS IN THE SECOND REGRESSION, EXCLUDING THE CONSTANT. IF W
* IS GREATER THAN THE CRITICAL CHI-SQUARED VALUE, THEN THE NULL HYPOTHESIS
* OF HOMOSCEDASTICITY IS REJECTED.;

* FOR THIS EXAMPLE N=1624, R-SQ= 0.01496, AND W=24.295, WITH 14 DEGREES OF
* FREEDOM. THE NULL HYPOTHESIS OF HOMOSCEDASTICITY IS REJECTED.;

Figure 18 - Sample program for White test, in SPSS/PC+

SET MORE=OFF.
SET LIS = 'WHITE.LIS'.
SET LOG = 'WHITE.LOG'.
******************************************************************
* PROGRAM: WHITE.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: CONSTRUCT ESTIMATES OF VARIANCE-
* COVARIANCE MATRIX THAT ARE CONSISTENT
* IN PRESENCE OF HETEROSKEDASTICITY, WITH
* A REDUCED SET OF EXPLANATORY VARIABLES
*****************************************************************.

GET FILE = 'DATA.SYS' .
REGRESSION VARIABLES =

Y1 X1 X2 X9 X10
/DEPENDENT=Y1
/METHOD=ENTER
/SAVE RESID(RES).

COMPUTE RESSQ = RES**2.

* EACH VARIABLE SQUARED.
COMPUTE ZX1 = X1**2.
COMPUTE ZX2 = X2**2.
COMPUTE ZX9 = X9**2.
COMPUTE ZX10 = X10**2.

* INTERACTION WITH X1.
COMPUTE X1X2 = X1*X2.
COMPUTE X1X9 = X1*X9.
COMPUTE X1X10 = X1*X10.

* INTERACTION WITH X2.
COMPUTE X2X9 = X2*X9.
COMPUTE X2X10 = X2*X10.

* INTERACTION WITH X9.
COMPUTE X9X10 = X9*X10.

REGRESSION VARIABLES =

RESSQ X1 X2 X9 X10
ZX1 ZX2 ZX9 ZX10
X1X2 X1X9 X1X10
X2X9 X2X10
X9X10
/DEPENDENT=RESSQ
/METHOD=ENTER.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* THE WHITE TEST STATISTIC, W, EQUALS R-SQUARED FROM THE SECOND REGRESSION
* (WHICH CONTAINS THE TRANSFORMATIONS OF X1, X2, X9, AND X10)
* MULTIPLIED BY THE NUMBER OF OBSERVATIONS USED IN THE REGRESSION. W IS
* DISTRIBUTED AS CHI-SQUARED WITH DEGREES OF FREEDOM EQUAL TO THE NUMBER OF
* REGRESSORS IN THE SECOND REGRESSION, EXCLUDING THE CONSTANT. IF W
* IS GREATER THAN THE CRITICAL CHI-SQUARED VALUE, THEN THE NULL HYPOTHESIS
* OF HOMOSKEDASTICITY IS REJECTED.

* FOR THIS EXAMPLE N=1624, R-SQ= 0.01496, AND W=24.295, WITH 14 DEGREES OF
* FREEDOM. THE NULL HYPOTHESIS OF HOMOSKEDASTICITY IS REJECTED.
FINISH.

NORMALITY OF RESIDUALS: THE JARQUE-BERA TEST

If the elements of the disturbance vector are not normally distributed, the OLS estimators for b are still best linear unbiased, but the usual t-tests and F-tests are no longer appropriate, and asymptotically justified tests should be used instead.

The Jarque-Bera test checks whether the skewness (symmetry) and kurtosis (fatness of tails) of the distribution of residuals match the skewness and kurtosis expected under the null hypothesis that the disturbances are normally distributed. Skewness is measured by $\sqrt{b_1} = \mu_3/\mu_2^{3/2}$ and kurtosis is measured by $b_2 = \mu_4/\mu_2^2$, where estimates of the moments $\mu_r$ are given by $\frac{1}{N}\sum_{i=1}^{N} e_i^r$ (r = 2, 3, 4). Under the null hypothesis that the disturbances are normally distributed, $b_1 = 0$ and $b_2 = 3$. Thus, the null hypothesis is

$H_0: b_1 = 0$ and $b_2 = 3$.

The alternative hypothesis is that the disturbances are not normal and belong to a class of distributions called the “Pearson family.” The test statistic is

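$h = N\left[\frac{z_1}{6} + \frac{(z_2 - 3)^2}{24}\right]$,

where $z_1$ and $z_2$ are the estimates of $b_1$ and $b_2$, and N is the number of observations. Under the null hypothesis, h has an asymptotic $\chi^2$ distribution with 2 degrees of freedom. Note that h = 0 if $z_1 = 0$ and $z_2 = 3$. (The sample programs call this statistic ETA.)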

Construction of the test proceeds by the following steps:

Step 1

Estimate the model by OLS and save the residual vector, e.

Step 2

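Calculate the sample estimates of the second, third, and fourth moments of the residuals about their mean (which is zero by construction when the model includes a constant term):

$\hat{\mu}_r = \frac{1}{N}\sum_{i=1}^{N} e_i^r$ (r = 2, 3, 4),

where $\mu_r$ is the rth moment about the mean and the $e_i$'s are the OLS residuals. Denote these as $\hat{\mu}_2$, $\hat{\mu}_3$, and $\hat{\mu}_4$, respectively.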

Step 3

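Calculate $z_1 = \left(\hat{\mu}_3/\sqrt{\hat{\mu}_2^3}\right)^2 = \hat{\mu}_3^2/\hat{\mu}_2^3$ and $z_2 = \hat{\mu}_4/\hat{\mu}_2^2$.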

Step 4

Calculate h and compare it to the critical $\chi^2$ value with two degrees of freedom at the desired level of significance. If h exceeds the critical value, then reject the null hypothesis. This would imply that the disturbance terms are not normally distributed.
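Steps 2 through 4 amount to a few lines of arithmetic once the residuals are available. The sketch below, in plain Python (standard library only), takes a list of residuals as given. The function name `jarque_bera` and the toy residuals are hypothetical. The P-value uses the fact that the upper-tail probability of a chi-squared variable with 2 degrees of freedom is exp(-h/2).

```python
import math


def jarque_bera(resid):
    """Return (h, p_value) for residuals assumed to have mean zero."""
    n = len(resid)
    # Step 2: second, third, and fourth sample moments about zero.
    mu2 = sum(e ** 2 for e in resid) / n
    mu3 = sum(e ** 3 for e in resid) / n
    mu4 = sum(e ** 4 for e in resid) / n
    # Step 3: estimates of b1 (squared skewness) and b2 (kurtosis).
    z1 = mu3 ** 2 / mu2 ** 3
    z2 = mu4 / mu2 ** 2
    # Step 4: test statistic and chi-squared(2) upper-tail probability.
    h = n * (z1 / 6.0 + (z2 - 3.0) ** 2 / 24.0)
    p_value = math.exp(-h / 2.0)
    return h, p_value


# Toy residuals (an assumption for illustration): symmetric, so z1 = 0.
resid = [-2.0, -1.0, 0.0, 1.0, 2.0]
h, p_value = jarque_bera(resid)
print("h =", h, "p =", p_value)   # h = 5 * 1.69 / 24, about 0.352
```

For these toy residuals the second moment is 2 and the fourth moment is 6.8, so $z_2 = 1.7$ and the whole statistic comes from the kurtosis term.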

For this model, the Jarque-Bera test statistic is 274.2360 (P-value = 0.0000) and the null hypothesis of normality of disturbance terms is rejected.

Figures 19, 20, and 21 are sample programs for the Jarque-Bera test, in GAUSS-386, SAS PC, and SPSS/PC+, respectively.

NOTE: For an additional normality test, see Shapiro and Wilk (1965) and Shapiro, Wilk, and Chen (1968).

Recommended references: Bowman and Shenton (1975, 243-250); Jarque and Bera (1981); Kennedy (1992, 79); Kmenta (1986, 260-267).

Figure 19 - Sample program for Jarque-Bera test, in GAUSS-386

/***************************************************************************
* PROGRAM: JBTEST.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: EXECUTE AND REPORT THE JARQUE-BERA TEST
* FOR NORMALITY OF DISTURBANCES
***************************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = JBTEST.OUT RESET;
NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

Y = DATA[.,IY1];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- OLS ESTIMATION --------@

K = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS@
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";

@-------- COMPUTATION OF SECOND, THIRD, AND FOURTH MOMENTS --------@
@-------- OF OLS RESIDUALS --------@

E2 = E^2;
E3 = E^3;
E4 = E^4;

U2 = (SUMC(E2))/NCASE;
U3 = (SUMC(E3))/NCASE;
U4 = (SUMC(E4))/NCASE;

Z1 = (U3/(U2^(3/2)))^2;
Z2 = U4/(U2^2);

ETA = NCASE*((Z1/6) + (((Z2-3)^2)/24));

PCHI = CDFCHIC(ETA,2);

" ";
" JARQUE-BERA STATISTIC ETA =";; ETA;
" ";
" P-VALUE =";; PCHI;

"\f";

OUTPUT FILE = JBTEST.OUT OFF;
SYSTEM;

Figure 20 - Sample program for Jarque-Bera test, in SAS PC

****************************************************************************
* PROGRAM: JBTEST.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: EXECUTE AND REPORT THE JARQUE-BERA TEST
* FOR NORMALITY OF DISTURBANCES
***************************************************************************;

LIBNAME CDRV 'C:DATA';
PROC REG DATA=CDRV.DATA;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
OUTPUT OUT=JARQUE R=RES;

RUN;

DATA JARQUE2;

SET JARQUE;
E2=RES**2;
E3=RES**3;
E4=RES**4;
CONST=1;

RUN;

PROC SUMMARY DATA=JARQUE2;

VAR E2 E3 E4 CONST;
OUTPUT OUT=RESSUM SUM=SUME2 SUME3 SUME4 NCASE;

RUN;

DATA CALC;

SET RESSUM;
MU2=SUME2/NCASE;
MU3=SUME3/NCASE;
MU4=SUME4/NCASE;
Z1 = ((MU3)/(MU2**(3/2)))**2;
Z2 = MU4/(MU2**2);
ETA = NCASE*((Z1/6)+(((Z2-3)**2)/24));

RUN;

PROC PRINT DATA=CALC;

VAR ETA;

RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT.
* ETA IS THE TEST STATISTIC AND IS DISTRIBUTED AS CHI-SQUARED WITH TWO
* DEGREES OF FREEDOM. IF ETA IS GREATER THAN THE CRITICAL CHI-SQUARED
* THEN REJECT THE NULL HYPOTHESIS OF NORMALLY DISTRIBUTED RESIDUALS.
* ETA IN THIS EXAMPLE IS 274.24, WHICH IS LARGER THAN THE CRITICAL CHI-
* SQUARED VALUE. NORMALITY IS REJECTED.;

Figure 21 - Sample program for Jarque-Bera test, in SPSS/PC+

SET MORE = OFF.
SET LIS = 'JBTEST.LIS'.
SET LOG = 'JBTEST.LOG'.
****************************************************************************
* PROGRAM: JBTEST.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: EXECUTE AND REPORT THE JARQUE-BERA TEST
* FOR NORMALITY OF DISTURBANCES
***************************************************************************.

GET FILE = 'DATA.SYS' .
REGRESSION VARIABLES = Y1, X1 X2, X8 X9 X10 X13 X14 X15,

D1 D2 D3 D5 D6 D7 D8, RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER
/SAVE RESID(RES).

COMPUTE E2 = RES**2.
COMPUTE E3 = RES**3.
COMPUTE E4 = RES**4.
COMPUTE CONST = 1.
AGGREGATE OUTFILE = *
/BREAK=CONST
/NCASE = NU(RES)
/SUME2 SUME3 SUME4 = SUM(E2 E3 E4).

COMPUTE MU2 = SUME2/NCASE.
COMPUTE MU3 = SUME3/NCASE.
COMPUTE MU4 = SUME4/NCASE.
COMPUTE Z1 = ((MU3)/(MU2**(3/2)))**2.
COMPUTE Z2 = MU4/MU2**2.
COMPUTE ETA = NCASE*((Z1/6)+(((Z2-3)**2)/24)).
LIST ETA.
* TEST STATISTIC CALCULATION FROM OUTPUT.
* ETA IS THE TEST STATISTIC AND IS DISTRIBUTED AS CHI-SQUARED WITH TWO
* DEGREES OF FREEDOM. IF ETA IS GREATER THAN THE CRITICAL CHI-SQUARED
* THEN REJECT THE NULL HYPOTHESIS OF NORMALLY DISTRIBUTED RESIDUALS.
* ETA IN THIS EXAMPLE IS 274.24, WHICH IS LARGER THAN THE CRITICAL CHI-
* SQUARED VALUE. NORMALITY IS REJECTED.
FINISH.

ERRORS IN VARIABLES

A crucial assumption of the classical linear regression model is that the elements of the X matrix of regressors are nonstochastic. If any regressor is stochastic and correlated with the disturbance term, the problem of endogeneity (for example, simultaneity bias) arises. One common source of endogeneity is measurement error in the regressors.

There is little doubt that almost all observed variables are measured with error. While the emergence of extensive household surveys represents a wealth of information at the level of the household and individual, the possibility and consequences of measurement error in those data should be considered.

This discussion focuses on the simple linear regression model, that is, the model with a single regressor. The extension to the multiple regression context is straightforward and is illustrated in the sample programs (Figures 22 through 24).

Assume that

$y = \beta x + \varepsilon$ (1)

denotes the true model and that both x and y are measured with error. Let the errors be µ and v, respectively. Assume that the errors are normally distributed, with mean zero, and with constant variances so that

$\mu \sim N(0, \sigma_\mu^2)$,

and

$v \sim N(0, \sigma_v^2)$.

Moreover, assume that v and µ are uncorrelated with each other and are uncorrelated with all elements of x. Now write

$x^* = x + \mu$

and

$y^* = y + v$,

where an asterisk denotes an observed as opposed to a true value. Rewriting equation 1 gives

$y^* - v = \beta (x^* - \mu) + \varepsilon$

or

$y^* = \beta x^* + w$ (2)

where

$w = \varepsilon + v - \beta \mu$.

If x is measured with error, then the OLS assumption, cov(w, x*) = 0, is violated because x* and w both contain µ. In fact, the covariance between the stochastic regressor, x*, and the error term is

$\mathrm{cov}(x^*, w) = -\beta \sigma_\mu^2$

(see Maddala 1988, 381 for details), and the OLS estimate of β is biased toward zero.

In the multiple regression framework, the coefficient of the erroneously measured regressor is also biased toward zero. In addition, the coefficients on the remaining regressors are biased, but establishing the signs of the biases is more complicated.

The consequences of measurement error in y, as opposed to x, are very different. If x is measured without error, then measurement error in the dependent variable, y, is merely absorbed into the additive error term ($\varepsilon + v$), which does not violate any of the assumptions of the classical OLS model.
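A small simulation can make the bias toward zero concrete. The Python sketch below is illustrative only: the parameter values, the seed, and the function names `ols_slope` and `simulate` are assumptions chosen for the example. It generates a true regressor, adds measurement error with the same variance, and compares the OLS slopes. In large samples the slope on the mismeasured regressor is expected to approach β multiplied by the ratio of the variance of x to the variance of x*, which is 0.5 here.

```python
import random


def ols_slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(beta=2.0, sd_x=1.0, sd_mu=1.0, sd_eps=0.5, n=20000, seed=12345):
    """Return the OLS slopes using the true x and the error-ridden x*."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, sd_eps) for xi in x_true]
    x_obs = [xi + rng.gauss(0.0, sd_mu) for xi in x_true]  # x* = x + mu
    return ols_slope(x_true, y), ols_slope(x_obs, y)


slope_true_x, slope_observed_x = simulate()
print("slope using true x:     ", round(slope_true_x, 3))
print("slope using observed x*:", round(slope_observed_x, 3))
```

The first slope should be close to the assumed β of 2, while the second should be close to 1, which is the attenuation described in the text.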

Below, two tests that examine the importance of measurement error in regressors are discussed.

The Hausman Test

The Hausman test takes advantage of the instrumental variables (IV) estimator, which (with appropriate instruments) is consistent in the presence of measurement error. Under the null hypothesis of no measurement error, the IV estimator is consistent but inefficient, while OLS is consistent and efficient. The essence of the Hausman test is to determine whether the difference between the OLS and IV estimators is statistically significant.

Now return to the multiple linear regression model,

$y = X\beta + \varepsilon$,

and assume that the kth variable in X is measured with error. As a consequence, all elements of the OLS estimator of β are biased.

The Hausman test is implemented by first constructing an IV estimator for the model. Assume that there exists a matrix of L additional regressors that are highly correlated with $X_k$ but uncorrelated with ε. A common method for constructing instruments is to regress the matrix X on a set of regressors Z that includes all variables in X except $X_k$ and all L additional regressors, so that Z has (K + L - 1) regressors. The fitted value of $X_k$ is then used as an instrument for $X_k$. The columns of X excluding $X_k$ are simply replicated, but the kth column is replaced by fitted values. Call this matrix $\hat{X}$. The instrumental variables estimator is then

$\hat{\beta}_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$.

Let $V_{IV} = (\hat{X}'\hat{X})^{-1}$. Then a consistent estimator for the asymptotic variance-covariance matrix is

$\hat{\sigma}^2 V_{IV}$,

where

$\hat{\sigma}^2 = e'e / (N-K)$,

with

$e = y - X\hat{\beta}_{IV}$.

Notice that X is used here rather than $\hat{X}$. By comparison, the OLS estimator is

$\hat{\beta}_{OLS} = (X'X)^{-1}X'y$,

and $V_0$ is defined as $(X'X)^{-1}$.

The difference between the OLS and IV estimators is defined as

$q = \hat{\beta}_{OLS} - \hat{\beta}_{IV}$.

Finally, the Hausman statistic is defined:

$W = q_k [V_{IV} - V_0]_k^{-1} q_k / \hat{\sigma}^2$,

where $\hat{\sigma}^2$ may be estimated either from the OLS residuals or from the IV residuals, and where $q_k$ is the kth element of q and $[V_{IV} - V_0]_k^{-1}$ is the inverse of the kth diagonal element of $[V_{IV} - V_0]$. Many presentations of this test statistic do not indicate that it is constructed with the subvectors and submatrices designated by k. As Griffiths, Hill, and Judge (1993, 476) point out, those presentations assume that Z and X have no columns in common. When they do have columns in common, then the test statistic is constructed with the subvectors and submatrices that correspond with the columns of X not also in Z, namely the kth column that has been replaced by fitted values.

W is asymptotically chi-square, with one degree of freedom. Sample values of W that exceed the selected critical value indicate significant differences between the OLS and IV estimators, hence indicate the presence of measurement error (or other source of endogeneity). Please refer to the references for cases in which more than one regressor is measured with error.

The Hausman test may be implemented in the following steps:

Step 1

Regress X on the set of instrumental variables Z and retain the fitted values:

$\hat{X} = Z(Z'Z)^{-1}Z'X$.

Step 2

Regress y on the set of instruments, $\hat{X}$, to give

$\hat{\beta}_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$.

Step 3

Calculate the Hausman statistic as described above.

If the Hausman statistic is statistically significant, then reject the hypothesis of no endogeneity and use the instrumental variables estimates. Otherwise, the OLS estimates are suitable.
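The three steps can be sketched in Python for a small simulated data set. This is an illustration under stated assumptions, not one of the programs supplied with this manual: the helper names (`transpose`, `matmul`, `inverse`, `hausman`), the simulated data, and the single instrument z are all hypothetical. The error variance is taken from the IV residuals, which is one of the two options mentioned above.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for row in a]


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def hausman(y, X, Z, k):
    """Hausman statistic for column k of X; X and Z are lists of rows."""
    n, K = len(X), len(X[0])
    ycol = [[v] for v in y]
    Xt, Zt = transpose(X), transpose(Z)
    # Step 1: fitted values X_hat = Z (Z'Z)^-1 Z'X
    X_hat = matmul(Z, matmul(inverse(matmul(Zt, Z)), matmul(Zt, X)))
    Xh_t = transpose(X_hat)
    V_iv = inverse(matmul(Xh_t, X_hat))
    V_0 = inverse(matmul(Xt, X))
    # Step 2: IV and OLS coefficient vectors
    b_iv = [r[0] for r in matmul(V_iv, matmul(Xh_t, ycol))]
    b_ols = [r[0] for r in matmul(V_0, matmul(Xt, ycol))]
    # IV residuals are computed with X, not X_hat
    e = [yi - sum(x * b for x, b in zip(row, b_iv)) for yi, row in zip(y, X)]
    s2 = sum(v * v for v in e) / (n - K)
    # Step 3: scalar statistic built from the kth elements only
    q = b_ols[k] - b_iv[k]
    W = q * q / (s2 * (V_iv[k][k] - V_0[k][k]))
    p_value = math.erfc(math.sqrt(W / 2.0))  # chi-square(1) upper tail
    return b_ols, b_iv, W, p_value


rng = random.Random(2024)
X, Z, y = [], [], []
for _ in range(2000):
    x_true = rng.gauss(0.0, 1.0)
    x_obs = x_true + rng.gauss(0.0, 1.0)   # regressor measured with error
    z = x_true + rng.gauss(0.0, 1.0)       # instrument related to x_true only
    X.append([1.0, x_obs])
    Z.append([1.0, z])
    y.append(1.0 + 2.0 * x_true + rng.gauss(0.0, 0.5))

b_ols, b_iv, W, p_value = hausman(y, X, Z, k=1)
print("OLS slope:", round(b_ols[1], 3), " IV slope:", round(b_iv[1], 3))
print("Hausman W:", round(W, 2), " p-value:", p_value)
```

Because the simulated regressor is built with measurement error, the OLS slope should fall well below the assumed true value of 2, the IV slope should be close to it, and W should exceed the 5 percent critical value of 3.84.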

The Hausman-Wu Test

An alternative approach to testing for endogeneity of a single variable in X is provided by the Hausman-Wu test:

Step 1

Regress $X_k$ on the set of instrumental variables Z and retain the first-stage residuals:

$\hat{u} = X_k - Z(Z'Z)^{-1}Z'X_k$.

Step 2

Add the vector of first-stage residuals to the original regression specification,

$y = X\beta + \hat{u}\gamma + \varepsilon = W\delta + \varepsilon$,

where $W = [X, \hat{u}]$ and $\delta = [\beta', \gamma]'$.

Step 3

Estimate this equation by OLS and test whether the coefficient on $\hat{u}$ is zero. If the estimate is statistically significantly different from zero, then reject the null hypothesis that $X_k$ is exogenous.

Notice that the estimators of β obtained here are identical to the IV estimators obtained above. Notice also that, to obtain correct IV residuals and covariance matrix, the influence of $\hat{u}$ must be omitted from the calculation of $s^2$. The correct covariance matrix is given by $s^2(W'W)^{-1}$.

Note that classical distribution theory does not imply that the t-ratio on the coefficient of interest in the Wu test follows the t-distribution with the usual degrees of freedom. The t-ratio in this case is asymptotically normally distributed, so a z-test (with a statement of asymptotic justification) is appropriate. If the same estimator of the error variance has been used to construct the Hausman statistic and the Hausman-Wu test, then the square of the t-ratio on the residual $\hat{u}$ identically equals the Hausman statistic.

Note that using SAS PC or SPSS/PC+ to perform a manual two-stage IV or Hausman-Wu estimation does not automatically produce the correct variance estimator.
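The augmented-regression procedure, including the variance correction, can be sketched in Python as follows. This is a hedged illustration, not one of the manual's programs: the helper names (`ols`, `hausman_wu`, and the small matrix routines), the simulated data, and the single instrument z are assumptions made for the example. The residuals used for the error variance are formed from X and its coefficients only, so the first-stage residual does not enter them.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for row in a]


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def ols(y, X):
    """Return (coefficients, (X'X)^-1) for a regression of y on X."""
    Xt = transpose(X)
    xtx_inv = inverse(matmul(Xt, X))
    b = [r[0] for r in matmul(xtx_inv, matmul(Xt, [[v] for v in y]))]
    return b, xtx_inv


def hausman_wu(y, X, Z, k):
    """Return (coefficients on [X, u], corrected z-ratio on u, p-value)."""
    n, K = len(X), len(X[0])
    xk = [row[k] for row in X]
    g, _ = ols(xk, Z)                              # Step 1: first stage
    u = [v - sum(z * c for z, c in zip(zr, g)) for v, zr in zip(xk, Z)]
    W = [row + [ui] for row, ui in zip(X, u)]      # Step 2: W = [X, u]
    d, wtw_inv = ols(y, W)                         # Step 3: OLS on W
    # Error variance from residuals that omit the contribution of u
    e = [yi - sum(x * c for x, c in zip(row, d[:K])) for yi, row in zip(y, X)]
    s2 = sum(v * v for v in e) / (n - K)
    z_ratio = d[K] / math.sqrt(s2 * wtw_inv[K][K])
    p_value = math.erfc(abs(z_ratio) / math.sqrt(2.0))  # two-sided normal
    return d, z_ratio, p_value


rng = random.Random(2024)
X, Z, y = [], [], []
for _ in range(2000):
    x_true = rng.gauss(0.0, 1.0)
    X.append([1.0, x_true + rng.gauss(0.0, 1.0)])   # regressor with error
    Z.append([1.0, x_true + rng.gauss(0.0, 1.0)])   # instrument
    y.append(1.0 + 2.0 * x_true + rng.gauss(0.0, 0.5))

d, z_ratio, p_value = hausman_wu(y, X, Z, k=1)
print("coefficients on X:", [round(v, 3) for v in d[:2]])
print("z-ratio on first-stage residual:", round(z_ratio, 3))
```

The coefficients on X reported here coincide with the two-stage least-squares estimates, which is the equivalence noted in the text, and the z-ratio is compared with the standard normal critical value.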

In GAUSS-386, two sample programs, HAUSMAN.G and HAUSMNWU.G (Figure 22), illustrate the procedures described above. In both cases, the programs test whether variable X10 is correlated with the stochastic disturbance terms. For SAS PC and SPSS/PC+, it is simpler to use the procedures as indicated in the sample programs, HAUSMNWU.SAS and HAUSMNWU.SPS. Notice that the coefficient estimates, standard errors, and t-ratios are identical for both types of programs (t = 1.6238) and that the Hausman statistic is equal to the square of the t-ratio on the residual u of the Hausman-Wu technique (W = 2.637). The null hypothesis of no endogeneity of X10 cannot be rejected at the 5 percent level.

Recommended references: Berndt (1991, 379-380); Greene (1990, 303); Griffiths, Hill, and Judge (1993, 458-476); Hausman (1978); Kennedy (1985, 71, 80, 119, 138, 187; 1992, 135, 148, 169-170); Kmenta (1986, 365); Maddala (1988, 435-441).

Figure 22 - Sample programs for Hausman test and Hausman-Wu test, in GAUSS-386

22a - HAUSMAN.G program

/*****************************************************************
* PROGRAM: HAUSMAN.G SOFTWARE: GAUSS-386-V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: PERFORM OLS AND IV ESTIMATION, THEN
* COMPARE THEM VIA THE HAUSMAN TEST TO
* CHECK FOR EVIDENCE OF MEASUREMENT ERROR
*****************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = HAUSMAN.OUT RESET;
NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);
Y = DATA[.,IY1];

XO = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMESX = NAMES[IX1 IX2 IX8 IX9 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

X10 = DATA[.,IX10];

ZO = DATA[.,IX4 IX5 IX6 IX7 IX11 IX12 ];

NAMESZ = NAMES[IX4 IX5 IX6 IX7 IX11 IX12,.];

@-------- OLS ESTIMATION ---------@

X = XO ~ X10;
NAMESX = NAMESX | "X10";

K = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SOLS = INV(NCASE-K)*RSS; @ LS ERROR VARIANCE @
SER = SQRT(SOLS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

BOLS = B;
COVOLS = COV;

" ";
" ";
" ";
" OLS RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAMESX[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
"\f";

@-------- INSTRUMENTAL VARIABLES ESTIMATION -------- @

Z = XO ~ ZO; @ NOTE THAT Z HAS ZO @

@ AND ALL X EXCEPT X10 @

K = COLS(X);

PZX = INV(X'Z*INV(Z'Z)*Z'X); @ X,Z PROJECTION INV @
BIV = PZX*X'Z*INV(Z'Z)*Z'Y; @ IV ESTIMATOR @
E = Y - X*BIV; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SIV = INV(NCASE-K)*RSS; @ IV ERROR VARIANCE @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*PZX; @ IV COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = BIV ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFNC(ABS(T)); @ P-VALUES @
PRN = BIV ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" INSTRUMENTAL VARIABLES RESULTS ";
" ";
" ";

" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR ASY Z -RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K - 1;

FORMAT /M1 /RD 12,8; $NAMESX[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

@-------- CALCULATION OF HAUSMAN TEST STATISTIC --------@

XXI = INV(X'X);

Q = BOLS[K,.] - BIV[K,.];

V = SIV*( PZX[K,K] - XXI[K,K] );

W = Q'INV(V)*Q;

DF = 1;

PW = CDFCHIC(W,DF);
" ";
" ";
" ";
" HAUSMAN TEST STATISTIC: W =";; W;
" ";
FORMAT /M1 /RD 3,0;
" DEGREES OF FREEDOM: =";; DF;
" ";
FORMAT /M1 /RD 12,4;
" P-VALUE =";; PW;
" ";
"\f";

OUTPUT FILE = HAUSMAN.OUT OFF;
SYSTEM;

Figure 22b - HAUSMNWU.G program

/**********************************************************************
* PROGRAM: HAUSMNWU.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: PERFORM OLS AND IV ESTIMATION, THEN
* CHECK FOR ENDOGENEITY VIA THE WU TEST
**********************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = HAUSMNWU.OUT RESET;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

Y = DATA[.,IY1];

XO = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMESXO = NAMES[IX1 IX2 IX8 IX9 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

X10 = DATA[.,IX10];

ZO = DATA[.,IX4 IX5 IX6 IX7 IX11 IX12 ];

NAMESZO = NAMES[IX4 IX5 IX6 IX7 IX11 IX12,.];

@-------- OLS ESTIMATION ---------@

X = XO ~ X10;
NAMESX = NAMESXO | "X10";
K = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

BOLS = B;
COVOLS = COV;

" ";
" ";
" ";
" OLS RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAMESX[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
"\f";

@-------- TWO-STAGE LEAST-SQUARES CALCULATION OF --------@
@-------- INSTRUMENTAL VARIABLES ESTIMATORS ---------@

@-------- FIRST STAGE ---------@

Z = XO ~ ZO; @ NOTE THAT Z HAS ZO @

@ AND ALL X EXCEPT X10 @

K = COLS(Z);
NAMESZ = NAMESXO | NAMESZO;

G = INV(Z'Z)*Z'X10; @ OLS OF X10 ON Z @
X10FIT = Z*G; @ FITTED X10 @
U = X10 - X10FIT; @ RESIDUALS @
RSS = U'U; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(X10))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(Z'Z); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = G ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = G ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" FIRST-STAGE RESULTS "; @ LISTING FULL DETAIL HERE IS @
" "; @ OPTIONAL; ONE MAY WISH TO @
" "; @ EXAMINE THE QUALITY OF THE @

@ FIRST STAGE. @

" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K - 1;

FORMAT /M1 /RD 12,8; $NAMESZ[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
"\f";

@-------- SECOND STAGE ESTIMATION --------@
@-------- REPLACE X10 BY X10FIT --------@

XH = XO ~ X10FIT;
NAMESX = NAMESXO | "X10FIT";

K = COLS(XH);

B = INV(XH'XH)*XH'Y; @ IV BETAS @
E = Y - X*B; @ NOTE THAT RESIDUALS @

@ USE X NOT XH! @

RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(XH'XH); @ IV COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFNC(ABS(T)); @ P-VALUES @

PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

BIV = B;
SIV = INV(NCASE-K)*RSS;

" ";
" ";
" ";
" INSTRUMENTAL VARIABLES RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR ASY Z-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAMESX[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
"\f";

@-------- THE WU TEST --------@
@-------- SPECIFY THE MATRIX OF REGRESSORS --------@
@-------- TO INCLUDE THE RESIDUAL U FROM --------@
@-------- THE FIRST STAGE OF THE IV PROCEDURE --------@

XW = XO ~ X10 ~ U;
NAMESW = NAMESXO | "X10" | "U";

B = INV(XW'XW)*XW'Y; @ WU BETAS @

K = COLS(XW) - 1;

E = Y - XW[.,1:K]*B[1:K,.]; @ RESIDUALS @

@ OMIT EFFECT OF U @

RSS = E'E; @ RESIDUAL SUM OF SQ @

SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(XW'XW); @ IV COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFNC(ABS(T)); @ P-VALUES @

PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

BIV = B;
SIV = INV(NCASE-K)*RSS;

" ";
" ";
" ";
" RESULTS FOR WU TEST REGRESSION";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR ASY Z-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K;

FORMAT /M1 /RD 12,8; $NAMESW[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
"\f";

OUTPUT FILE = HAUSMNWU.OUT OFF;
SYSTEM;

Figure 23 - Sample program for Hausman-Wu test, in SAS PC

***********************************************************************
* PROGRAM: HAUSMNWU.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: PERFORM HAUSMAN-WU TEST.
**********************************************************************;
LIBNAME CDRV 'C:DATA';

* HAUSMAN TEST WHERE VARIABLE X10 IS SUSPECTED OF BEING ENDOGENOUS IN THE
* FOLLOWING MODEL.
* PROC REG DATA=CDRV.DATA;
* MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

* VARIABLES X4, X5, X6, X7, X11, AND X12 ARE USED AS
* IDENTIFYING INSTRUMENTS FOR X10.;

* STEP 1: REGRESS X10 AGAINST EXOGENOUS EXPLANATORY VARIABLES (X1, X2, X8, X9,
* X13, X14, X15, D1, D2, D3, D5, D6, D7, D8, RD1, RD2, AND RD3) AND THE
* IDENTIFYING INSTRUMENTS (X4, X5, X6, X7, X11, AND X12) AND SAVE THE
* RESIDUALS OF X10 AS RX10.;

PROC REG DATA=CDRV.DATA;

MODEL X10=X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3

X4 X5 X6 X7 X11 X12;

OUTPUT OUT=HDATA1 R = RX10;

RUN;

* STEP 2: RUN ORIGINAL REGRESSION MODEL WITH BOTH X10 AND RX10 AS EXPLANATORY
* VARIABLES.;

PROC REG DATA=HDATA1 OUTEST=HBETA;

MODEL Y1=X1 X2 X8 X9 X10 RX10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
OUTPUT OUT=HDATA2 R=RY1;

RUN;

* NEED TO ADD CONSTANT TO BOTH DATA SETS TO MERGE BY;

DATA HDATA3;

SET HDATA2;
CONSTANT = 1;

DATA HBETA2;

SET HBETA(RENAME=(X1=CX1 X2=CX2 X8=CX8 X9=CX9 X10=CX10 RX10=CRX10

X13=CX13 X14=CX14 X15=CX15 D1=CD1 D2=CD2 D3=CD3
D5=CD5 D6=CD6 D7=CD7 D8=CD8
RD1=CRD1 RD2=CRD2 RD3=CRD3 Y1=CY1));

CONSTANT = 1;

* STEP 3: STEP 2 PRODUCES THE CORRECT INSTRUMENTAL VARIABLE (IV) ESTIMATES, BUT
* GENERATES RESIDUALS (AND THEREFORE TEST STATISTICS) THAT ARE BASED ON
* X10 AND RX10, WHEREAS THEY SHOULD ONLY BE BASED ON THE IV ESTIMATES AND X10.;
* SAVE THE IV COEFFICIENTS FROM STEP 2, AND GENERATE APPROPRIATE
* RESIDUALS (DROPPING RX10) AND SAVE THE CORRECT SUM OF SQUARED RESIDUALS.;

DATA HWRESOK;

MERGE HDATA3 HBETA2;

BY CONSTANT;

RESOK = Y1 - (INTERCEP + CX1*X1 + CX2*X2 + CX8*X8 + CX9*X9 +

CX10*X10 + CX13*X13 + CX14*X14 + CX15*X15 +
CD1*D1 + CD2*D2 + CD3*D3 + CD5*D5 + CD6*D6 +
CD7*D7 + CD8*D8 + CRD1*RD1 + CRD2*RD2 + CRD3*RD3);

RESOKSQ = RESOK ** 2;
RY1SQ = RY1 ** 2;

PROC SUMMARY DATA=HWRESOK;

VAR RESOKSQ RY1SQ CONSTANT;
OUTPUT OUT=SUMRES SUM=SRESOKSQ SRY1SQ N;

PROC PRINT DATA=SUMRES;

DATA RESULTS;

SET SUMRES;
SIGMAOK = SRESOKSQ / (N - 19);
SIGMABAD = SRY1SQ / (N - 20);
CORFACT = (SIGMABAD/SIGMAOK) ** 0.5;

PROC PRINT DATA=RESULTS;

VAR CORFACT;

* STEP 4: MULTIPLY THE T-RATIOS FROM STEP 2 BY CORFACT TO GET THE APPROPRIATE T-RATIOS.;

* TEST STATISTIC CALCULATION FROM OUTPUT.
* IF THE CORRECTED T STATISTIC ON RX10 IS GREATER THAN THE
* CRITICAL T VALUE, THEN THE
* NULL HYPOTHESIS OF THE EXOGENEITY OF X10 IS REJECTED.
* FOR THIS EXAMPLE, UNCORRECTED T-RATIO ON RX10=1.6280. THE CORRECTION FACTOR
* IS 0.99744, SO THE CORRECT T-RATIO ON RX10 IS 1.6238.
* WE FAIL TO REJECT THE NULL HYPOTHESIS OF THE EXOGENEITY OF X10.;

Figure 24 - Sample program for Hausman-Wu test, in SPSS/PC+

SET MORE OFF.
SET LIS = 'HAUSMNWU.LIS'.
SET LOG = 'HAUSMNWU.LOG'.
**********************************************************************
* PROGRAM: HAUSMNWU.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: PERFORM HAUSMAN-WU TEST.
**********************************************************************.
GET FILE = 'DATA.SYS'.

* HAUSMAN TEST WHERE VARIABLE X10 IS SUSPECTED OF BEING ENDOGENOUS IN THE
* FOLLOWING MODEL.
* REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6
* D7 D8 RD1 RD2 RD3
* /DEPENDENT=Y1
* /METHOD=ENTER.

* VARIABLES X4, X5, X6, X7, X11, AND X12 ARE USED AS
* IDENTIFYING INSTRUMENTS FOR X10.

* STEP 1: REGRESS X10 AGAINST EXOGENOUS EXPLANATORY VARIABLES (X1, X2, X8, X9,
* X13, X14, X15, D1, D2, D3, D5, D6, D7, D8, RD1, RD2, AND RD3) AND THE
* IDENTIFYING INSTRUMENTS (X4, X5, X6, X7, X11, AND X12) AND SAVE THE
* RESIDUALS OF X10 AS RX10.

REGRESSION VARIABLES = X10 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3
X4 X5 X6 X7 X11 X12
/DEPENDENT=X10
/METHOD=ENTER
/SAVE RESID(RX10).

* STEP 2: RUN ORIGINAL REGRESSION MODEL WITH BOTH X10 AND RX10 AS EXPLANATORY
* VARIABLES.

REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 RX10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER
/SAVE RESID(RY1).

SAVE OUT='HDATA.SYS'.

***************************************************************************************************
** VIEW THE OUTPUT FROM THIS REGRESSION AND USE THE BETA COEFFICIENTS TO
** COMPUTE A PREDICTED VALUE FOR OKAY RESIDUAL.
***************************************************************************************************.

GET FILE = 'HDATA.SYS'.

COMPUTE RESOK = Y1-(X1 * 35.595780 + X2 * 42.396332 +

X8 * -.133989 + X9 * -26.758650 +
X10 * 146.932871 +
X13 * -1.464556 + X14 * -1.349472 +
X15 * -6.187087 + D1 * -351.786732 +
D2 * -129.845550 + D3 * 214.195837 +
D5 * -320.026103 + D6 * 55.695210 +
D7 * 510.051877 + D8 * 785.090139 +
RD1 * 235.775053 + RD2 * 47.125762 +
RD3 * -145.397285 + 1145.454039).

COMPUTE RESOKSQ = RESOK ** 2.
COMPUTE RY1SQ = RY1 ** 2.
COMPUTE CONSTANT=1.

AGGREGATE OUTFILE=*

/BREAK=CONSTANT
/COUNT=N
/SRESOKSQ SRY1SQ = SUM(RESOKSQ RY1SQ).

COMPUTE SIGMAOK = SRESOKSQ / (COUNT - 19).
COMPUTE SIGMABAD = SRY1SQ / (COUNT - 20).
COMPUTE CORFACT = (SIGMABAD/SIGMAOK) ** 0.5.

FORMATS ALL (F9.5).
LIST CORFACT.

* STEP 4: MULTIPLY T'S FROM STEP 2 BY CORFACT TO GET APPROPRIATE T'S.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* IF THE CORRECTED T STATISTIC ON RX10 IS GREATER THAN THE
* CRITICAL T VALUE, THEN THE
* NULL HYPOTHESIS OF THE EXOGENEITY OF X10 IS REJECTED.
* FOR THIS EXAMPLE, T ON RX10 = . WE FAIL TO REJECT THE NULL
* HYPOTHESIS OF THE EXOGENEITY OF X10.

FINISH.

The Levi Bounds (for Assessing the Presence of Measurement Error)

The Levi bounds may be calculated to indicate the presence of measurement error. It is well known that if only one regressor is measured with error, the OLS coefficient of that regressor is biased toward zero. If the roles of this regressor and the dependent variable are reversed in the regression, the coefficient on the artificial regressor is an estimator of the inverse of the coefficient on the original regressor. This estimator is also biased toward zero, but its inverse is biased away from zero. If the coefficient on the original regressor is taken as a lower bound for a consistent estimator and the inverse of the coefficient on the artificial regressor is taken as an upper bound for a consistent estimator, then it is expected that the size of this interval reflects the severity of the measurement error problem.

Levi’s procedure is very simple to execute, but no formal statistical test is performed. Whether the interval between lower and upper bounds is “large” is a matter of judgment for the investigator. The steps below are presented in terms of a simple regression model; extension to the multiple regression model is straightforward.

Step 1

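Estimate the regression $Y_i = a_1 + b_1 X_i + e_{1i}$ and obtain the OLS estimate $\hat{b}_1$. Call this the lower bound.

Step 2

Run the “reverse” regression,

$X_i = a_2 + b_2 Y_i + e_{2i}$,

and calculate $1/\hat{b}_2$, the upper bound.

Now, examine the interval

$\hat{b}_1 < b_1 < 1/\hat{b}_2$.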

As Kmenta notes, if this interval is small, the effect of measurement error is likely to be bearable and OLS results are unlikely to be severely biased. Note that the above discussion assumes that b1 is positive. If b1 is negative, then the lower and upper bounds are reversed.

The sample programs (Figures 25 through 27) treat variable X10 as possibly susceptible to measurement error. The results are striking: the lower bound is $\hat{b}_1$ = 217 and the upper bound is $1/\hat{b}_2$ = 5,732 (5,747 in SAS PC and SPSS/PC+, due to rounding). This appears to be a very large interval, particularly in view of the statistical significance of this regressor. It is concluded that measurement error is a problem for X10. These estimated coefficients translate into calorie-income elasticities of 0.1 and 2.0, respectively. From an economic viewpoint, this is a very large interval.

Recommended references: Kmenta (1986, 346-366); Levi (1977).
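For the simple regression case, the two steps reduce to a pair of slope calculations. The Python sketch below is illustrative only: the function names `ols_slope` and `levi_bounds`, the simulated data, and the parameter values are assumptions made for the example, not part of the sample programs. The bounds are returned in numerical order, which also handles the case of a negative coefficient, where the roles of the two estimates are reversed.

```python
import random


def ols_slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def levi_bounds(x, y):
    """Return (smaller, larger) of the direct slope and the inverted reverse slope."""
    b_direct = ols_slope(x, y)      # Step 1: regress y on x
    b_reverse = ols_slope(y, x)     # Step 2: regress x on y
    return tuple(sorted([b_direct, 1.0 / b_reverse]))


# Simulated data with an assumed true coefficient of 2 and a mismeasured regressor
rng = random.Random(7)
x_obs, y = [], []
for _ in range(20000):
    x_true = rng.gauss(0.0, 1.0)
    x_obs.append(x_true + rng.gauss(0.0, 1.0))      # measurement error in x
    y.append(1.0 + 2.0 * x_true + rng.gauss(0.0, 1.0))

lower, upper = levi_bounds(x_obs, y)
print("Levi bounds:", round(lower, 3), "to", round(upper, 3))
```

In this simulated case the interval should bracket the assumed true coefficient of 2; how wide an interval is tolerable remains a matter of judgment, as noted above.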

Figure 25 - Sample program for Levi bounds test, in GAUSS-386

/*************************************************************
* PROGRAM: LEVI.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: CALCULATE "LEVI BOUNDS" FOR THE
* COEFFICIENT OF A REGRESSOR THAT MAY BE
* MEASURED WITH ERROR.
*************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = LEVI.OUT RESET;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);
Y = DATA[.,IY1];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAME1 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- OLS ESTIMATION --------@

K = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

B1 = B[6,1]; @ COEFF OF INTEREST @

" ";
" ";
" ";
" OLS RESULTS FOR THE STANDARD MODEL ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAME1[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
"\f";

@-------- OLS ESTIMATION --------@
@-------- WITH Y AND X10 REVERSED --------@

X10 = DATA[.,IX10];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IY1 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAME2 = NAMES[IX1 IX2 IX8 IX9 IY1 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

K = COLS(X);
B = INV(X'X)*X'X10; @ BETAS @
E = X10 - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(X10))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

B2 = B[6,.]; @ COEFF OF INTEREST @

" ";
" ";
" ";
" OLS RESULTS FOR THE REGRESSION MODEL WITH Y AND X10 REVERSED";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAME2[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
" ";
" ";
" BOUNDS FOR THE COEFFICIENT ON X10";
" ";
" LOWER BOUND: B =";; B1;
" ";
B2 = 1/B2;
" UPPER BOUND: B =";; B2;

"\f";

OUTPUT FILE = LEVI.OUT OFF;
SYSTEM;

Figure 26 - Sample program for Levi bounds test, in SAS PC

*************************************************************
* PROGRAM: LEVI.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: CALCULATE LEVI BOUNDS.
************************************************************;
LIBNAME CDRV 'C:DATA';

* X10 IS THE VARIABLE WE SUSPECT IS MEASURED WITH ERROR.
* STEP 1: RUN THE MODEL IN OLS.;

PROC REG DATA=CDRV.DATA;

MODEL Y1=X10 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

RUN;

* STEP 2: REVERSE Y1 AND X10 AND RUN THE MODEL IN OLS.;

PROC REG DATA=CDRV.DATA;

MODEL X10=Y1 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

RUN;

* INTERPRETATION OF OUTPUT.
* IF X10 IS MEASURED WITH RANDOM ERROR AND THE TRUE PARAMETER ON X10
* IS POSITIVE, THE ESTIMATED OLS COEFFICIENT ON X10 WILL BE BIASED TOWARDS ZERO.
* SINCE THE ESTIMATED OLS COEFFICIENT ON Y1 IN STEP 2 IS AN ESTIMATE OF
* THE RECIPROCAL OF THE PARAMETER ON X10 IN STEP 1, ITS RECIPROCAL WILL BE
* BIASED AWAY FROM ZERO. IF THE INTERVAL BETWEEN THESE TWO ESTIMATES IS NARROW
* (WITHIN PLAUSIBLE BEHAVIORAL BOUNDS), THEN MEASUREMENT ERROR ON X10 IS WITHIN
* ACCEPTABLE LIMITS.;

* FOR THIS EXAMPLE, THE LEVI BOUNDS ON THE OLS ESTIMATE FOR X10 ARE 216.97 AND
* 5747.126. EVALUATED AT THE MEANS OF X10 AND Y1, THESE TRANSLATE INTO
* ELASTICITIES OF APPROXIMATELY 0.1 AND 2.0 RESPECTIVELY. FROM AN ECONOMIC
* VIEWPOINT, THIS IS A VERY LARGE INTERVAL.;

Figure 27 - Sample program for Levi bounds test, in SPSS/PC+

SET MORE OFF.
SET LIS = 'LEVI.LIS'.
SET LOG = 'LEVI.LOG'.
****************************************************************
* PROGRAM: LEVI.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: CALCULATES LEVI BOUNDS.
***************************************************************.

GET FILE = 'DATA.SYS'.

* X10 IS THE VARIABLE WE SUSPECT IS MEASURED WITH ERROR.
* STEP 1: RUN THE MODEL IN OLS.

REGRESSION VARIABLES = Y1 X10 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3

/DEPENDENT = Y1
/METHOD = ENTER.

* STEP 2: REVERSE Y1 AND X10 AND RUN THE MODEL IN OLS.

REGRESSION VARIABLES = Y1 X10 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3

/DEPENDENT = X10
/METHOD = ENTER.

* INTERPRETATION OF OUTPUT.
* IF X10 IS MEASURED WITH RANDOM ERROR AND THE TRUE PARAMETER ON X10 IS
* POSITIVE, THE ESTIMATED OLS COEFFICIENT ON X10 WILL BE BIASED TOWARDS ZERO.
* SINCE THE ESTIMATED OLS COEFFICIENT ON Y1 IN STEP 2 IS AN ESTIMATE OF
* THE RECIPROCAL OF THE PARAMETER ON X10 IN STEP 1, ITS RECIPROCAL WILL BE
* BIASED AWAY FROM ZERO. IF THE INTERVAL BETWEEN THESE TWO ESTIMATES IS NARROW
* (WITHIN PLAUSIBLE BEHAVIORAL BOUNDS), THEN MEASUREMENT ERROR ON X10 IS WITHIN
* ACCEPTABLE LIMITS.
* FOR THIS EXAMPLE, THE LEVI BOUNDS ON THE OLS ESTIMATE FOR X10 ARE 216.97 AND
* 5747.126. EVALUATED AT THE MEANS OF X10 AND Y1, THESE TRANSLATE INTO
* ELASTICITIES OF APPROXIMATELY 0.1 AND 2.0 RESPECTIVELY. FROM AN ECONOMIC
* VIEWPOINT, THIS IS A VERY LARGE INTERVAL.
FINISH.
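
The reverse-regression logic behind the Levi bounds listings above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses a single regressor plus an intercept (the sample programs include many other covariates), and the function name levi_bounds and the data are hypothetical.

```python
def levi_bounds(x, y):
    """Return (direct, reverse) slope estimates for the effect of x on y.

    direct  = slope from regressing y on x
    reverse = 1 / (slope from regressing x on y)
    Sketch only: one regressor plus an intercept is assumed.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    direct = sxy / sxx            # step 1: y regressed on x
    reverse = 1.0 / (sxy / syy)   # step 2: x regressed on y, then inverted
    return direct, reverse


# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
lower, upper = levi_bounds(x, y)
print("lower bound:", round(lower, 4))
print("upper bound:", round(upper, 4))
```

When the slope is positive, the direct estimate can never exceed the reverse estimate, so the two values always form an interval; how wide an interval is acceptable remains a matter of economic judgment.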

TESTS FOR NONNESTED HYPOTHESES

This class of tests is used to test the validity of one model for explaining y versus another model for explaining y when neither model can be obtained by imposing linear restrictions on the other model. These “model validity” tests are popular because they allow all competing models to be rejected if all are deficient (unlike “model selection” methods, such as high $R^2$ criteria, backwards elimination, or stepwise regression, in which one model will always be chosen).

The following models are nonnested models, because Z is not a subset of W, nor is W a subset of Z:

$y = X\beta + Z\gamma + \varepsilon_1$ (2)

$y = X\beta + W\delta + \varepsilon_2$ (3)

In these competing models that explain y, the explanatory variables are contained in X, Z, and W, which are of the dimension N × K1, N × K2, and N × K3 respectively. The coefficient vectors are conformable. It is important to note that tests of these models all assume that the stochastic disturbance terms satisfy the classical assumptions.

Two popular tests for nonnested models, the nonnested F-test and the nonnested J-test, are explained below.

Nonnested F-Test

The strategy of this test is to artificially nest the two competing models in a more general model and then to test whether the restrictions that produce either original model (or both) are valid.

Step 1

Form the general model:

$y = X\beta + Z\gamma + W\delta + \varepsilon$ (4)

Step 2

Estimate the general model (4) using OLS.

Step 3

Use F-tests for incremental explanatory power to test the following three sets of hypotheses:

$H_0: \gamma = 0$,
$H_1: \gamma \neq 0$;
$H_0: \delta = 0$,
$H_1: \delta \neq 0$;
$H_0: \gamma = \delta = 0$,
$H_1$: $\gamma$ and $\delta$ are not both 0.

Note that the last hypothesis cannot be tested by combining the two separate F-tests for the coefficients on Z and on W; it requires an F-test for the joint incremental explanatory power of Z and W.

Step 4

If the estimates of $\gamma$ or $\delta$ are not significantly different from zero, the model that includes the corresponding set of variables is rejected. If both sets of coefficients are significantly different from zero, then the general model (4) is preferred; if neither is significantly different from zero, then the restricted model,

$y = X\beta + \varepsilon$, (5)

may be adequate.
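
The steps above can be sketched in Python as follows. This is a minimal illustration, not the manual's program: the helper names ols_rss and incremental_f and the small data set are hypothetical, and the OLS fit solves the normal equations by plain Gauss-Jordan elimination so that the block needs no external library. No P-values are computed; the F-statistics would be compared with tabulated critical values.

```python
def ols_rss(columns, y):
    """OLS of y on the given columns; returns (coefficients, residual SS)."""
    k, n = len(columns), len(y)
    # Augmented normal equations [X'X | X'y].
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in columns]
         + [sum(ci[t] * y[t] for t in range(n))] for ci in columns]
    for p in range(k):  # Gauss-Jordan elimination with partial pivoting
        best = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[best] = a[best], a[p]
        for r in range(k):
            if r != p:
                f = a[r][p] / a[p][p]
                a[r] = [u - f * v for u, v in zip(a[r], a[p])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b[j] * columns[j][t] for j in range(k)) for t in range(n)]
    rss = sum((y[t] - fitted[t]) ** 2 for t in range(n))
    return b, rss


def incremental_f(rss_r, rss_u, k_r, k_u, n):
    """F-statistic for the restrictions that turn the unrestricted model
    (k_u coefficients) into the restricted one (k_r coefficients)."""
    return ((rss_r - rss_u) / (k_u - k_r)) / (rss_u / (n - k_u))


# Hypothetical data, made up for illustration only.
n = 10
const = [1.0] * n
x = [float(t) for t in range(1, n + 1)]
z = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
w = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
y = [2.0, 4.1, 3.9, 6.2, 7.1, 8.3, 8.8, 9.1, 10.2, 12.5]

_, rss_u = ols_rss([const, x, z, w], y)  # general model (4)
_, rss_2 = ols_rss([const, x, z], y)     # model (2): excludes W
_, rss_3 = ols_rss([const, x, w], y)     # model (3): excludes Z
_, rss_5 = ols_rss([const, x], y)        # model (5): excludes Z and W

f_delta = incremental_f(rss_2, rss_u, 3, 4, n)  # H0: delta = 0
f_gamma = incremental_f(rss_3, rss_u, 3, 4, n)  # H0: gamma = 0
f_joint = incremental_f(rss_5, rss_u, 2, 4, n)  # H0: gamma = delta = 0
print(f_delta, f_gamma, f_joint)
```

All three statistics share the same denominator, the residual variance of the general model, which is why that model needs to be estimated only once.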

In the sample programs (Figures 28 through 30), X is taken to include a constant and variables X1, X2, X8, X9, X10, X13, X14, X15, D1, D2, D3, D5, D6, D7, D8, RD1, RD2, and RD3. Then Z = [X3, X7] and W = [X6, X12].

For the sample data set, the F-statistic for the hypothesis that $\gamma = 0$ is 2.5261 (P-value = 0.0803): at the 10 percent significance level, the variables Z should be retained in the model. The F-statistic for the hypothesis that $\delta = 0$ is 1.9970 (P-value = 0.1361): the variables W have significant explanatory power only at a significance level of, say, 15 percent. Investigators who prefer to use smaller significance levels, say 10 percent or 5 percent, would fail to reject this null hypothesis and would choose model 2 over model 3 at this point (that is, include Z but not W). Finally, the F-statistic for the hypothesis $\gamma = \delta = 0$ is 2.2438 (P-value = 0.0622), so, at the 7 percent significance level, Z and W are jointly significant. On this evidence, the completely unrestricted model (4) is the most appropriate.

This test and several others in this manual are F-tests for linear restrictions on coefficients. Good general expositions of F-tests are given in Greene (1990, Chapter 7) and Kmenta (1986, Section 10-2). See Testing for Structural Change in this manual for a fuller exposition of an F-test.

Recommended references: Davidson and MacKinnon (1981, 781-793); Greene (1990, 231-234); Kennedy (1985, 70, 79-80, 85-87; 1992, 81, 87-88); Kmenta (1986, 595-600); MacKinnon (1983, 85-158); Maddala (1988, 443-446); McAleer and Pesaran (1986, 217-371).

Figure 28 - Sample program for nonnested F-test, in GAUSS-386

/****************************************************************
* PROGRAM: NNESTF.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: PERFORM NON-NESTED F-TEST
****************************************************************/

OUTPUT FILE = NNESTF.OUT RESET;
FORMAT /M2 /RD 12,4;
NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
K = COLSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

@-------- SELECT VARIABLES THAT WILL BE USED --------@

Y1 = DATA[.,IY1];
X0 = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

Z = DATA[.,IX3 IX7];
W = DATA[.,IX6 IX12];

@-------- SELECT VARIABLE NAMES CORRESPONDING WITH VARIABLES --------@
@-------- USED IN ALTERNATIVE MODELS. NAMES MUST BE LISTED --------@
@-------- IN THE SAME ORDER AS THE VARIABLES APPEAR IN X. --------@

NAMESU = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3
IX3 IX7 IX6 IX12,.];

NAMES1 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15
ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3
IX3 IX7,.];

NAMES2 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15
ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3
IX6 IX12,.];

NAMES3 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15
ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@ -------- MODEL U ----------@

@ -------- UNRESTRICTED MODEL THAT INCLUDES BOTH Z AND W ----------@

X = X0 ~ Z ~ W;

K0 = COLS(X);

B = INV(X'X)*X'Y1; @ OLS ESTIMATION @
E = Y1 - X*B; @ RESIDUALS @
RSSU = E'E; @ UNRESTRICTED RSS @
SER = SQRT(INV(NCASE-K0)*RSSU); @ S.E. OF REGRESSION @
RSQ = 1 - RSSU/((NCASE-1)*(STDC(Y1))^2) ; @ R-SQUARED @
COV = INV(NCASE-K0)*RSSU*INV(X'X); @ VAR-COV MATRIX @
SE = SQRT(DIAG(COV)); @ S.E. OF ESTIMATES @
T = B ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-K0)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

@ --------- PRINT RESULTS -----------@

" OLS RESULTS FOR UNRESTRICTED MODEL ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSSU;
" R-SQUARED = ";; RSQ;
" ";
" VARIABLE COEFF. STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K0-1;

FORMAT /M1 /RD 12,8; $NAMESU[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
"f";

@ --------- MODEL 1 -----------@

@ --------- RESTRICTED MODEL THAT EXCLUDES W -----------@

X = X0 ~ Z;
K1 = COLS(X);

B = INV(X'X)*X'Y1; @ OLS ESTIMATION @
E = Y1 - X*B; @ RESIDUALS @
RSSR1 = E'E; @ RESTRICTED RSS 1 @
SER = SQRT(INV(NCASE-K1)*RSSR1); @ S.E. OF REGRESSION @
RSQ = 1 - RSSR1/((NCASE-1)*(STDC(Y1))^2); @ R-SQUARED @
COV = INV(NCASE-K1)*RSSR1*INV(X'X); @ VAR-COV MATRIX @
SE = SQRT(DIAG(COV)); @ S.E. OF ESTIMATES @
T = B ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-K1)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

@ --------- PRINT RESULTS -----------@

" OLS RESULTS FOR RESTRICTED MODEL THAT EXCLUDES W";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSSR1;
" R-SQUARED = ";; RSQ;
" ";
" VARIABLE COEFF. STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K1-1;

FORMAT /M1 /RD 12,8; $NAMES1[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

" ";
"f";

@ --------- MODEL 2 -----------@
@ --------- RESTRICTED MODEL THAT EXCLUDES Z -----------@

X = X0 ~ W;
K2 = COLS(X);

B = INV(X'X)*X'Y1; @ OLS ESTIMATION @
E = Y1 - X*B; @ RESIDUALS @
RSSR2 = E'E; @ RESTRICTED RSS 2 @
SER = SQRT(INV(NCASE-K2)*RSSR2); @ S.E. OF REGRESSION @
RSQ = 1 - RSSR2/((NCASE-1)*(STDC(Y1))^2); @ R-SQUARED @
COV = INV(NCASE-K2)*RSSR2*INV(X'X); @ VAR-COV MATRIX @
SE = SQRT(DIAG(COV)); @ S.E. OF ESTIMATES @
T = B ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-K2)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

@ --------- PRINT RESULTS -----------@

" OLS RESULTS FOR RESTRICTED MODEL THAT EXCLUDES Z";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSSR2;
" R-SQUARED = ";; RSQ;
" ";
" VARIABLE COEFF. STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K2-1;

FORMAT /M1 /RD 12,8; $NAMES2[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

" ";
"f";

@ --------- MODEL 3 -----------@
@ --------- RESTRICTED MODEL THAT EXCLUDES Z AND W -----------@

X = X0;

K3 = COLS(X);

B = INV(X'X)*X'Y1; @ OLS ESTIMATION @

E = Y1 - X*B; @ RESIDUALS @
RSSR3 = E'E; @ RESTRICTED RSS 3 @
SER = SQRT(INV(NCASE-K3)*RSSR3); @ S.E. OF REGRESSION @
RSQ = 1 - RSSR3/((NCASE-1)*(STDC(Y1))^2); @ R-SQUARED @
COV = INV(NCASE-K3)*RSSR3*INV(X'X); @ VAR-COV MATRIX @
SE = SQRT(DIAG(COV)); @ S.E. OF ESTIMATES @
T = B ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-K3)); @ P-VALUE @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

@ --------- PRINT RESULTS -----------@

" OLS RESULTS FOR RESTRICTED MODEL THAT EXCLUDES Z AND W";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSSR3;
" R-SQUARED = ";; RSQ;
" ";
" VARIABLE COEFF. STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K3-1;

FORMAT /M1 /RD 12,8; $NAMES3[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

" ";
"f";

@--------- F-TESTS FOR INCREMENTAL EXPLANATORY POWER ----------@

F1 = ( (RSSR1 - RSSU)/(K0 - K1) ) / ( RSSU/(NCASE - K0) );
PROB1 = CDFFC(F1,(K0 - K1),(NCASE - K0));

F2 = ( (RSSR2 - RSSU)/(K0 - K2) ) / ( RSSU/(NCASE - K0) );
PROB2 = CDFFC(F2,(K0 - K2),(NCASE - K0));

F3 = ( (RSSR3 - RSSU)/(K0 - K3) ) / ( RSSU/(NCASE - K0) );
PROB3 = CDFFC(F3,(K0 - K3),(NCASE - K0));

" F-TESTS FOR INCREMENTAL EXPLANATORY POWER ";

" ";
" ";
" F-TEST FOR MODEL 1 VS MODEL U: F =";; F1;; " PROB =";; PROB1;
" ";
" ";
" F-TEST FOR MODEL 2 VS MODEL U: F =";; F2;; " PROB =";; PROB2;
" ";
" ";
" F-TEST FOR MODEL 3 VS MODEL U; F =";; F3;; " PROB =";; PROB3;

"f";

OUTPUT FILE = NNESTF.OUT OFF;
SYSTEM;

Figure 29 - Sample program for nonnested F-test, in SAS PC

**************************************************************
* PROGRAM: NNESTF.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: PERFORM NON-NESTED F TEST.
*************************************************************;

LIBNAME CDRV 'C:DATA';

* ALL VARIABLES EXCEPT X3, X7, X6 AND X12 ARE COMMON TO ALL
* MODELS. SPECIFICATION 1 CONTAINS X3 AND X7.
* SPECIFICATION 2 CONTAINS X6 AND X12.;
* SPECIFICATION 3 DOES NOT CONTAIN X6, X12, X3, AND X7.;
PROC REG DATA=CDRV.DATA;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5

D6 D7 D8 RD1 RD2 RD3 X3 X7 X6 X12;

B1 : TEST X6=X12=0;
B2 : TEST X3=X7=0;
B3 : TEST X3=X7=X6=X12=0;

RUN;

* THE 'TEST' COMMANDS PRODUCE THE 3 F-STATISTICS DESCRIBED IN THE TEXT;
* F FROM TEST B1 = 1.9970
* F FROM TEST B2 = 2.5261
* F FROM TEST B3 = 2.2438
* CONCLUSION: RETAIN ALL FOUR VARIABLES;

Figure 30 - Sample program for nonnested F-test, in SPSS/PC+

SET MORE OFF.
SET LIS = 'NNESTF.LIS'.
SET LOG = 'NNESTF.LOG'.
********************************************************************
* PROGRAM: NNESTF.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: PERFORM NON-NESTED F TEST.
*******************************************************************.

GET FILE = 'DATA.SYS'.

* ALL VARIABLES EXCEPT X3, X7, X6 AND X12 ARE COMMON TO ALL MODELS.
* SPECIFICATION 1 CONTAINS X3 AND X7.
* SPECIFICATION 2 CONTAINS X6 AND X12.
* SPECIFICATION 3 DOES NOT CONTAIN X3, X7, X6, OR X12.
* SPECIFICATION 4 CONTAINS X3, X7, X6, AND X12.

* STEP 1: ESTIMATE SPECIFICATION 1.

REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3 X3 X7
/DEPENDENT=Y1
/METHOD=ENTER.

* STEP 2: ESTIMATE SPECIFICATION 2.

REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3 X6 X12
/DEPENDENT=Y1
/METHOD=ENTER.

* STEP 3: ESTIMATE SPECIFICATION 3 (COMPLETELY RESTRICTED MODEL).

REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER.

* STEP 4: ESTIMATE SPECIFICATION 4 (COMPLETELY UNRESTRICTED MODEL).

REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3 X3 X7 X6 X12
/DEPENDENT=Y1
/METHOD=ENTER.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* CALCULATE 3 F STATISTICS:
* SPECIFICATION 1 VERSUS 4.
* SPECIFICATION 2 VERSUS 4.
* SPECIFICATION 3 VERSUS 4.
* THESE ARE THE 3 F-STATISTICS DESCRIBED IN THE TEXT.
FINISH.

Nonnested J-Test

The J-test, developed by R. Davidson and J. G. MacKinnon, can be used to test whether one of two models having different (but possibly overlapping) sets of regressors has greater explanatory power than the other. Once again, it is assumed that the stochastic disturbance terms satisfy the classical assumptions. Let the competing models be

$y = X\beta + \varepsilon_1$, and (6)

$y = Z\delta + \varepsilon_2$ (7)

The J-test proceeds in the following steps:

Step 1

Estimate the second equation by OLS and calculate the fitted values of y, $\hat{y} = Z\hat{\delta}$. Variation in $\hat{y}$ reflects the linear influence on y of variation in the explanatory variables Z.

Step 2

Specify the augmented regression model,

$y = X\beta + \lambda\hat{y} + \varepsilon$,

where $\lambda$ is a scalar coefficient. Estimate this augmented model by OLS. If some of the explanatory variables in Z have significant explanatory power for y that is not captured by the regressors in X, then the estimate for $\lambda$ will be statistically significant.

Step 3

The standard t-ratio for the estimate of $\lambda$ produced by statistical packages is asymptotically distributed as standard normal under the null hypothesis and may be compared to standard normal critical values to test the following hypothesis (see Greene 1990, 231-233):

$H_0: \lambda = 0$
$H_1: \lambda \neq 0$

If H0 is rejected in favor of H1, then the second model has some explanatory power that is lacking in the first model.

Step 4

Reverse the roles of the two models and repeat the exercise.

Note that the null hypothesis may be rejected in both cases. If so, each model explains some variation that the other fails to explain, and the investigator may consider an augmented model that includes regressors from both X and Z. If the null hypothesis is rejected in neither case, then neither model is preferred on the basis of this test, and the investigator must use economic theory and/or other statistical results to choose.
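
The four steps of the J-test can be sketched in Python as follows. This is an illustration under stated assumptions, not the manual's program: the helper names ols and j_test_t and the data are hypothetical, and the t-ratio on the fitted values is obtained from the identity t² = F, which holds for a single linear restriction in OLS, rather than from the coefficient covariance matrix.

```python
import math


def ols(columns, y):
    """OLS of y on the given columns; returns (coefficients, fitted, RSS)."""
    k, n = len(columns), len(y)
    # Augmented normal equations [X'X | X'y].
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in columns]
         + [sum(ci[t] * y[t] for t in range(n))] for ci in columns]
    for p in range(k):  # Gauss-Jordan elimination with partial pivoting
        best = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[best] = a[best], a[p]
        for r in range(k):
            if r != p:
                f = a[r][p] / a[p][p]
                a[r] = [u - f * v for u, v in zip(a[r], a[p])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b[j] * columns[j][t] for j in range(k)) for t in range(n)]
    rss = sum((y[t] - fitted[t]) ** 2 for t in range(n))
    return b, fitted, rss


def j_test_t(own, rival, y):
    """Return (lambda estimate, t-ratio) for the rival model's fitted
    values when they are added to the own model."""
    n = len(y)
    _, yhat_rival, _ = ols(rival, y)          # step 1: fit the rival model
    _, _, rss_r = ols(own, y)                 # own model without yhat
    b, _, rss_u = ols(own + [yhat_rival], y)  # step 2: augmented model
    lam = b[-1]
    f = max(rss_r - rss_u, 0.0) / (rss_u / (n - len(own) - 1))
    return lam, math.copysign(math.sqrt(f), lam)  # step 3: t = +/- sqrt(F)


# Hypothetical data: y depends on x and w, not on z.
n = 10
const = [1.0] * n
x = [float(t) for t in range(1, n + 1)]
z = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
w = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
y = [1.0 + 0.5 * x[t] + 2.0 * w[t] + 0.1 * (-1) ** t for t in range(n)]

model1, model2 = [const, x, z], [const, x, w]
lam1, t1 = j_test_t(model1, model2, y)  # does model 2 add to model 1?
lam2, t2 = j_test_t(model2, model1, y)  # step 4: roles reversed
print(lam1, t1, lam2, t2)
```

Because the made-up y is driven by w, the t-ratio on the fitted values from model 2 is large when they are added to model 1; the reverse test carries no such guarantee.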

The sample programs that illustrate this section (Figures 31 through 33) specify and test the following models:

$y = X\beta + Z\gamma + \varepsilon_1$ (8)

$y = X\beta + W\delta + \varepsilon_2$ (9)

These models are exactly the ones described in the preceding section, on the nonnested F-test.

In these results, the coefficient for YHAT2 (the fitted y values from model 9) in augmented specification 1 is 1.0372 with t-statistic = 2.2322 (P-value = 0.0258). This indicates that the variables contained in W would contribute significant incremental explanatory power if included in model 8. By the same token, the coefficient on YHAT1 (the fitted y values from model 8) in augmented specification 2 is 0.9560 with t-statistic = 1.8920 (P-value = 0.0586). This indicates that the variables contained in Z would contribute significant incremental explanatory power, at the 10 percent level, if included in model 9. As expected, these results are qualitatively similar to those in the section on the nonnested F-test. Neither model dominates, and it appears that a model that includes variables from both specifications is called for. Notice that the t-statistics of the coefficients not associated with the fitted values in the augmented regressions are all quite small. This is because much of their explanatory power has been captured by the fitted y values, and the fitted y values are collinear with the remaining variables. Figures 31 through 33 are sample programs for the nonnested J-test.

Recommended references: Davidson and MacKinnon (1981, 781-793); Greene (1990, 231-234); Judge et al. (1984, 884-885); Kennedy (1985, 70, 79-80, 85-87; 1992, 81, 87-88); Kmenta (1986, 595-600); Maddala (1988, 443-447); McAleer and Pesaran (1986).

Figure 31 - Sample program for nonnested J-test, in GAUSS-386

/*****************************************************************
* PROGRAM: NNESTJ.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: PERFORM NON-NESTED J-TEST
*****************************************************************/

OUTPUT FILE = NNESTJ.OUT RESET;
FORMAT /M1 /RD 12,4;
NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);
Y = DATA[.,IY1];
X0 = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMES1 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 IX3 IX7,.];

NAMES2 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 IX6 IX12,.];

Z = DATA[.,IX3 IX7];
W = DATA[.,IX6 IX12];

@-------- CALCULATE FITTED YS FROM THE ALTERNATIVE MODELS --------@

X1 = X0 ~ Z;
B1 = INV(X1'X1)*X1'Y;
YHAT1 = X1*B1;
X2 = X0 ~ W;
B2 = INV(X2'X2)*X2'Y;
YHAT2 = X2*B2;

@-------- AUGMENTED REGRESSION 1 --------@

X1 = X1 ~ YHAT2;
K1 = COLS(X1);
B1 = INV(X1'X1)*X1'Y;
E1 = Y - X1*B1; @ RESIDUALS @
RSS1 = E1'E1; @ RESID SUM SQUARES @
SER = SQRT(INV(NCASE-K1)*RSS1); @ S.E. OF REGRESSION @
RSQ = 1 - RSS1/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K1)*RSS1*INV(X1'X1); @ VAR-COV MATRIX @
SE = SQRT(DIAG(COV)); @ S.E. OF ESTIMATES @
T = B1 ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-K1)); @ P-VALUES @
PRN = B1 ~ SE ~ T ~ PT; @ FOR PRINTING @

" REGRESSION RESULTS FOR AUGMENTED SPECIFICATION 1 ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS1;
" R-SQUARED = ";; RSQ;
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K1-2;

FORMAT /M1 /RD 12,8; $NAMES1[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

" YHAT2 ";; PRN[K1,.];

"f";

@-------- AUGMENTED REGRESSION 2 --------@

X2 = X2 ~ YHAT1;
K2 = COLS(X2);
B2 = INV(X2'X2)*X2'Y;

E2 = Y - X2*B2; @ RESIDUALS @
RSS2 = E2'E2; @ RESID SUM SQUARES @
SER = SQRT(INV(NCASE-K2)*RSS2); @ S.E. OF REGRESSION @
RSQ = 1 - RSS2/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K2)*RSS2*INV(X2'X2); @ VAR-COV MATRIX @
SE = SQRT(DIAG(COV)); @ S.E. OF ESTIMATES @
T = B2 ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-K2)); @ P-VALUES @
PRN = B2 ~ SE ~ T ~ PT; @ FOR PRINTING @

" REGRESSION RESULTS FOR AUGMENTED SPECIFICATION 2 ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS2;
" R-SQUARED = ";; RSQ;
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K2-2;

FORMAT /M1 /RD 12,8; $NAMES2[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

" YHAT1 ";; PRN[K2,.];

"f";

OUTPUT FILE = NNESTJ.OUT OFF;
SYSTEM;

Figure 32 - Sample program for nonnested J-test, in SAS PC

******************************************************************
* PROGRAM: NNESTJ.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: PERFORM NON-NESTED J-TEST.
*****************************************************************;

LIBNAME CDRV 'C:DATA';

* ALL VARIABLES EXCEPT X3, X7, X6 AND X12 ARE COMMON TO ALL
* MODELS. SPECIFICATION 1 CONTAINS X3, X7, AND
* SPECIFICATION 2 CONTAINS X6, X12.;

* TO TEST SPECIFICATION 1 : FIRST ESTIMATE SPECIFICATION 2.;

PROC REG DATA=CDRV.DATA;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3 X6 X12;

OUTPUT OUT=HAT2 P=YHAT2;

RUN;

* TO TEST SPECIFICATION 1 : NEXT FORCE PREDICTED VALUE FROM SPECIFICATION 2
* INTO SPECIFICATION 1;

PROC REG DATA=HAT2;

MODEL Y1=YHAT2 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3 X3 X7;

RUN;

* TO TEST SPECIFICATION 2 : FIRST ESTIMATE SPECIFICATION 1;

PROC REG;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3 X3 X7;

OUTPUT OUT=HAT1 P=YHAT1;

RUN;

* TO TEST SPECIFICATION 2 : NEXT FORCE PREDICTED VALUE FROM SPECIFICATION 1
* INTO SPECIFICATION 2;

PROC REG DATA=HAT1;

MODEL Y1=YHAT1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3 X6 X12;

RUN;

* TEST STATISTIC CALCULATION FROM OUTPUT;
* THE T STATISTIC FOR YHAT2 IS 2.2322 AND THE T STATISTIC FOR YHAT1 IS 1.8920;
* SEE TEXT FOR INTERPRETATION OF RESULTS;

Figure 33 - Sample program for nonnested J-test, in SPSS/PC+

SET MORE OFF.
SET LIS = 'NNESTJ.LIS'.
SET LOG = 'NNESTJ.LOG'.
******************************************************************
* PROGRAM: NNESTJ.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: PERFORM NONNESTED J-TEST.
*****************************************************************.

GET FILE = 'DATA.SYS'.

* ALL VARIABLES EXCEPT X3, X7, X6 AND X12 ARE COMMON TO ALL
* MODELS. SPECIFICATION 1 CONTAINS X3, X7, AND
* SPECIFICATION 2 CONTAINS X6, X12.

* TO TEST SPECIFICATION 1 : FIRST ESTIMATE SPECIFICATION 2.

REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6

D7 D8 RD1 RD2 RD3 X6 X12
/DEPENDENT=Y1
/METHOD=ENTER
/SAVE PRED(YHAT2).

* TO TEST SPECIFICATION 1 : NEXT FORCE PREDICTED VALUE FROM SPECIFICATION 2
* INTO SPECIFICATION 1.

REGRESSION VARIABLES = Y1 YHAT2 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3

D5 D6 D7 D8 RD1 RD2 RD3 X3 X7
/DEPENDENT=Y1
/METHOD=ENTER.

* TO TEST SPECIFICATION 2 : FIRST ESTIMATE SPECIFICATION 1.

REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5

D6 D7 D8 RD1 RD2 RD3 X3 X7
/DEPENDENT=Y1
/METHOD=ENTER
/SAVE PRED(YHAT1).

* TO TEST SPECIFICATION 2 : NEXT FORCE PREDICTED VALUE FROM SPECIFICATION 1
* INTO SPECIFICATION 2.

REGRESSION VARIABLES = Y1 YHAT1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5

D6 D7 D8 RD1 RD2 RD3 X6 X12
/DEPENDENT=Y1
/METHOD=ENTER.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* THE T STATISTIC FOR YHAT2 IS 2.2322 AND THE T STATISTIC FOR YHAT1 IS 1.8920.
* SEE TEXT FOR INTERPRETATION OF RESULTS.
FINISH.

OMISSION OF VARIABLES: THE RAMSEY RESET TEST

This version of the Regression Specification Error Test (RESET) may be used to test for omission of relevant explanatory variables. When one or more relevant variables (either unobserved or unobservable) are omitted from a model, the error term of the incorrect model includes the influence of the omitted variables. If proxy variable(s), Z, can be constructed to stand in for the omitted variable(s), a specification error test may be formed by testing if Z has significant incremental explanatory power for y.

In this version of RESET, a proxy variable matrix Z is constructed from the second, third, and fourth powers of the fitted values of y from the original model.

Let the model of interest be

$y = X\beta + \varepsilon$ (10)

This model is “restricted” in the sense that it does not contain the proxy variables in matrix Z. The “augmented” model does contain them.

The RESET test is then conducted following the steps described below.

Step 1

Using OLS, estimate the restricted model (10).

Step 2

Calculate fitted values:

$\hat{y} = X\hat{\beta}$

Step 3

Form the proxy variables as powers of the fitted values:

$Z = [\hat{y}^2, \hat{y}^3, \hat{y}^4]$

Step 4

Estimate the augmented model by OLS: regress y on $[X, Z]$.

Step 5

Using an F-test, check if the coefficients on the columns of the Z matrix are jointly significant. If so, the null hypothesis of no specification error is rejected.
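
The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the manual's program: the helper names ols and reset_f and the data are hypothetical, and the data are deliberately generated from a quadratic relationship so that a linear restricted model is misspecified.

```python
def ols(columns, y):
    """OLS of y on the given columns; returns (coefficients, fitted, RSS)."""
    k, n = len(columns), len(y)
    # Augmented normal equations [X'X | X'y].
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in columns]
         + [sum(ci[t] * y[t] for t in range(n))] for ci in columns]
    for p in range(k):  # Gauss-Jordan elimination with partial pivoting
        best = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[best] = a[best], a[p]
        for r in range(k):
            if r != p:
                f = a[r][p] / a[p][p]
                a[r] = [u - f * v for u, v in zip(a[r], a[p])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b[j] * columns[j][t] for j in range(k)) for t in range(n)]
    rss = sum((y[t] - fitted[t]) ** 2 for t in range(n))
    return b, fitted, rss


def reset_f(columns, y, powers=(2, 3, 4)):
    """RESET F-statistic using powers of the fitted values as proxies."""
    n, k_r = len(y), len(columns)
    _, yhat, rss_r = ols(columns, y)              # steps 1 and 2
    z = [[v ** p for v in yhat] for p in powers]  # step 3: proxy columns
    _, _, rss_u = ols(columns + z, y)             # step 4: augmented model
    k_u = k_r + len(powers)
    return ((rss_r - rss_u) / (k_u - k_r)) / (rss_u / (n - k_u))  # step 5


# Hypothetical data: y is quadratic in x, but the restricted model is linear.
x = [(t - 5) / 5 for t in range(11)]
y = [0.5 * v + v * v + 0.01 * (-1) ** t for t, v in enumerate(x)]
const = [1.0] * len(x)
f_stat = reset_f([const, x], y)
print("RESET F =", f_stat)
```

The powers of the fitted values are strongly collinear with one another, which is why some packages need a relaxed tolerance setting to enter them all; the sketch keeps the fitted values in a narrow range to limit that problem.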

In the sample programs for the nonnested F-test and the nonnested J-test (previously discussed), we examined whether a model that contained variables X3 and X7 or variables X6 and X12 was to be preferred. Evidence was found that the preferred model would contain all four variables. In illustrating the RESET test, all of these variables will be omitted in forming the restricted model to check whether the RESET test detects this omission.

In fact, the F-test for incremental explanatory power yields an F-value of 0.6024 (P-value = 0.6135), so the null hypothesis of no specification error is not rejected. The previous tests used X3 and X7, and X6 and X12, directly, but the RESET test uses no specific information about these variables. This illustrates that the RESET test may have low power for detecting misspecification. If specific variables are to be tested to determine whether they should be included in a regression model, they should be tested explicitly rather than through a nonspecific test like RESET.

Figures 34 through 36 are sample programs for the Ramsey RESET Test.

NOTES:

1. Thursby (1979, 1981, 1982) discusses using RESET in conjunction with tests for other types of specification error.

2. A method that has been shown by Monte Carlo studies to be preferable to using powers of $\hat{y}$ is that of Thursby and Schmidt (1977). They used the second, third, and fourth powers of all explanatory variables to make up the proxy matrix Z. However, with many explanatory variables, this may be unwieldy.
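
The construction described in note 2 can be sketched in Python as follows. The function name power_proxies is hypothetical. Skipping columns that contain only 0 and 1 is an assumption of this sketch, not a step taken from the cited paper: the powers of a constant of ones or of a dummy variable simply reproduce the original column and would be perfectly collinear with it.

```python
def power_proxies(columns, powers=(2, 3, 4)):
    """Build proxy columns from powers of each explanatory variable.

    Columns that take only the values 0 and 1 (a constant of ones or a
    dummy variable) are skipped, because their powers equal the column.
    """
    proxies = []
    for col in columns:
        if all(v in (0, 1) for v in col):
            continue  # powers would duplicate the original column
        for p in powers:
            proxies.append([v ** p for v in col])
    return proxies


# Hypothetical regressors: a constant, one continuous variable, one dummy.
regressors = [[1, 1, 1], [2, 3, 4], [0, 1, 0]]
z = power_proxies(regressors)
print(len(z), "proxy columns")
```

With many continuous regressors the number of proxy columns grows to three per variable, which is the unwieldiness the note refers to.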

Recommended references: Griffiths, Hill, and Judge (1993, 498-499); Judge et al. (1984, 364); Kennedy (1985, 71, 81; 1992, 95, 102, 104); Kmenta (1986, 452-455); Maddala (1988, 162, 407); Ramsey (1969, 350-371); Thursby (1979, 222-225; 1981, 117-123; 1982, 314-321); Thursby and Schmidt (1977, 635-641).

Figure 34 - Sample program for the Ramsey RESET Test, in GAUSS-386

/**************************************************************
* PROGRAM: RESET.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: PERFORM RAMSEY RESET TEST
***************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = RESET.OUT RESET;
NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);
Y = DATA[.,IY1];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15
ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- OLS ESTIMATION OF "RESTRICTED" MODEL --------@

KR = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
YHAT = X*B; @ FITTED VALUES @
E = Y - YHAT; @ RESIDUALS @
RSSR = E'E; @ RESTRICTED RSS @
SER = SQRT(INV(NCASE-KR)*RSSR); @ STD ERR OF REGRESSION @
RSQ = 1 - RSSR/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-KR)*RSSR*INV(X'X); @ COV MATRIX OF BETAS @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-KR)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" ";
" STANDARD ERROR OF REGRESSION = ";; SER;
" ";
" RESIDUAL SUM OF SQUARES = ";; RSSR;
" ";
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= KR-1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

" ";
"f";

@-------- RESET VARIABLES --------@

Y2 = YHAT^2;
Y3 = YHAT^3;
Y4 = YHAT^4;

@-------- OLS ESTIMATION OF "UNRESTRICTED" REGRESSION --------@

X = X ~ Y2 ~ Y3 ~ Y4;

KU = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
YHAT = X*B; @ FITTED VALUES @
E = Y - YHAT; @ RESIDUALS @
RSSU = E'E; @ UNRESTRICTED RSS @
SER = SQRT(INV(NCASE-KU)*RSSU); @ STD ERR OF REGRESSION @
RSQ = 1 - RSSU/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-KU)*RSSU*INV(X'X); @ COV MATRIX OF BETAS @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS @
PT = 2*CDFTC(ABS(T),(NCASE-KU)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS FOR UNRESTRICTED REGRESSION ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" ";
" STANDARD ERROR OF REGRESSION = ";; SER;
" ";
" RESIDUAL SUM OF SQUARES = ";; RSSU;
" ";
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= KU-4;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
" Y2 ";; PRN[20,.];
" Y3 ";; PRN[21,.];
" Y4 ";; PRN[22,.];
" ";
" ";

@--- F-STAT FOR INCREMENTAL EXPLANATORY POWER OF RESET VARIABLES ---@

F = ( (RSSR-RSSU) / (KU - KR) )/( RSSU / (NCASE-KU) );

PROB = CDFFC(F,(KU-KR),(NCASE-KU));

" ";
" RESET TEST STATISTIC F =";; F;; " PROB =";; PROB;

"f";

OUTPUT FILE = RESET.OUT OFF;
SYSTEM;

Figure 35 - Sample program for the Ramsey RESET Test, in SAS PC

**************************************************************
* PROGRAM: RESET.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: PERFORM RAMSEY RESET TEST.
*************************************************************;

LIBNAME CDRV 'C:DATA';

PROC REG DATA=CDRV.DATA;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
OUTPUT OUT=HAT P=YHAT;

RUN;

DATA YHATX;

SET HAT;
YHAT2=YHAT**2;
YHAT3=YHAT**3;
YHAT4=YHAT**4;

RUN;

PROC REG DATA=YHATX;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3 YHAT2 YHAT3 YHAT4;

FTEST : TEST YHAT2, YHAT3, YHAT4;

RUN;

* TEST STATISTIC CALCULATION FROM OUTPUT.
* CALCULATE F = ((RSSR-RSSU)/D)/(RSSU/(N-K)).
* WHERE RSSR IS THE RESIDUAL SUM OF SQUARES FOR THE RESTRICTED EQUATION.
* RSSU IS THE RESIDUAL SUM OF SQUARES FOR THE UNRESTRICTED EQUATION.
* N IS THE NUMBER OF CASES (1624).
* D IS THE NUMBER OF RESTRICTIONS (3).
* K IS THE NUMBER OF PARAMETERS IN UNRESTRICTED REGRESSION (22).
* IF YHAT2, YHAT3, AND YHAT4 ARE NOT JOINTLY SIGNIFICANT THEN THE
* NULL HYPOTHESIS OF NO OMITTED VARIABLES IS NOT REJECTED (FTEST=0.6024);

Figure 36 - Sample program for the Ramsey RESET Test, in SPSS/PC+

SET MORE = OFF.
SET LIS = 'RESET.LIS'.
SET LOG = 'RESET.LOG'.
****************************************************************
* PROGRAM: RESET.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: PERFORM RAMSEY RESET TEST.
***************************************************************.

GET FILE = 'DATA.SYS' .

* RESTRICTED REGRESSION.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6
D7 D8 RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER
/SAVE PRED(YHAT).

COMPUTE YHAT2 = YHAT**2.
COMPUTE YHAT3 = YHAT**3.
COMPUTE YHAT4 = YHAT**4.

* UNRESTRICTED REGRESSION.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3 YHAT2 YHAT3 YHAT4
/CRITERIA=TOLERANCE(.0000001)
/DEPENDENT=Y1
/METHOD=ENTER.

* THE LOW TOLERANCE CRITERION IS EMPLOYED TO FORCE YHAT2 AND YHAT3 INTO
* THE EQUATION. SPSS WILL ISSUE A WARNING ABOUT THIS. SAS DOES NOT
* ISSUE A WARNING ABOUT THIS. THE F-TESTS FROM SPSS AND SAS ARE IDENTICAL.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* CALCULATE F = ((RSSR-RSSU)/D)/(RSSU/(N-K)).
* WHERE RSSR IS THE RESIDUAL SUM OF SQUARES FOR THE RESTRICTED EQUATION.
* RSSU IS THE RESIDUAL SUM OF SQUARES FOR THE UNRESTRICTED EQUATION.
* N IS THE NUMBER OF CASES (1624).
* D IS THE NUMBER OF RESTRICTIONS (3).
* K IS THE NUMBER OF PARAMETERS IN UNRESTRICTED REGRESSION (22).
* IF YHAT2, YHAT3, AND YHAT4 ARE NOT JOINTLY SIGNIFICANT THEN THE
* NULL HYPOTHESIS OF NO OMITTED VARIABLES CANNOT BE REJECTED (FTEST=0.6024).
FINISH.

MULTICOLLINEARITY DIAGNOSTICS

Multicollinearity exists when there is a linear relationship among some subset of regressors in a model. Multicollinearity exists in virtually every data set but is a problem only when the linear relationship among regressors is very strong. The main effects of high multicollinearity are that the variances of the estimated coefficients are inflated and the t-statistics are consequently small; and, in extreme cases, the coefficients may be very sensitive and unstable with respect to minor changes in model specification and data.

Since multicollinearity is essentially a matter of degree, attention has focused on descriptions of its extent and on assessments of the extent to which it inflates the variances of the coefficients. Two popular methods for assessing the strength of multicollinearity are discussed below.

Auxiliary Regressions

The auxiliary regression method is more useful than the popular method of simply looking at the correlation matrix of regressors, since the latter reveals only pairwise relationships between variables. The method makes use of the fact that the R² statistic measures the extent to which one variable is a linear combination of a set of other variables. The strategy is to regress each continuous regressor, in turn, on all remaining regressors and to check the R² of each auxiliary regression. High R² values indicate the existence of strong linear dependencies. If only one linear relationship is very strong, then it provides an indication of which variable is suspect. However, if more than one linear dependency is strong, then the multicollinearity is more generally distributed among the regressors.

The steps for performing auxiliary regressions and interpreting their results are described below.

Step 1

Specify the first explanatory variable as the dependent variable and perform OLS, using the remainder of the explanatory variables (including a constant) as regressors.

Step 2

Calculate R² for this regression. A high R² (one rule of thumb is approximately 0.90 or above) indicates that the first explanatory variable is a strong linear function of the remaining explanatory variables. This general rule of thumb should be used as a benchmark, not as a strict bound.

Step 3

Repeat steps 1 and 2 for each of the continuous explanatory variables in turn.

For the eight continuous regressors in the standard model in the sample programs (Figures 37 through 39), the R² values for the auxiliary regressions range from 0.0895 to 0.8241. Since none of them reaches the 0.90 benchmark, it is concluded that multicollinearity is not severe.
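The three steps above can also be sketched outside the three packages covered in this manual. The following is a minimal illustration in Python with NumPy, which is an assumption of this sketch and not part of the sample programs. The function name auxiliary_r2 and the simulated regressors x1 through x4 are hypothetical; x3 is built to be nearly a linear combination of x1 and x2 so that its auxiliary R² exceeds the benchmark.

```python
import numpy as np

def auxiliary_r2(X):
    """R-squared from regressing each column of X on the remaining columns plus a constant."""
    n, k = X.shape
    r2 = np.empty(k)
    for j in range(k):
        y = X[:, j]                                  # step 1: column j is the dependent variable
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(n), others])    # remaining regressors plus a constant
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        tss = np.sum((y - y.mean()) ** 2)
        r2[j] = 1.0 - (resid @ resid) / tss          # step 2: R-squared of this auxiliary regression
    return r2                                        # step 3: one value per regressor

# Hypothetical data: x3 is nearly a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.1 * rng.normal(size=200)
x4 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3, x4])

r2 = auxiliary_r2(X)
for name, value in zip(["x1", "x2", "x3", "x4"], r2):
    flag = "  <- at or above the 0.90 benchmark" if value >= 0.90 else ""
    print(f"{name}: R2 = {value:.4f}{flag}")
```

In this simulated case the variables involved in the dependency show high auxiliary R² values, while the unrelated regressor x4 does not.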

Recommended references: Fomby, Hill, and Johnson (1984, 293-294); Greene (1990, 277-281); Griffiths, Hill, and Judge (1993, 436-437); Judge et al. (1984, 902-904); Kennedy (1985, 150, 153; 1992, 179-180, 183-184).

Figure 37 - Sample program for performing auxiliary regressions, in GAUSS-386

/***************************************************************************
* PROGRAM: AUXREG.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: EXECUTE AND REPORT AUXILIARY REGRESSIONS
* TO CHECK FOR MULTICOLLINEARITY
***************************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = AUXREG.OUT RESET;
NAMES = GETNAME("DATA");
OPEN D = DATA.DAT VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

X = DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

K = COLS(X);

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

" ";
" ";
" ";
" AUXILIARY DEPENDENT ";
" REGRESSION VARIABLE R-SQUARED ";
" ";

I = 1;
DO WHILE I <= K;

XA = X[.,I];

IF I == 1;

XX = X[.,2:K];

ENDIF;

IF I >= 2 AND I <= (K-1);

XX = X[.,1:(I-1)] ~ X[.,(I+1):K];

ENDIF;

IF I == K;

XX = X[.,1:(K-1)];

ENDIF;

XX = ONES(NCASE,1) ~ XX;

@-------- OLS ESTIMATION OF AUXILIARY REGRESSION --------@

KA = COLS(XX);

B = INV(XX'XX)*XX'XA; @ BETAS @
E = XA - XX*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-KA)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(XA))^2); @ R-SQUARED @
COV = INV(NCASE-KA)*RSS*INV(XX'XX); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-KA)); @ P-VALUES @

PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

FORMAT /M2 /RD 8,0; I;; FORMAT /M2 /RD 14,8; $NAMES[I,.];;
FORMAT /M2 /RD 12,4; RSQ;

I = I + 1;


ENDO;

"\f"; @ FORM FEED @

OUTPUT FILE = AUXREG.OUT OFF;
SYSTEM;

Figure 38 - Sample program for performing auxiliary regressions, in SAS PC

****************************************************************************
* PROGRAM: AUXREG.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: EXECUTE AND REPORT AUXILIARY REGRESSIONS
* TO CHECK FOR MULTICOLLINEARITY.
***************************************************************************;

LIBNAME CDRV 'C:DATA';

* THIS IS THE MODEL TO BE ESTIMATED;
PROC REG DATA=CDRV.DATA;
MM: MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

* BELOW ARE THE 8 AUXILIARY REGRESSIONS FOR THE CONTINUOUS VARIABLES;

M1: MODEL X1=X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
M2: MODEL X2=X1 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
M3: MODEL X8=X1 X2 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
M4: MODEL X9=X1 X2 X8 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
M5: MODEL X10=X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
M6: MODEL X13=X1 X2 X8 X9 X10 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
M7: MODEL X14=X1 X2 X8 X9 X10 X13 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
M8: MODEL X15=X1 X2 X8 X9 X10 X13 X14 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT.
* OBSERVE R-SQUARED IN EACH REGRESSION. ONE RULE OF THUMB IS THAT AN
* R-SQUARED VALUE OF 0.9 OR HIGHER INDICATES SERIOUS COLLINEARITY. THIS
* IS A GENERAL RULE OF THUMB, NOT A STRICT BOUND. NONE OF THE 8
* AUXILIARY REGRESSIONS IN THIS EXAMPLE HAS AN R-SQ ABOVE 0.9;

Figure 39 - Sample program for performing auxiliary regressions, in SPSS/PC+

SET MORE = OFF.
SET LIS='AUXREG.LIS'.
SET LOG='AUXREG.LOG'.
****************************************************************************
* PROGRAM: AUXREG.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: EXECUTE AND REPORT AUXILIARY REGRESSIONS
* TO CHECK FOR MULTICOLLINEARITY.
***************************************************************************.

GET FILE = 'DATA.SYS'.
* THIS IS THE MODEL TO BE ESTIMATED.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER.

* BELOW ARE THE 8 AUXILIARY REGRESSIONS FOR THE CONTINUOUS VARIABLES.
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=X1
/METHOD=ENTER.

REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=X2
/METHOD=ENTER.

REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=X8
/METHOD=ENTER.

REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=X9
/METHOD=ENTER.

REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=X10
/METHOD=ENTER.

REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=X13
/METHOD=ENTER.

REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=X14
/METHOD=ENTER.

REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=X15
/METHOD=ENTER.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* OBSERVE R-SQUARED IN EACH REGRESSION. ONE RULE OF THUMB IS THAT AN
* R-SQUARED VALUE OF 0.9 OR HIGHER INDICATES SERIOUS COLLINEARITY. THIS
* IS A GENERAL RULE OF THUMB, NOT A STRICT BOUND. NONE OF THE 8
* AUXILIARY REGRESSIONS IN THIS EXAMPLE HAS AN R-SQ ABOVE 0.9.
FINISH.

Condition Indices and the Condition Number

Strong multicollinearity among the regressors implies that at least one eigenvalue or characteristic root of the (X’X) matrix is small. Condition indices are the square roots of the ratios of the largest eigenvalue of the standardized (X’X) matrix to the remaining eigenvalues. The condition number is the largest of these values, that is, the square root of the ratio of the largest to the smallest eigenvalue. SAS PC and SPSS/PC+ both produce multicollinearity diagnostics based on condition indices as options of their regression routines. It is also easy to produce them in GAUSS-386. The steps described below may be followed in GAUSS-386.

Step 1

Compute the square roots of the diagonal elements of (X’X). Use these to form a diagonal matrix (zeros except on the diagonal), then invert the diagonal matrix and call this result S.

Step 2

Form the K × K matrix Z = SX’XS.

Step 3

Calculate the vector λ containing the K eigenvalues of Z; identify the smallest one as λmin and the largest one as λmax.

Step 4

Compute the vector of condition indices C as follows:

C = (λmax/λ)^(1/2), where the division is carried out element by element.

The largest of these indices is the condition number.
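The four steps can be traced in a few lines of code. The sketch below uses Python with NumPy, which is an assumption of this illustration and not one of the packages covered here. The function name condition_indices and the simulated data are hypothetical; x2 is constructed as x1 plus a small disturbance so that one eigenvalue of Z is close to zero.

```python
import numpy as np

def condition_indices(X):
    """Condition indices of X'X after scaling each column of X to unit length."""
    XtX = X.T @ X
    S = np.diag(1.0 / np.sqrt(np.diag(XtX)))   # step 1: inverse of the diagonal scaling matrix
    Z = S @ XtX @ S                            # step 2: scaled cross-product matrix
    lam = np.linalg.eigvalsh(Z)                # step 3: eigenvalues of the symmetric matrix Z
    return np.sqrt(lam.max() / lam)            # step 4: one index per eigenvalue

# Hypothetical data: x2 is almost identical to x1.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

indices = condition_indices(X)
cond_number = float(indices.max())
print("condition indices:", np.round(indices, 2))
print("condition number: ", round(cond_number, 2))
```

Because Z is symmetric, a symmetric eigenvalue routine is used here, in the same spirit as EIGRS in the GAUSS-386 sample program.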

Extensive experimentation conducted by Belsley, Kuh, and Welsch (1980) suggests that condition indices in excess of 30 indicate the presence of multicollinearity; condition indices in excess of a few hundred indicate severe multicollinearity. In the sample programs, three condition indices are larger than 30 and one is greater than 100, which is consistent with the results of the auxiliary regressions—multicollinearity is moderate. Figures 40 through 42 are sample programs for determining the condition number.

NOTE: Belsley, Kuh, and Welsch (1980) present measures that describe the extent to which variances of estimated coefficients may be inflated because of the presence of multicollinearity; they also present measures to identify which regressors are most problematic. SPSS/PC+ and SAS PC have a preprogrammed option called Variance Decomposition Proportion, which helps to identify the variables that are involved in multicollinearity.

Recommended references: Belsley, Kuh, and Welsch (1980, chapter 3); Corlett (1990, 158-159); Greene (1990, 281); Johnston (1984, 249-250); Judge et al. (1984, 902, 914, 920); Kennedy (1985, 150, 153; 1992, 180, 183); Kmenta (1986, 439); Maddala (1988, 228).

Figure 40 - Sample program for determining the condition number, in GAUSS-386

/***************************************************************************
* PROGRAM: CONDNUM.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: COMPUTE REGRESSION RESULTS AND PRODUCE
* MULTICOLLINEARITY DIAGNOSTICS
***************************************************************************/

FORMAT /M2 /RD 12,4;

OUTPUT FILE = CONDNUM.OUT ON;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

Y = DATA[.,IY1];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- OLS ESTIMATION --------@

K = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @

PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";

@------- FORM SCALED VERSION OF (X'X) --------@

D = SQRT(DIAG(X'X));
S = INV(DIAGRV(EYE(K),D));
Z = S*X'X*S;

@-------- COMPUTE EIGENVALUES OF Z --------@

L = EIGRS(Z);

LMIN = MINC(L);
LMAX = MAXC(L);

CONDINDX = SQRT(LMAX./L);
COND = SQRT(LMAX/LMIN);

" ";
" ";
" CONDITION INDICES ";
" ";
CONDINDX;
" ";
" ";
" CONDITION NUMBER: C =";; COND;

"\f"; @ FORM FEED @

OUTPUT FILE = CONDNUM.OUT OFF;
SYSTEM;

Figure 41 - Sample program for determining the condition number, in SAS PC

****************************************************************************
* PROGRAM: CONDNUM.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: COMPUTE REGRESSION RESULTS AND PRODUCE
* MULTICOLLINEARITY DIAGNOSTICS
***************************************************************************;

LIBNAME CDRV 'C:DATA';

* THIS IS THE MODEL TO BE ESTIMATED;
PROC REG DATA=CDRV.DATA;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3

/ COLLIN;

RUN;

* TEST STATISTIC CALCULATION FROM OUTPUT.
* THE CONDITION INDEX IS AUTOMATICALLY CALCULATED IN SAS IF THE COLLIN
* OPTION IS USED. THREE OF THE CONDITION INDICES ARE LARGER
* THAN THE RULE-OF-THUMB CUTOFF OF 30. THE VARIABLES MOST
* RESPONSIBLE FOR THE LARGE CONDITION INDICES SEEM TO BE X1 AND X2.;

Figure 42 - Sample program for determining the condition number, in SPSS/PC+

SET MORE OFF.
SET LIS = 'CONDNUM.LIS'.
SET LOG = 'CONDNUM.LOG'.
****************************************************************************
* PROGRAM: CONDNUM.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: COMPUTE REGRESSION RESULTS AND PRODUCE
* MULTICOLLINEARITY DIAGNOSTICS
***************************************************************************.

GET FILE = 'DATA.SYS' .
* THIS IS THE MODEL TO BE ESTIMATED.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/STATISTICS = COLLIN
/DEPENDENT=Y1
/METHOD=ENTER.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* THE CONDITION INDEX FOR EACH EIGENVALUE IS AUTOMATICALLY
* CALCULATED IN SPSS IF THE COLLIN OPTION IS USED. THREE OF THE CONDITION
* INDICES ARE LARGER THAN THE RULE-OF-THUMB CUTOFF OF 30. THE VARIABLES
* MOST RESPONSIBLE FOR THE LARGE CONDITION INDICES SEEM TO BE X1 AND X2.
FINISH.

TESTING FOR STRUCTURAL CHANGE

The Chow F-Test

The Chow F-test, more commonly known as the “Chow test,” is a simple way to test if the underlying parameter values for a data set change across specified subsets of that data: across different time periods or household types, for example. The Chow test compares the RSS from a restricted model (that assumes that the parameters are constant across data subsets) with the RSS from an unrestricted model (that allows the parameters to vary across data subsets). The unrestricted RSS may be obtained by running separate regressions for the data subsets and summing the resulting RSSs or, alternatively, by running a single regression that includes a set of dummy and dummy-interaction variables that distinguish among the subsets of the data. Both methods are simple and they have identical results. Both are presented below, in GAUSS-386. In SAS PC and SPSS/PC+, only the second approach is presented. For the programs discussed here, the question of whether the data from “round 1” surveys are distinct from the data drawn from the other three rounds is investigated.

This example is slightly more complicated to program than typical examples of the Chow test because of the presence of two dummies to distinguish among the three rounds in the second data subset. In effect, distinct intercepts for all survey rounds are permitted, and this example only tests whether slope coefficients are distinct between round 1 and the other three rounds. The models used in this example are as follows:

· Round 1 model (RD2 = RD3 = RD4 = 0):


Y=b0+Xb+e

where X contains neither an intercept nor any “round” dummies.

· Rounds 2 through 4 model (RD1 = 0):


Y=b0+Xb+d3RD3+d4RD4+e,

where X is as described in the round 1 model, and RD3 and RD4 introduce intercept differentials for the third and fourth rounds.

Note that RD4 is not contained in the data set, but can be constructed from knowledge of RD1, RD2, and RD3.

· Restricted model (only intercepts allowed to vary):

Y=b0+Xb+d2RD2+d3RD3+d4RD4+e

First Approach: Estimating Separate Models for Two Data Subsets (GAUSS-386). In the first approach, the data are split into subsets and a separate model is estimated from each:

Step 1

Separate the data into two data subsets: one from the first round of the survey (RD1 = 1) and one from the other rounds (RD1 = 0).

Step 2

Run three regressions:

First: Estimate the Round 1 model for the data set for which RD1 = 1 and retain the RSS. Call it RSS1.


Second: Estimate the Rounds 2 through 4 model for the data set for which RD1 = 0 and retain the RSS. Call it RSS2.


Third: Estimate the restricted model for the full data set and retain the RSS. Call it RSSR for “restricted” RSS.

Step 3

The unrestricted RSS is RSSU = RSS1 + RSS2.

Step 4

Form the test statistic

F = ((RSSR-RSSU)/df1) / (RSSU/df2).

Here, the numerator degrees of freedom (df1) is equal to the number of restrictions, that is, the number of slope coefficients that are forced to be equal across the two models (15 in the sample programs). The denominator degrees of freedom (df2) is equal to the degrees of freedom associated with the unrestricted model: the sample size minus the total number of coefficients estimated in the unrestricted model(s), or 1,624 - 34 = 1,590 in the sample programs.
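The subset-regression approach can be sketched as follows. This is only an illustration in Python with NumPy, which is an assumption of the sketch and not part of the sample programs. The helper rss, the indicator rd1, and the simulated data are hypothetical; the first slope is made to differ between the two subsets so that the test has something to detect.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return float(e @ e)

# Hypothetical data: n cases, k slope variables; rd1 marks the first subset.
rng = np.random.default_rng(2)
n, k = 300, 3
X = rng.normal(size=(n, k))
rd1 = np.arange(n) < 100
slopes = np.where(rd1[:, None], [2.0, -2.0, 0.5], [1.0, -2.0, 0.5])
y = 1.0 + (X * slopes).sum(axis=1) + rng.normal(size=n)
const = np.ones(n)

# Steps 1 and 2: separate regressions on each subset, plus the restricted
# regression on the full data (common slopes, intercept shift for subset 1).
rss_1 = rss(np.column_stack([const[rd1], X[rd1]]), y[rd1])
rss_2 = rss(np.column_stack([const[~rd1], X[~rd1]]), y[~rd1])
rss_r = rss(np.column_stack([const, X, rd1.astype(float)]), y)

# Step 3: unrestricted RSS.
rss_u = rss_1 + rss_2

# Step 4: F-statistic.
df1 = k                    # slope coefficients forced to be equal
df2 = n - 2 * (k + 1)      # cases minus all coefficients in both subset models
F = ((rss_r - rss_u) / df1) / (rss_u / df2)
print(f"F = {F:.4f} with (df1, df2) = ({df1}, {df2})")
```

The degrees of freedom follow the same counting rule as in the text: every coefficient estimated in either subset regression is subtracted from the number of cases.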

Second Approach: Dummy Variables (GAUSS-386, SAS PC, and SPSS/PC+ programs). In the second approach, dummy variables are used:

Step 1

Let RD1 be the dummy variable that identifies the first-round survey observations. Form the matrix of interaction variables DX = RD1.*X, where .* is element-by-element multiplication of each column of X by the corresponding elements of RD1 (X has 15 columns in the sample programs).

Step 2

Estimate the unrestricted model by OLS:

y=b0+Xb+DXd+d2RD2+d3RD3+d4RD4+e

This is the unrestricted model, because the presence of the dummy interaction variables allows differential effects across subsamples for all slope coefficients.

Step 3

Estimate the restricted model by OLS:

y=b0+Xb+d2RD2+d3RD3+d4RD4+e

Comparing the restricted and unrestricted models, it is evident that the hypothesis to be tested is

H0: d = 0, and H1: d ≠ 0.

Step 4

Compute the test statistic exactly as in step 4 of the first approach.

Both approaches to the test produce an F-statistic of 1.191 with (df1, df2) = (15, 1590); hence the null hypothesis of equal slope coefficients in round 1 versus rounds 2 through 4 (no structural change) cannot be rejected.
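The equivalence of the two approaches can be checked numerically. The sketch below is written in Python with NumPy, which is an assumption of this illustration and not part of the sample programs; the helper rss, the dummy rd1, and the simulated data are hypothetical. It estimates the dummy-variable version of the unrestricted model and compares the resulting F-statistic with the one obtained from separate subset regressions.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return float(e @ e)

# Hypothetical data: n cases, k slope variables; rd1 = 1 marks the first subset.
rng = np.random.default_rng(3)
n, k = 300, 3
X = rng.normal(size=(n, k))
rd1 = (np.arange(n) < 100).astype(float)
y = 1.0 + 0.5 * rd1 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
const = np.ones(n)

# Step 1: interaction variables, one column per slope variable.
DX = rd1[:, None] * X

# Steps 2 and 3: unrestricted and restricted regressions on the full data.
rss_u = rss(np.column_stack([const, X, rd1, DX]), y)
rss_r = rss(np.column_stack([const, X, rd1]), y)

# Step 4: F-statistic for H0: all interaction coefficients are zero.
df1 = k
df2 = n - (2 * k + 2)
f_dummy = ((rss_r - rss_u) / df1) / (rss_u / df2)

# Cross-check against the subset-regression approach.
first = rd1 == 1
rss_split = (rss(np.column_stack([const[first], X[first]]), y[first])
             + rss(np.column_stack([const[~first], X[~first]]), y[~first]))
f_split = ((rss_r - rss_split) / df1) / (rss_split / df2)

print(f"F (dummy variables)    = {f_dummy:.6f}")
print(f"F (subset regressions) = {f_split:.6f}")
```

The two statistics agree up to rounding error because the dummy-variable model spans exactly the same set of fitted values as the two separate regressions.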

The Chow test is applicable to a wide variety of hypotheses; this example shows only one case. Refer to the references for additional applications. Figures 43 through 45 are sample programs for the Chow test.

Recommended references: Chow (1960, 591-605); Fomby, Hill, and Johnson (1984, 197-199); Greene (1990, 218-222); Johnston (1984, 207-225); Kennedy (1985, 87-88, 186; 1992, 98, 108-109); Kmenta (1986, 420-422); Maddala (1988, 134).

Figure 43 - Sample program for Chow test, in GAUSS-386

/************************************************************************
* PROGRAM: CHOW.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: ILLUSTRATE TWO APPROACHES TO CHOW TEST
************************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = CHOW.OUT RESET;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

Y = DATA[.,IY1];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

RD1 = DATA[.,IRD1];

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- FIRST APPROACH: RESTRICTED REGRESSION --------@
@-------- AND TWO SUBSET REGRESSIONS FOR THE --------@
@-------- UNRESTRICTED CASE --------@

@-------- RESTRICTED REGRESSION --------@

K = COLS(X);
B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSSR = E'E; @ RESTRICTED RSS @
SER = SQRT(INV(NCASE-K)*RSSR); @ STD ERR OF REGRESSION @
RSQ = 1 - RSSR/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSSR*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @

PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" RESTRICTED REGRESSION RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSSR;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
"\f"; @ FORM FEED @

@-------- REGRESSION ON FIRST-ROUND (RD1 = 1) SUBSET -------@

Y1 = SELIF(Y,RD1);
X1 = SELIF(X,RD1);
N1 = ROWS(X1);
X1 = X1[.,1:(K-3)];
K = COLS(X1);

B = INV(X1'X1)*X1'Y1; @ BETAS @
E = Y1 - X1*B; @ RESIDUALS @
RSS1 = E'E; @ UNRESTRICTED RSS1 @
SER = SQRT(INV(N1-K)*RSS1); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS1/((N1-1)*(STDC(Y1))^2); @ R-SQUARED @
COV = INV(N1-K)*RSS1*INV(X1'X1); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(N1-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" REGRESSION RESULTS FOR FIRST ROUND SUBSET (RD1 = 1) ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; N1;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS1;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";

" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K - 1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
"\f"; @ FORM FEED @

@-------- REGRESSION ON NON-FIRST-ROUND DATA --------@

Y2 = DELIF(Y,RD1);
X2 = DELIF(X,RD1);
N2 = ROWS(X2);

X2 = X2[.,1:K K+2 K+3];

NAME2 = NAMES[1:(K-1) K+1 K+2,.];

K = COLS(X2);
B = INV(X2'X2)*X2'Y2; @ BETAS @
E = Y2 - X2*B; @ RESIDUALS @
RSS2 = E'E; @ UNRESTRICTED RSS2 @
SER = SQRT(INV(N2-K)*RSS2); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS2/((N2-1)*(STDC(Y2))^2); @ R-SQUARED @
COV = INV(N2-K)*RSS2*INV(X2'X2); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(N2-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" REGRESSION RESULTS FOR NON-FIRST-ROUND SUBSET ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; N2;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS2;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K - 1;

FORMAT /M1 /RD 12,8; $NAME2[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";

RSSU = RSS1 + RSS2;
DFN = COLS(X) - 4;
DFD = NCASE - (2*DFN + 4);

F = ( (RSSR - RSSU)/DFN ) / (RSSU/DFD);

PROBF = CDFFC(F,DFN,DFD);

" ";
" ";
" ";

" RESULTS FOR SUBSET REGRESSION APPROACH";
" ";
" ";
" CHOW TEST: F =";; F;; " P-VALUE =";; PROBF;
" ";
" NUMERATOR DF =";; DFN;
" DENOMINATOR DF =";; DFD;

"\f"; @ FORM FEED @

@-------- SECOND APPROACH: RESTRICTED REGRESSION --------@
@-------- AND DUMMY-VARIABLE REGRESSION --------@
@-------- FOR UNRESTRICTED CASE --------@

K = COLS(X);

DX = RD1 .* X[.,2:(K-3)];

NAMES = NAMES

| "DX1" | "DX2" | "DX8" | "DX9" | "DX10" | "DX13" |
"DX14" | "DX15" | "DD1" | "DD2" | "DD3" | "DD5" |
"DD6" | "DD7" | "DD8";

X = X ~ DX;

K = COLS(X);

@-------- UNRESTRICTED DUMMY-VARIABLE REGRESSION --------@

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSSU = E'E; @ UNRESTRICTED RSS @
SER = SQRT(INV(NCASE-K)*RSSU); @ STD ERR OF REGRESSION @
RSQ = 1 - RSSU/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSSU*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @

PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" UNRESTRICTED DUMMY-VARIABLE REGRESSION RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSSU;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K - 1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";

DFN = (K-4)/2;
DFD = NCASE - K;

F = ( (RSSR - RSSU)/DFN ) / (RSSU/DFD);

PROBF = CDFFC(F,DFN,DFD);

" ";
" ";
" ";

" RESULTS FOR DUMMY-VARIABLE APPROACH";
" ";
" ";
" CHOW TEST: F =";; F;; " P-VALUE =";; PROBF;
" ";
" ";
" NUMERATOR DF =";; DFN;
" DENOMINATOR DF =";; DFD;
" ";
"\f"; @ FORM FEED @

OUTPUT FILE = CHOW.OUT OFF;
SYSTEM;

Figure 44 - Sample program for Chow test, in SAS PC

********************************************************************************
* PROGRAM: CHOW.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: ILLUSTRATE DUMMY-VARIABLE APPROACH TO CHOW TEST.
*******************************************************************************;

* THE NULL HYPOTHESIS BEING TESTED IS THAT THE SLOPE COEFFICIENTS ON
* THE EXPLANATORY VARIABLES ARE IDENTICAL IN ROUND 1 VERSUS ROUNDS 2-4.
* THE INTERCEPT IS ALLOWED TO VARY BY ROUND EVEN IN THE RESTRICTED MODEL.;

LIBNAME CDRV 'C:DATA';

DATA DAT2;

SET CDRV.DATA;

DX1 = RD1*X1;
DX2 = RD1*X2;
DX8 = RD1*X8;
DX9 = RD1*X9;
DX10= RD1*X10;
DX13= RD1*X13;
DX14= RD1*X14;
DX15= RD1*X15;
DD1 = RD1*D1;
DD2 = RD1*D2;
DD3 = RD1*D3;
DD5 = RD1*D5;
DD6 = RD1*D6;
DD7 = RD1*D7;
DD8 = RD1*D8;

RUN;

PROC REG DATA=DAT2;

MODEL Y1= X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3
DX1 DX2 DX8 DX9 DX10 DX13 DX14 DX15
DD1 DD2 DD3 DD5 DD6 DD7 DD8;

B1 : TEST DX1=DX2=DX8=DX9=DX10=DX13=DX14=DX15=

DD1=DD2=DD3=DD5=DD6=DD7=DD8=0;

RUN;

* THE F-TEST STATISTIC IS CALCULATED FROM THE "B1: TEST" COMMAND;
* FOR THIS EXAMPLE, F-TEST = 1.1913 (DF=15, 1590). WE CANNOT REJECT THE NULL
* HYPOTHESIS THAT THE SLOPE COEFFICIENTS ARE IDENTICAL IN THE TWO TIME PERIODS.;

Figure 45 - Sample program for Chow test, in SPSS/PC+

SET MORE = OFF.
SET LIS = 'CHOW.LIS'.
SET LOG = 'CHOW.LOG'.
****************************************************************************
* PROGRAM: CHOW.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: ILLUSTRATE DUMMY-VARIABLE APPROACH TO CHOW TEST
***************************************************************************.

* THE NULL HYPOTHESIS BEING TESTED IS THAT THE SLOPE COEFFICIENTS ON
* THE EXPLANATORY VARIABLES ARE IDENTICAL IN ROUND 1 VERSUS ROUNDS 2-4.
* THE INTERCEPT IS ALLOWED TO VARY BY ROUND EVEN IN THE RESTRICTED MODEL.

GET FILE = 'DATA.SYS' .

COMPUTE DX1 = RD1*X1.
COMPUTE DX2 = RD1*X2.
COMPUTE DX8 = RD1*X8.
COMPUTE DX9 = RD1*X9.
COMPUTE DX10= RD1*X10.
COMPUTE DX13= RD1*X13.
COMPUTE DX14= RD1*X14.
COMPUTE DX15= RD1*X15.
COMPUTE DD1 = RD1*D1.
COMPUTE DD2 = RD1*D2.
COMPUTE DD3 = RD1*D3.
COMPUTE DD5 = RD1*D5.
COMPUTE DD6 = RD1*D6.
COMPUTE DD7 = RD1*D7.
COMPUTE DD8 = RD1*D8.

*UNRESTRICTED MODEL.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
DX1 DX2 DX8 DX9 DX10 DX13 DX14 DX15
DD1 DD2 DD3 DD5 DD6 DD7 DD8
/DEPENDENT=Y1
/METHOD=ENTER.

*RESTRICTED MODEL.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8
RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER.

* TEST STATISTIC CALCULATION FROM OUTPUT.
* CALCULATE FTEST = ((RSSR-RSSU)/D)/(RSSU/(N-K)).
* WHERE RSSR IS THE RESIDUAL SUM OF SQUARES FOR THE RESTRICTED EQUATION.
* RSSU IS THE RESIDUAL SUM OF SQUARES FOR THE UNRESTRICTED EQUATION.
* N IS THE NUMBER OF CASES (FOR THIS EXAMPLE 1624).
* D IS THE NUMBER OF RESTRICTIONS (FOR THIS EXAMPLE 15).
* K IS THE NUMBER OF PARAMETERS IN THE UNRESTRICTED MODEL
* (FOR THIS EXAMPLE 34).
* FOR THIS EXAMPLE, FTEST=1.1913 (DF=15, 1590). WE CANNOT REJECT
* THE NULL HYPOTHESIS.
FINISH.

TESTING FOR NONLINEAR VARIABLES

The “linearity” assumption of the Classical Linear Regression Model refers to the assumption that the parameters enter the equation linearly. No such assumption is required concerning the manner in which the variables enter the equation. However, it is common to specify that the variables enter linearly. If this is inappropriate, then the consequences are similar to other forms of misspecification, such as the omission of relevant explanatory variables. In fact, if the Taylor theorem is used, inappropriate functional forms may be viewed as a special case of the omitted variables problem (Kmenta 1986, 449-451). Because of the similarity of the two problems, test results that indicate inappropriate functional form may actually be revealing an omitted variable problem. One test that is less susceptible to this problem is Utts’ Rainbow test.

Utts’ Rainbow Test

This test is related to the Chow test for structural stability, with the sample divided into two subsamples according to the observations’ influence (or leverage) on the regression results. If observations with high leverage displace the regression results significantly, then it may be concluded that the specification of the regression function is inadequate. The test makes use of a measure of leverage that is also used to detect influential outliers in a regression.

The model is the standard one:

y=Xb+e

The test is based on the difference in the RSS from the restricted regression (same model applies to all observations) and the RSS from the unrestricted regression (on observations that have small leverage). The null hypothesis is that this difference is zero. Keep in mind that this test assumes that the stochastic disturbance terms satisfy the classical assumptions. If they do not, then the test is not valid. Here, proceed under the assumption that the classical assumptions are satisfied.

Step 1

Perform OLS on the full data set and retain the residual sum of squares RSSR (restricted RSS).

Step 2

Compute the leverage measure for each observation in X:

h_i = x_i (X’X)^(-1) x_i’,

where x_i is the ith row of X. Sort the leverage measures into ascending order and select the half that are smallest. Identify observations in X and Y that correspond with the small leverage measures.

Step 3

Perform OLS on the subsample selected in step 2, and retain the residual sum of squares RSSU (unrestricted).

Step 4

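Calculate the statistic U:

U = ((RSSR-RSSU)/(N-N1)) / (RSSU/(N1-K)),

where N = the number of observations in the full sample, N1 = the number of observations in the low-leverage subsample, and K = the number of estimated coefficients. Under the null hypothesis, U follows an F distribution with (N-N1, N1-K) degrees of freedom.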
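The four steps of the Rainbow test can be sketched in a few lines. The illustration below uses Python with NumPy, which is an assumption of this sketch and not one of the packages covered here. The function name rainbow_u and the simulated data are hypothetical, and the statistic is the standard form of the test, with the difference in RSS divided by the number of excluded observations. Two dependent variables are generated from the same regressor: one is truly linear, the other contains a squared term that the fitted linear model omits.

```python
import numpy as np

def rainbow_u(X, y):
    """Rainbow statistic U and its degrees of freedom for the regression of y on X."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)

    # Step 1: OLS on the full data set, restricted RSS.
    e = y - X @ (XtX_inv @ X.T @ y)
    rss_r = float(e @ e)

    # Step 2: leverage h_i = x_i (X'X)^(-1) x_i' for each row; keep the smallest half.
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    keep = np.argsort(h)[: n // 2]
    n1 = len(keep)

    # Step 3: OLS on the low-leverage subsample, unrestricted RSS.
    Xs, ys = X[keep], y[keep]
    bs, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    es = ys - Xs @ bs
    rss_u = float(es @ es)

    # Step 4: the statistic U.
    u = ((rss_r - rss_u) / (n - n1)) / (rss_u / (n1 - k))
    return u, (n - n1, n1 - k)

# Hypothetical data: one regressor plus a constant.
rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
noise = 0.5 * rng.normal(size=n)
y_lin = 1.0 + 2.0 * x + noise           # linear form is adequate
y_nl = 1.0 + 2.0 * x + x**2 + noise     # linear form omits the squared term

u_lin, df_lin = rainbow_u(X, y_lin)
u_nl, df_nl = rainbow_u(X, y_nl)
print(f"linear data:    U = {u_lin:.3f}, df = {df_lin}")
print(f"nonlinear data: U = {u_nl:.3f}, df = {df_nl}")
```

With an adequate functional form U stays near 1, whereas the omitted squared term inflates the fit error among the high-leverage observations and pushes U well above 1.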

A rejection of the null hypothesis implies that the functional form is inadequate. For these sample programs, U = 1.195 (F-critical = 1, P-value = 0.0058), so that the null hypothesis is rejected. Recall, however, that this model omits X3, X6, X7, and X12, and that heteroskedasticity afflicts the disturbances. An improved test would be to include the additional variables known to be significant and to correct for heteroskedasticity before conducting the Rainbow test. Figures 46 through 48 are sample programs for Utts’ Rainbow test.

Recommended references: Kennedy (1992, 104); Kmenta (1986, 454-455); Krämer et al. (1985, 120-121); Utts (1982, 2801-2815).

Figure 46 - Sample program for Utts’ Rainbow test, in GAUSS-386

/***********************************************************************
* PROGRAM: RAINBOW.G SOFTWARE: GAUSS-386 3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: EXECUTE AND REPORT UTTS' RAINBOW TEST
* FOR ADEQUACY OF FUNCTIONAL FORM.
***********************************************************************/

FORMAT /M2 /RD 12,4;
OUTPUT FILE = RAINBOW.OUT RESET;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

Y = DATA[.,IY1];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];

NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- OLS ESTIMATION --------@

K = COLS(X);
B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @
PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
"\f";
@-------- CONSTRUCT VECTOR OF LEVERAGE MEASURES --------@
@-------- THE MATRIX H CONTAINS "OBSERVATION --------@
@-------- NUMBER" IN THE FIRST COLUMN AND THE --------@
@-------- CORRESPONDING LEVERAGE MEASURE IN THE --------@
@-------- SECOND COLUMN. --------@

N = NCASE;
I = 1;

XXI = INV(X'X);
H = ZEROS(N,2);

DO WHILE I <= N; @ LOOP OVER WHOLE SAMPLE @

Z = X[I,.];
HII = Z*XXI*Z'; @ ITH LEVERAGE MEASURE @
H[I,1] = I;
H[I,2] = HII;

I = I + 1;

ENDO; @ END OF LOOP @

@-------- SORT H BY THE MAGNITUDE OF THE LEVERAGE --------@

H = SORTC(H,2);
M = H[1:N/2,.]; @ SELECT LOWER HALF OF H @
M = SORTC(M,1); @ AND SORT BY OBSERVATION @

@-------- CHOOSE ELEMENTS OF X AND Y THAT CORRESPOND --------@
@-------- TO THE OBSERVATIONS IDENTIFIED IN M --------@

YS = Y[M[.,1],.];
XS = X[M[.,1],.];

@-------- OLS ON SUBSET OF DATA HAVING SMALL --------@
@-------- LEVERAGE VALUES --------@

NS = ROWS(XS);
KS = COLS(XS);
BS = INV(XS'XS)*XS'YS;
E = YS - XS*BS; @ RESIDUALS @
RSSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NS-KS)*RSSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSSS/((NS-1)*(STDC(YS))^2); @ R-SQUARED @
COV = INV(NS-KS)*RSSS*INV(XS'XS); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = BS ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NS-KS)); @ P-VALUES @

PRN = BS ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS FOR SUBSAMPLE WITH SMALL LEVERAGE VALUES";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NS;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= KS - 1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;

" ";

@-------- CALCULATION OF THE RAINBOW TEST STATISTIC --------@

DFN = N/2; @ NUMERATOR D.F. @
DFD = N/2 - K; @ DENOMINATOR D.F. @

U = ( (RSS - RSSS) / DFN ) / ( RSSS / DFD ) ;

PU = CDFFC(U,DFN,DFD);

" ";
" ";
" ";
" RAINBOW TEST STATISTIC: U =";; U;
" ";
" ";
" NUMERATOR D.F. =";; DFN;
" DENOMINATOR D.F. =";; DFD;
" ";
" P-VALUE =";; PU;

"\f";

OUTPUT FILE = RAINBOW.OUT OFF;
SYSTEM;

Figure 47 - Sample program for Utts’ Rainbow test, in SAS PC

**************************************************************************
* PROGRAM: RAINBOW.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* PURPOSE: EXECUTE AND REPORT UTTS' RAINBOW TEST
* FOR ADEQUACY OF FUNCTIONAL FORM
*************************************************************************;

LIBNAME CDRV 'C:\DATA';

* MODEL WITH ALL OBSERVATIONS (MODEL 1).;

PROC REG DATA=CDRV.DATA;

MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
OUTPUT OUT=HDATA H=LEV;

RUN;

* MODEL WITH HALF OF THE OBSERVATIONS (812) WHICH HAVE
* THE LEAST LEVERAGE (MODEL 2).;

PROC RANK DATA=HDATA OUT=RHDATA GROUP=2;

VAR LEV;
RANKS RLEV;

RUN;

PROC REG DATA=RHDATA;

WHERE RLEV=0;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

RUN;

* RETAIN THE RESPECTIVE RESIDUAL SUM OF SQUARES (RSS) VALUES.
* TEST STATISTIC CALCULATION FROM OUTPUT.
* HERE, THE UTTS' TEST STATISTIC, U, IS CALCULATED AS:
* [(RSS MODELR - RSS MODELU)/(1624-812)]/[RSS MODELU/(812-19)]=1.195.
* U IS DISTRIBUTED AS AN F STATISTIC WITH N/2, (N/2)-K DEGREES OF FREEDOM.
* THE NULL HYPOTHESIS IS REJECTED (F CRITICAL = 1).
* SEE TEXT FOR INTERPRETATION.;

Figure 48 - Sample program for Utts’ Rainbow test, in SPSS/PC+

SET MORE = OFF.
SET LIS = 'RAINBOW.LIS'.
SET LOG = 'RAINBOW.LOG'.
***********************************************************************
* PROGRAM: RAINBOW.SPS SOFTWARE: SPSS/PC+ 4.01
* FILENAME DESCRIPTION
* INPUTS: DATA.SYS TEST DATA SET
* PURPOSE: EXECUTE AND REPORT UTTS' RAINBOW TEST
* FOR ADEQUACY OF FUNCTIONAL FORM
**********************************************************************.

GET FILE = 'DATA.SYS' .

* MODEL WITH ALL OBSERVATIONS (RESTRICTED MODEL).
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER
/SAVE LEVER(LEV).

* THE LEV VARIABLE INDICATES THE INFLUENCE EACH OBSERVATION HAS ON THE
* COEFFICIENT ESTIMATES.
RANK LEV /NTILE (2).

* MODEL WITH HALF OF THE OBSERVATIONS (812) WHICH HAVE
* THE LEAST LEVERAGE (UNRESTRICTED MODEL).

PROCESS IF (NLEV = 1).
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8

RD1 RD2 RD3
/DEPENDENT=Y1
/METHOD=ENTER.

* RETAIN THE RESPECTIVE RESIDUAL SUM OF SQUARES (RSS) VALUES.
* TEST STATISTIC CALCULATION FROM OUTPUT.
* HERE, THE UTTS' TEST STATISTIC, U, IS CALCULATED AS:
* [(RSS MODELR - RSS MODELU)/(1624-812)]/[RSS MODELU/(812-19)]=1.195.
* U IS DISTRIBUTED AS AN F STATISTIC WITH N/2, (N/2)-K DEGREES OF FREEDOM.
* THE NULL HYPOTHESIS IS REJECTED (F CRITICAL = 1).
* SEE TEXT FOR INTERPRETATION.
FINISH.

Linear Splines

This technique is useful for approximating a curvilinear regression without specifying the mathematical form of the curvature. A linear spline is a continuous piecewise-linear function, that is, one in which the adjacent line segments meet at the interval boundaries (or “knots”). As with other models that incorporate break points, the number and location of the intervals may be difficult to specify a priori. Theoretical considerations should guide the choice, although a grid search over candidate knot locations may also be employed, as in the example below. The linear spline is most appropriate where the regression is expected to be linear but to have structural breaks at specific values of an explanatory variable. In the standard regression model, the coefficients are restricted to be equal across spline segments. The standard version of this model is

y=Xb+Zg+e

However, it is expected that the response of y to changes in Z is distinct for three distinct regions of Z. In the example at hand, y is household calorie intake per day and Z is total weekly household expenditures. X contains all of the remaining regressors. The relationship between caloric intake and total expenditures might be expected to be different for low-expenditure, medium-expenditure, and high-expenditure families, but the precise dividing lines between low, medium, and high may not be known. The spline program will help to determine this. Note that this model has two knots; it is possible to develop models that have more, but the tensions among good fit, theory, and parsimonious parameterization should be kept in mind.

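It is useful to begin by considering this model as a dummy-variable model with D1 = 1 for medium-expenditure households, zero otherwise; and D2 = 1 for high-expenditure households, zero otherwise. Allowing both the intercept and the slope on Z to shift in each region, the model is

y=Xb+Zg+D1a1+D1Zg1+D2a2+D2Zg2+e

where a1 and a2 are the intercept shifts and g1 and g2 are the slope shifts relative to low-expenditure households.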

The dummy variable model does not guarantee that the piecewise segments join at the knots. Let the first knot be at L, so that low-expenditure households have expenditure Z ≤ L. The second knot is at H, so that low- and medium-expenditure households have Z ≤ H. Then continuity at the knots is ensured if D1 is redefined to equal 1 whenever Z > L, D2 to equal 1 whenever Z > H, and the model is specified as

y=Xb+Zg+D1(Z-L)g1+D2(Z-H)g2+e
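A short Python sketch may help show why the terms D1(Z-L) and D2(Z-H) keep the segments joined. It assumes each dummy equals 1 whenever Z lies above its knot, so that each term is zero at its own knot and grows linearly beyond it. The function names and the coefficient values are hypothetical and chosen only for illustration.

```python
def hinge(z, knot):
    """D * (Z - knot), where D = 1 when Z > knot and 0 otherwise."""
    return max(z - knot, 0.0)


def spline_part(z, low, high, g, g1, g2):
    """Contribution of Z to the fitted value: Zg + D1(Z-L)g1 + D2(Z-H)g2."""
    return g * z + g1 * hinge(z, low) + g2 * hinge(z, high)


def segment_slopes(g, g1, g2):
    """Slope with respect to Z in the low, medium and high regions."""
    return g, g + g1, g + g1 + g2


# Hypothetical knots and coefficients (illustration only).
low, high = 2.45, 4.45
g, g1, g2 = 1.0, -0.4, -0.3

for knot in (low, high):
    below = spline_part(knot - 1e-9, low, high, g, g1, g2)
    above = spline_part(knot + 1e-9, low, high, g, g1, g2)
    print(knot, below, above)   # the two values agree to about 1e-9

print(segment_slopes(g, g1, g2))
```

Because each added regressor is zero at its knot, only the slope changes there: g1 and g2 measure the change in slope at L and at H, and the fitted function has no jump.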

One way to proceed is to program the computer to do a grid search over L and H, performing OLS for each (L, H) pair and checking for the pair that minimizes the RSS. These sample programs illustrate this approach. Whether the spline function leads to a significant improvement in RSS may be tested with a standard F-test (note that this is a simple application of the Chow test for structural stability). In this version of the F-test, the numerator degrees of freedom are equal to the number of knots specified and the denominator degrees of freedom are equal to the sample size less the total number of coefficients estimated in the spline function model. An alternative approach to spline modeling is given in Johnston (1984, 392-394).
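The grid search and the F statistic described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the sample programs themselves: the only regressors are an intercept and Z, the spline terms use dummies equal to 1 above each knot, OLS is solved from the normal equations by a small hand-written elimination routine, and the data are artificial with true knots at 2.0 and 4.0. All function names are hypothetical.

```python
def ols_rss(columns, y):
    """RSS from OLS of y on the regressor columns (normal equations)."""
    k, n = len(columns), len(y)
    # Augmented system [X'X | X'y].
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in columns]
         + [sum(ci[t] * y[t] for t in range(n))]
         for ci in columns]
    for p in range(k):                       # Gauss-Jordan, partial pivoting
        pivot = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[pivot] = a[pivot], a[p]
        for r in range(k):
            if r != p:
                factor = a[r][p] / a[p][p]
                a[r] = [u - factor * v for u, v in zip(a[r], a[p])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    return sum((y[t] - sum(b[j] * columns[j][t] for j in range(k))) ** 2
               for t in range(n))


def hinge(z, knot):
    """Spline regressor D*(Z - knot), with D = 1 when Z > knot."""
    return [max(zi - knot, 0.0) for zi in z]


def search_knots(z, y, lows, highs, gap):
    """Grid search over (L, H); returns best L, best H and the F statistic."""
    n = len(y)
    ones = [1.0] * n
    rss_r = ols_rss([ones, z], y)            # restricted (linear) model
    best = None
    for low in lows:
        for high in highs:
            if high < low + gap:             # keep the regions apart
                continue
            rss = ols_rss([ones, z, hinge(z, low), hinge(z, high)], y)
            if best is None or rss < best[0]:
                best = (rss, low, high)
    rss_u, low, high = best
    dfn, dfd = 2, n - 4                      # 2 knots; 4 spline-model coefficients
    f = ((rss_r - rss_u) / dfn) / (rss_u / dfd)
    return low, high, f


# Artificial data: kinks at 2.0 and 4.0 plus a small alternating disturbance.
z = [i / 10 for i in range(61)]
y = [1.0 + zi + 2.0 * max(zi - 2.0, 0.0) - 2.5 * max(zi - 4.0, 0.0)
     + 0.01 * (-1) ** i for i, zi in enumerate(z)]
print(search_knots(z, y, [1.5, 2.0, 2.5], [3.5, 4.0, 4.5], gap=0.5))
```

In this constructed example the search recovers the knots at 2.0 and 4.0, and the F statistic is very large because the linear model fits the kinked data poorly.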

The sample programs determine that the knot dividing low- and medium-expenditure households is at a log-expenditure level of approximately Z = 2.45 and that the knot dividing medium- and high-expenditure households is at a log-expenditure level of approximately Z = 4.45. The F-test (performed only in GAUSS-386) for the restricted (linear) model versus the unrestricted model (spline) yields F = 5.3889 (P-value = 0.0047), and the linear model is rejected in favor of the spline function.

NOTE: Since SPSS/PC+ for DOS does not include looping or macro capabilities (although SPSS/PC+ for Windows does allow loops), the spline program is not feasible. To accomplish the grid-search procedure, the SPSS/PC+ program would include thousands of lines, with the same batch of 15 to 20 lines repeated hundreds of times.

The spline program in SAS PC is feasible but a little clumsy. The program relies heavily on the macro facility included in SAS PC. This makes it difficult to understand. Basically, the macro feature allows the user to define his/her own procedure (in this case, SPLINE) and then run this new procedure with user-defined parameters (START1, STOP1, STOP2, INCRM, and DENOM).

In GAUSS-386, the spline program is more straightforward. Techniques used in this program are not unusual for GAUSS code; most GAUSS programmers could easily understand the program.

Notice that the sample programs (Figures 49 and 50) carry out an extensive grid search over a finely divided grid. This is not necessary: experimentation with large grid steps may enable the investigator to quickly narrow down the regions in which the knots lie; then a finer search may pinpoint them. Note also that the loops begin the grid search for the upper point (H or CUTOFF2) a specific distance above the lower point (L or CUTOFF1) to avoid overlapping regions for low- and high-expenditure households.
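The coarse-then-fine strategy can be sketched in Python as below. To keep the example short, a simple hypothetical function stands in for the RSS of the spline regression; in practice the objective would run OLS for each (L, H) pair. The function names, the step sizes and the stand-in objective are all assumptions made for illustration.

```python
def frange(start, stop, step):
    """Grid points start, start+step, ... not exceeding stop."""
    if start > stop:
        return []
    count = int((stop - start) / step + 1e-9)
    # Rounding tames the drift from repeated floating-point addition.
    return [round(start + i * step, 10) for i in range(count + 1)]


def grid_search(objective, l_min, l_max, h_min, h_max, step, gap):
    """Return (value, L, H) minimizing objective over the grid, with H >= L + gap."""
    best = None
    for low in frange(l_min, l_max, step):
        for high in frange(max(h_min, low + gap), h_max, step):
            value = objective(low, high)
            if best is None or value < best[0]:
                best = (value, low, high)
    return best


def coarse_then_fine(objective, l_min, l_max, h_max, gap, coarse, fine):
    """Coarse pass over the whole range, then a fine pass around the winner."""
    _, low, high = grid_search(objective, l_min, l_max,
                               l_min + gap, h_max, coarse, gap)
    return grid_search(objective,
                       low - coarse, low + coarse,
                       high - coarse, high + coarse,
                       fine, gap)


def stand_in_rss(low, high):
    """Hypothetical objective with its minimum at L = 2.5, H = 4.4."""
    return (low - 2.5) ** 2 + (high - 4.4) ** 2


value, low, high = coarse_then_fine(stand_in_rss, 2.20, 4.25, 5.25,
                                    gap=0.5, coarse=0.25, fine=0.05)
print(low, high)
```

The coarse pass evaluates far fewer (L, H) pairs than a fine search of the whole range would, and the fine pass is confined to one coarse step on either side of the coarse optimum. The approach can miss the global minimum if the RSS surface has several local minima between coarse grid points.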

Recommended references: Greene (1990, 248-251); Johnston (1984, 392-396); Kmenta (1986, 569); Stewart and Wallis (1981, 202-204); Suits, Mason, and Chan (1978, 132-133).

Figure 49 - Sample spline program, in GAUSS-386

/**********************************************************************
* PROGRAM: SPLINE.G SOFTWARE: GAUSS-386 V3.0
* FILENAME DESCRIPTION
* INPUTS: DATA.DAT GAUSS-386 DATA SET
* PURPOSE: USE SPLINE FUNCTION TO CHECK FOR NON-
* LINEARITY WITH RESPECT TO VARIABLE X10
**********************************************************************/

@ NOTE: RUN TIME IS ABOUT 7 MINUTES ON A 486DX2-66 @

FORMAT /M2 /RD 12,4;

OUTPUT FILE = SPLINE.OUT RESET;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

Y = DATA[.,IY1];

Z1 = DATA[.,IX10];

X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3]

~ Z1;

NAMES = NAMES[IX1 IX2 IX8 IX9 IX13 IX14 IX15

ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 IX10,.];

@-------- OLS ESTIMATION --------@

K = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSS = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS); @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @

PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

" ";
" ";
" ";
" OLS RESULTS ";
" ";
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSS;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";
"\f";

@-------- LOOPS FOR SPLINE FUNCTION --------@
@-------- L-LOOP IS OUTER LOOP (FOR LOWER KNOT AT L) --------@
@-------- H-LOOP IS INNER LOOP (FOR UPPER KNOT AT H) --------@

OUTPUT FILE = SPLINE.OUT OFF;

RSSR = RSS; @ RSS FOR ORIGINAL LINEAR MODEL @

@ THE "RESTRICTED" MODEL @

RSSMIN = RSS;

L = 2.20; @ OUTER LOOP TAKES L FROM 2.20 @
DO WHILE L <= 4.25 ; @ TO 4.25 @

H = L + 0.5; @ INNER LOOP TAKES H FROM L+0.5 @
DO WHILE H <= 5.25; @ TO 5.25 @

D1 = DUMMYDN(Z1,L,2);
D2 = DUMMYDN(Z1,H,2);

XS = X ~ D1.*(Z1 - L*ONES(NCASE,1)) ~ D2.*(Z1 - H*ONES(NCASE,1));

BS = INV(XS'XS)*XS'Y;

ES = Y - XS*BS;

RSS = ES'ES;

IF RSS < RSSMIN;

RSSMIN = RSS; @ KEEP MINIMUM RSS @
LOPT = L; @ L ASSOCIATED WITH MIN RSS @
HOPT = H; @ H ASSOCIATED WITH MIN RSS @

ENDIF;

@-------- SHOW PROGRESS OF ITERATIONS ON SCREEN --------@

FORMAT /M1 /RD 5,2; "L =";; L;; "H =";; H;;
FORMAT /M1 /RD 12,0; "RSSMIN =";; RSSMIN;; "RSSR =";; RSSR;

H = H + 0.1;

ENDO;

L = L + 0.1;

ENDO;

OUTPUT FILE = SPLINE.OUT ON;

@-------- OLS REGRESSION FOR SELECTED SPLINE FUNCTION --------@

NAMES = NAMES | "Z2" | "Z3";

D1 = DUMMYDN(Z1,LOPT,2);
D2 = DUMMYDN(Z1,HOPT,2);

Z2 = D1.*(Z1 - LOPT*ONES(NCASE,1));
Z3 = D2.*(Z1 - HOPT*ONES(NCASE,1));

X = X ~ Z2 ~ Z3;

K = COLS(X);

B = INV(X'X)*X'Y; @ BETAS @
E = Y - X*B; @ RESIDUALS @
RSSU = E'E; @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSSU); @ STD ERR OF REGRESSION @
RSQ = 1 - RSSU/((NCASE-1)*(STDC(Y))^2); @ R-SQUARED @
COV = INV(NCASE-K)*RSSU*INV(X'X); @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV)); @ STD ERRS OF BETAS @
T = B ./ SE; @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K)); @ P-VALUES @

PRN = B ~ SE ~ T ~ PT; @ FOR PRINTING @

@-------- PRINT RESULTS FOR SELECTED SPLINE FUNCTION --------@

FORMAT /M1 /RD 12,4;

" ";
" ";
" ";
" RESULTS FOR SELECTED SPLINE FUNCTION ";
" ";
" ";
" KNOTS ARE LOCATED AT:";
" ";
" L = ";; LOPT;
" H = ";; HOPT;
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES = ";; RSSU;
" R-SQUARED = ";; RSQ;
" ";
" ";
" VARIABLE COEFF STD ERROR T-RATIO P-VALUE";
" ";
" INTERCEPT ";; PRN[1,.];

I = 1;
DO WHILE I <= K -1;

FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];

I = I + 1;

ENDO;
" ";

@-------- F-TEST WHETHER (RSSR - RSSU) IS SIGNIFICANT --------@

DFN = 2; @ NUMERATOR DF = # BREAKS @

@ IN SPLINE @

DFD = NCASE - K;

F = ( (RSSR - RSSU) / DFN ) / (RSSU / DFD );
PF = CDFFC(F,DFN,DFD);

" ";
" ";
" ";
" F-TEST FOR RESTRICTING TO LINEAR MODEL: F =";; F;
" ";
" NUMERATOR DF =";; DFN;
" DENOMINATOR DF =";; DFD;
" ";
" P-VALUE =";; PF;

"\f";

OUTPUT FILE = SPLINE.OUT OFF;
SYSTEM;

Figure 50 - Sample spline program, in SAS PC

***********************************************************************
* PROGRAM: SPLINE.SAS SOFTWARE: SAS PC 6.04
* FILENAME DESCRIPTION
* INPUTS: DATA.SSD TEST DATA SET
* OUTPUTS: SPLOUT.SSD RESULTS OF REGRESSIONS
* PURPOSE: USE SPLINE FUNCTION TO CHECK FOR NON-
* LINEARITY WITH RESPECT TO VARIABLE X10
**********************************************************************;

LIBNAME CDRV 'C:\DATA';

* NONLINEARITIES ARE SUSPECTED ALONG THE DIMENSION OF THE LOG OF
* TOTAL EXPENDITURE PER CAPITA (X10). X10 WILL BE SPLIT INTO
* THREE SECTIONS.

* THE FOLLOWING PROC SUMMARY AND DATA STEPS MERGE THE MINIMUM AND
* MAXIMUM OF X10 ONTO EACH OBSERVATION IN THE ORIGINAL DATA SET.;

DATA DATAX;

SET CDRV.DATA;
CONSTANT=1;

PROC SUMMARY DATA=DATAX;

VAR X10;
ID CONSTANT;
OUTPUT OUT=MINMAX MIN=MINX10 MAX=MAXX10;

DATA SDATA;

MERGE DATAX MINMAX(DROP=_TYPE_ _FREQ_);
BY CONSTANT;

* THE FOLLOWING DATA STEP WILL CREATE A TEMPORARY BINARY DATA FILE TO STORE
* A MODEL NAME AND ROOT MEAN SQUARE ERROR (RMSE) FOR EACH REGRESSION. THIS
* STEP IS JUST CREATING A FIRST DUMMY RECORD.;

FILENAME OUTPUT 'C:\DATA\SPLINE.BIN';

DATA _NULL_;

_MODEL_ = 'DUMMY';
_RMSE_ = .;
FILE OUTPUT RECFM=N;
PUT

_MODEL_ $8.
_RMSE_ RB4. ;

* THE FOLLOWING STATEMENT BEGINS THE DEFINITION OF THE SAS MACRO.;
%MACRO SPLINE;

* START, STOP AND INCRM MUST BE INTEGERS;
* THEREFORE, THE VALUES ARE DIVIDED BY DENOM IN THE DATA SET;

%DO PNT1 = &START1 %TO &STOP1 %BY &INCRM;

%DO PNT2 = &PNT1 + &INCRM2 %TO &STOP2 %BY &INCRM;

* X10 IS THE VARIABLE ACROSS WHICH WE SUSPECT NON-LINEARITY OF THE
* REGRESSION LINE.;

DATA SPLINE;

SET SDATA (KEEP=Y1 X1 X2 X8 X9 X10 X13 X14 X15

D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 MINX10 MAXX10);

* THE FOLLOWING USES MACRO VARIABLES TO CREATE THE TWO CUTOFFS.;

CUTOFF1 = &PNT1./&DENOM.;
CUTOFF2 = &PNT2./&DENOM.;

* THE FOLLOWING CREATES Z1, Z2, Z3 AS EXPLAINED IN TEXT.;

IF (X10 LT MINX10) THEN Z1=0;
IF (X10 GE MINX10 AND X10 LT CUTOFF1) THEN

Z1=X10-MINX10;

IF (X10 GE CUTOFF1) THEN

Z1=&PNT1./&DENOM.-MINX10;

IF (X10 LT CUTOFF1) THEN Z2=0;
IF (X10 GE CUTOFF1 AND X10 LT CUTOFF2)

THEN Z2=X10-CUTOFF1;

IF (X10 GE CUTOFF2) THEN

Z2=CUTOFF2-CUTOFF1;

IF (X10 LT CUTOFF2) THEN Z3=0;
IF (X10 GE CUTOFF2 AND X10 LT MAXX10)

THEN Z3=X10-CUTOFF2;

IF (X10 GE MAXX10 ) THEN Z3=MAXX10-CUTOFF2;

* THE FOLLOWING REGRESSION SAVES THE RMSE AND A MODEL LABEL TO THE BINARY ;
* OUTPUT FILE C:\DATA\SPLINE.BIN.;
PROC REG DATA=SPLINE

OUTEST=SPLEST NOPRINT;

P&PNT1.P&PNT2.: MODEL Y1=
X1 X2 X8 X9 Z1 Z2 Z3 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;

DATA _NULL_;
SET SPLEST;
FILE OUTPUT RECFM=N MOD;
PUT

_MODEL_ $8.
_RMSE_ RB4. ;

* THE FOLLOWING PROVIDES OUTPUT TO THE SCREEN TO MONITOR THE PROGRESS OF THE
* PROGRAM.;
DATA _NULL_;

FILE 'CON';
CUTOFF1 = &PNT1./&DENOM.;
CUTOFF2 = &PNT2./&DENOM.;
PUT " CUTOFF1 = " CUTOFF1 " CUTOFF2 = " CUTOFF2;

%END;

%END;
RUN ;

%MEND SPLINE;

* THE USER MUST PROVIDE SEARCH RANGE FOR CUTOFF1 AND CUTOFF2 AND THE
* INCREMENTS USED TO DETERMINE THE PRECISION OF THE SEARCH. IN SAS, MACRO
* PARAMETERS MUST BE INTEGERS. THEREFORE, WE USE START1, STOP1, STOP2,
* INCRM, AND INCRM2 TO DEFINE PNT1 AND PNT2. THEN, WE DIVIDE THESE INTEGERS
* BY DENOM TO DERIVE CUTOFF1 AND CUTOFF2. THE VALUES OF CUTOFF1 AND CUTOFF2
* ARE IN THE SAME UNITS AS THE VARIABLE OF INTEREST (X10). PNT1 VARIES
* FROM START1 TO STOP1 INCREASING BY INCRM FOR EACH REGRESSION (THIS
* CORRESPONDS TO CUTOFF1 VARYING FROM START1/DENOM TO STOP1/DENOM). FOR
* EACH PNT1 VALUE, PNT2 RANGES FROM PNT1 + INCRM2 TO STOP2, ALSO
* INCREASING BY INCRM FOR EACH REGRESSION.
* FOR OUR EXAMPLE, CUTOFF1 RANGES FROM 2.2 TO 4.25 AT INCREMENTS OF 0.1,
* AND CUTOFF2 RANGES FROM CUTOFF1+0.5 TO 5.25 AT INCREMENTS OF 0.1.
* PRIOR TO THIS DETAILED SEARCH, AN INITIAL ROUGH SEARCH COULD BE
* CONDUCTED WITH LARGER GRID STEPS BY INCREASING THE INCRM.
* FOR INSTANCE, IF INCRM = 25, THE CUTOFFS WILL CHANGE WITH INCREMENTS OF
* 0.25. THE LARGER INCRM VALUE WILL RESULT IN A SUBSTANTIALLY REDUCED
* EXECUTION TIME.;

%LET START1 = 220;
%LET STOP1 = 425;
%LET STOP2 = 525;
%LET INCRM = 10;
%LET INCRM2 = 50;
%LET DENOM = 100;
%SPLINE;

* THE FOLLOWING TWO DATA AND PROC STATEMENTS READ IN THE RESULTS FROM EACH
* REGRESSION AND PROVIDE COMPLETE DESCRIPTIVE STATISTICS.;

DATA CDRV.SPLOUT;

INFILE OUTPUT RECFM=N;
INPUT

_MODEL_ $8.
_RMSE_ RB4. ;

PROC UNIVARIATE DATA=CDRV.SPLOUT;

VAR _RMSE_;
ID _MODEL_;

* INTERPRETING OUTPUT;
* THE MODEL WITH THE OPTIMAL CUTOFFS IS INDICATED AS THE MODEL WITH THE
* MINIMUM RMSE (ROOT MEAN SQUARE ERROR) WHICH CORRESPONDS TO THE
* MINIMUM RESIDUAL SUM OF SQUARES.

* PROC UNIVARIATE LISTING DISPLAYS THIS MINIMUM AND THE
* ACCOMPANYING MODEL LABEL (ID) UNDER THE "EXTREMES" HEADING.;
* IN THIS EXAMPLE, THE OPTIMAL CUTOFFS (OR KNOTS) ARE 2.45 AND 4.45
* FOR THE SEARCH INCREMENTS OF 0.25. WITH A SEARCH USING INCREMENTS
* OF 0.1, THE CUTOFFS ARE 2.50 AND 4.50;