
| Strengthening Policy Analysis - Econometric tests using microcomputer software + disk (IFPRI, 1995, 166 p.) |
TESTS FOR HETEROSKEDASTICITY
The major consequence of heteroskedasticity (nonconstant variance of the stochastic disturbance term) is that it causes the OLS estimate of the stochastic error variance ($\sigma^2$) to be biased, rendering hypothesis tests on coefficients invalid. Most tests for heteroskedasticity involve examining the regression residuals; the White test involves comparison of the OLS coefficient covariance matrix with a heteroskedasticity-consistent covariance matrix. The Goldfeld-Quandt, Breusch-Pagan, and White tests are described below. These tests are quite general. The White test is the most general in the sense that it requires no specification of a model of the heteroskedastic error-generating process. The Goldfeld-Quandt test requires only that the heteroskedasticity be related to one of the regressors; the Breusch-Pagan test requires that it be related to some set of regressors. If heteroskedasticity is detected, the usual practice is to specify a model by which the standard deviation of the stochastic disturbance can be estimated at each observation, then used in a weighted least-squares procedure. White's method produces an estimate of the variance-covariance matrix of coefficients that is consistent in the presence of heteroskedasticity so that tests on the OLS coefficients may be conducted. See the references for details.
The model is the usual one:
y = Xb + e.
The hypothesis to be tested is as follows:
$H_0$: $\sigma_i^2 = \sigma^2$ for all $i$ (constant variance, no heteroskedasticity);
$H_1$: $\sigma_i^2 \neq \sigma^2$ for at least one $i$ (heteroskedasticity).
Goldfeld-Quandt Test
This older test is only applicable when there is a strong a priori reason to believe that the variance of the error term is explicitly related to one of the explanatory variables, say Xk. This test comprises the following steps:
Step 1. Reorder the data by magnitude of the observations on Xk, from smallest to largest.
Step 2. Partition the ordered data set into three subsets, each of size C = N/3. Delete the middle subset, then denote the subset with small values of Xk as set 1 and the subset with large values of Xk as set 2.
Step 3. Perform OLS (using all of the regressors in X) on set 1 and set 2 separately and get the residual sum of squares (RSS) from each set.
Step 4. If set 2 has the higher RSS, the estimated variance of the residuals is positively correlated with the size of Xk. Calculate $F = RSS_2 / RSS_1$ (the sample GAUSS-386 program places the larger RSS in the numerator). Compare to standard F-table; if $F > F_{critical}$ at the desired level of significance, then reject H0 of homoskedasticity.
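The four steps can be sketched in Python with NumPy, which is not one of the three packages used in this document. This is a minimal illustration under stated assumptions: the function names `rss` and `goldfeld_quandt` and the simulated data are hypothetical, the larger RSS is placed in the numerator, and each subset is assigned C - K degrees of freedom (the usual textbook convention, which may differ from what the sample programs report).

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares from an OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

def goldfeld_quandt(y, X, k):
    """Return (F, df_num, df_den); column k of X is the suspect regressor.

    X is assumed to include the constant column already.
    """
    n, p = X.shape
    order = np.argsort(X[:, k], kind="stable")  # Step 1: sort by X_k
    y, X = y[order], X[order]
    c = n // 3                                  # Step 2: size of each third
    rss1 = rss(y[:c], X[:c])                    # Step 3: low-X_k subset
    rss2 = rss(y[-c:], X[-c:])                  #         high-X_k subset
    f = max(rss1, rss2) / min(rss1, rss2)       # Step 4: larger RSS on top
    return f, c - p, c - p

# Simulated example (hypothetical data): the error s.d. grows with x.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=300)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=300) * x
X = np.column_stack([np.ones_like(x), x])
f_stat, df1, df2 = goldfeld_quandt(y, X, k=1)
print(f"F = {f_stat:.3f} with ({df1}, {df2}) degrees of freedom")
```

The resulting F would then be compared with the critical value from an F-table at the chosen significance level.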
The GAUSS-386 program (Figure 10) produces a Goldfeld-Quandt statistic of 1.5164, with 541 numerator degrees of freedom and 542 denominator degrees of freedom. The P-value is 0.0000, indicating a strong rejection of the hypothesis of no heteroskedasticity. The SAS PC (Figure 11) and SPSS/PC+ (Figure 12) F-statistics differ slightly (F = 1.4972, although not enough to alter the conclusions) because the programs select slightly different numbers of observations for the lower and upper thirds of the data set. The sample programs for this section use the same basic model that will be used in subsequent sections. It is assumed that the error variance is monotonically related to variable X10.
NOTE: Some authors recommend using relatively large significance levels (say, 25 percent to 50 percent) for tests of heteroskedasticity such as the Goldfeld-Quandt test, because the consequences of heteroskedasticity are severe and consistent estimators are readily available.
Recommended References: Fomby, Hill, and Johnson (1984, 193-194); Goldfeld and Quandt (1965, 539-547); Greene (1990, 420); Griffiths, Hill, and Judge (1993, 498-499); Judge et al. (1984, 449); Kennedy (1985, 97; 1992, 118); Kmenta (1986, 292-294); Maddala (1988, 164).
Figure 10 - Sample program for Goldfeld-Quandt test, in GAUSS-386
/*****************************************************************
FORMAT /M2 /RD 12,4;
NAMES = GETNAME("DATA");
@------- ASSUME THAT HETEROSKEDASTICITY IS RELATED TO X10 --------@
DATA = SORTC(DATA,IX10);
ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
@-------- CHOOSE LOWER-THIRD AND UPPER-THIRD DATA SUBSETS --------@
NL = FLOOR(NCASE/3);
NU = FLOOR(2*NCASE/3) + 1;
@-------- OLS REGRESSIONS ON DATA SUBSETS --------@
K = COLS(XL);
" ";
    FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
ENDO;
" ";
"f";
K = COLS(XU);
" ";
I = 1;
    FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
" ";
" ";
@-------- CALCULATION OF G/Q TEST STATISTIC --------@
IF RSSL <= RSSU;
    F = RSSU/RSSL;
ELSE;
    F = RSSL/RSSU;
ENDIF;
PROB = CDFFC(F,NDF,DDF);
" GOLDFELD/QUANDT RESULTS ";
"f";
OUTPUT FILE = GQTEST.OUT OFF;
Figure 11 - Sample program for Goldfeld-Quandt test, in SAS PC
*************************************************************
LIBNAME CDRV 'C:DATA';
* WE SUSPECT THAT THE VARIANCE OF THE DISTURBANCE TERM IS RELATED TO X10;
* PROC RANK CREATES A NEW VARIABLE (RX10) WITH VALUES OF 0, 1, OR 2
PROC RANK DATA=CDRV.DATA OUT=DRANK GROUP=3;
  VAR X10;
RUN;
PROC REG DATA=DRANK;
  WHERE RX10 = 0;
RUN;
PROC REG DATA=DRANK;
  WHERE RX10 = 2;
RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT;
Figure 12 - Sample program for Goldfeld-Quandt test, in SPSS/PC+
SET MORE OFF.
GET FILE = 'DATA.SYS' .
* WE SUSPECT THAT THE VARIANCE OF THE DISTURBANCE TERM IS RELATED TO X10.
* RANK CREATES A NEW VARIABLE (RX10) WITH VALUES OF 1, 2, OR 3
RANK X10/NTILE (3) INTO RX10.
PROCESS IF ( RX10 = 1 ).
D7 D8 RD1 RD2 RD3
PROCESS IF ( RX10 = 3 ).
D7 D8 RD1 RD2 RD3
* TEST STATISTIC CALCULATION FROM OUTPUT.
Breusch-Pagan Test
This test assumes that the disturbance terms, ei, are normally and independently distributed. Moreover, the variances of ei are assumed to be of the form $\sigma^2 = f(Za)$, where Z is a set of p variables (these may be a subset of the X variables) thought to influence the heteroskedasticity (Z also includes a constant term) and a is a conformable vector of coefficients. This test does not depend on the functional form of f. The test evaluates whether the variables in Z have explanatory power for the variation in squared standardized residuals from the original model.
The model is the usual one:
y=Xb + e
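The Breusch-Pagan test proceeds in the following steps:
Step 1. Estimate the model by OLS and save the vector of residuals e.
Step 2. Compute $v_i = e_i^2 / (e'e/N)$, the squared residuals standardized by their mean (as in the sample GAUSS-386 program).
Step 3. Specify the variables in Z, regress v on Z, and compute the explained sum of squares (ESS, sometimes called the regression or model sum of squares).
Step 4. Calculate the statistic Q = ESS/2. Q is asymptotically chi-squared ($\chi^2$) with (p - 1) degrees of freedom.
Step 5. Compare Q to the critical $\chi^2$ value with (p - 1) degrees of freedom at the desired level of significance; if Q exceeds it, reject H0 of homoskedasticity.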
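The steps above can be sketched in Python with NumPy. This is an illustrative sketch, not a translation of the sample programs: the function name `breusch_pagan` and the simulated data are hypothetical, and both X and Z are assumed to contain a constant column.

```python
import numpy as np

def breusch_pagan(y, X, Z):
    """Return (Q, df). X and Z are assumed to include a constant column."""
    n = len(y)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                               # Step 1: OLS residuals
    v = e**2 / (e @ e / n)                      # Step 2: standardized squares
    d, *_ = np.linalg.lstsq(Z, v, rcond=None)   # Step 3: regress v on Z
    v_hat = Z @ d
    ess = float(np.sum((v_hat - v.mean())**2))  # explained sum of squares
    return ess / 2.0, Z.shape[1] - 1            # Step 4: Q and its df

# Simulated example (hypothetical data): the error s.d. grows with x.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=500)
y = 1.0 + 2.0 * x + rng.normal(size=500) * x
X = np.column_stack([np.ones_like(x), x])
q, df = breusch_pagan(y, X, Z=X)
print(f"Q = {q:.3f}, df = {df}")
```

Setting Z = X mirrors the choice made in the sample programs; Q would then be compared with the critical chi-squared value for df degrees of freedom.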
In the sample programs (Figures 13 through 15), the same model is used as before and the variables in Z are selected to be identical with those in X. The Q value is 68.1561 (P-value = 0.0000) and again, the hypothesis of no heteroskedasticity is strongly rejected.
NOTE: As with the Goldfeld-Quandt test, some writers recommend using relatively large significance levels for the Breusch-Pagan test.
Recommended references: Breusch and Pagan (1979, 1287-1294); Fomby, Hill, and Johnson (1984, 195-196); Greene (1990, 421-422); Griffiths, Hill, and Judge (1993, 498-500); Judge et al. (1984, 446-447); Kennedy (1985, 97-98, 108; 1992, 118, 130-131); Kmenta (1986, 294-295); Maddala (1988, 164).
Figure 13 - Sample program for Breusch-Pagan test, in GAUSS-386
/*****************************************************************
FORMAT /M2 /RD 12,4;
NAMES = GETNAME("DATA");
ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
K = COLS(X);
@-------- OLS REGRESSION --------@
B = INV(X'X)*X'Y;                      @ OLS BETAS @
@-------- PRINT OLS RESULTS --------@
" ";
I = 1;
    FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
"f";
@-------- CONSTRUCTION OF STANDARDIZED SQUARED RESIDUALS --------@
G = (E .^ 2)/(INV(NCASE)*E'E);
@-------- CHOOSE REGRESSORS THAT EXPLAIN HETEROSKEDASTICITY --------@
Z = X;
GHAT = Z*D;                            @ FITTED STANDARDIZED @
ESS = SUMC( (GHAT - MEANC(GHAT))^2 );  @ ESS FROM B-P REGRESSION @
@-------- PRINT B-P REGRESSION AND B-P TEST STATISTIC --------@
I = 1;
    FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
" ";
"f";
OUTPUT FILE = BPTEST.OUT OFF;
Figure 14 - Sample program for Breusch-Pagan test, in SAS PC
*************************************************************
LIBNAME CDRV 'C:DATA';
* VARIANCE OF DISTURBANCE TERM THOUGHT TO BE RELATED TO ALL
PROC REG DATA=CDRV.DATA;
  MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
DATA E2;
  SET PRED;
RUN;
PROC SUMMARY DATA=E2;
  VAR E2;
RUN;
DATA G;
  MERGE E2 MEANE2;
RUN;
PROC REG DATA=G;
  MODEL G=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT;
Figure 15 - Sample program for Breusch-Pagan test, in SPSS/PC+
SET MORE = OFF.
GET FILE = 'DATA.SYS' .
* VARIANCE OF DISTURBANCE TERM THOUGHT TO BE RELATED TO ALL
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
COMPUTE E2 = RES**2.
AGGREGATE OUTFILE='MEANE2.SYS' /BREAK=CONSTANT
JOIN MATCH FILE='E2.SYS' /TABLE='MEANE2.SYS'
COMPUTE G=(E2/MEANE2).
REGRESSION VARIABLES = G X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
* TEST STATISTIC CALCULATION FROM OUTPUT.
The White Test
The presence of heteroskedasticity makes the OLS variance-covariance matrix of coefficients inconsistent. White (1980) introduced an estimated variance-covariance matrix for the OLS coefficients that is consistent under heteroskedasticity. White also introduced a test statistic for heteroskedasticity based on the extent to which the OLS variance-covariance matrix departs from White's heteroskedasticity-consistent covariance matrix.
One great advantage of White's procedure is that it produces an estimator for the variance-covariance matrix of coefficients that is consistent in the presence of heteroskedasticity, so that tests regarding the coefficients can be conducted without having to first correct for the heteroskedasticity. However, the White test may not be as powerful as some alternative tests that use more specific information about the form of the heteroskedasticity.
White's heteroskedasticity-consistent covariance matrix and the original form of his test are clearly laid out in several references, including those listed below. The test using the full set of explanatory variables is only presented in GAUSS-386 (Figure 16). This is because it is quite time-consuming to compute White's test manually in SPSS/PC+ and SAS PC for anything but a small set of explanatory variables (see Figures 17 and 18). While SAS PC has options for computing the heteroskedasticity-consistent covariance matrix and White's test automatically (ACOV SPEC), the SPEC algorithm appears to have a bug that a patch could not completely correct.
The GAUSS-386 program has two parts: first, the heteroskedasticity-consistent covariance matrix is computed, then the test for heteroskedasticity is conducted. Note that the investigator must be vigilant to avoid introducing redundancies among the constructed regressors for this test, especially if dummy variables are present.
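The first part, the heteroskedasticity-consistent covariance matrix, can be sketched in Python with NumPy. The sketch follows the form used in the GAUSS-386 listing, INV(X'X)*HETCM*INV(X'X), where HETCM accumulates the squared residual times the outer product of each row of X. The function name `white_covariance` is hypothetical, and no finite-sample correction is applied.

```python
import numpy as np

def white_covariance(y, X):
    """Return (b, V_ols, V_white) for an OLS regression of y on X."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    e = y - X @ b
    v_ols = (e @ e / (n - k)) * xtx_inv     # conventional OLS estimate
    meat = (X * (e**2)[:, None]).T @ X      # sum over i of e_i^2 * x_i' x_i
    v_white = xtx_inv @ meat @ xtx_inv      # heteroskedasticity-consistent
    return b, v_ols, v_white

# Tiny check: a regression on a constant only, so b is the sample mean.
y = np.array([1.0, 2.0, 3.0, 6.0])
X = np.ones((4, 1))
b, v_ols, v_white = white_covariance(y, X)
print(b[0], v_ols[0, 0], v_white[0, 0])
```

The square roots of the diagonal of `v_white` play the role of the heteroskedasticity-consistent standard errors, to be compared with those from `v_ols`.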
The procedure for computing the test is as follows:
Step 1. Perform ordinary least squares on the model, save the residual vector e, and construct an N × 1 vector of squared residuals, $e^2$.
Step 2. Compute the squares and cross-products of all regressors, deleting all redundancies. The obvious redundancies are those produced by the constant term and dummy variables. Your final set of regressors should include the original variables and all nonredundant squares and cross-products.
Step 3. Regress the squared residuals, $e^2$, on the regressors from step 2, using OLS. Retain the $R^2$ from this auxiliary regression.
Step 4. Compute the test statistic, $W = N \times R^2$.
Step 5. W will be asymptotically distributed $\chi^2$, with degrees of freedom equal to the number of regressors in step 3 (not counting the constant term). If W exceeds the critical $\chi^2$ value at the desired level of significance, reject H0 of homoskedasticity.
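The procedure can be sketched in Python with NumPy. This is an illustration under stated assumptions: the function name `white_test` and the simulated data are hypothetical, and redundant columns (for example the square of a 0/1 dummy, which equals the dummy itself) are dropped by a simple rank check rather than by hand.

```python
import numpy as np
from itertools import combinations_with_replacement

def white_test(y, X):
    """Return (W, df). X holds the nonconstant regressors only (n x k)."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    e2 = (y - Xc @ b) ** 2                        # Step 1: squared residuals
    # Step 2: constant, levels, then all squares and cross-products.
    cols = [np.ones(n)] + [X[:, j] for j in range(k)]
    cols += [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(k), 2)]
    kept = []
    for col in cols:                              # drop redundant columns
        trial = np.column_stack(kept + [col])
        if np.linalg.matrix_rank(trial) == trial.shape[1]:
            kept.append(col)
    A = np.column_stack(kept)
    d, *_ = np.linalg.lstsq(A, e2, rcond=None)    # Step 3: auxiliary regression
    resid = e2 - A @ d
    r2 = 1.0 - (resid @ resid) / np.sum((e2 - e2.mean()) ** 2)
    return n * r2, A.shape[1] - 1                 # Steps 4 and 5: W and df

# Simulated example (hypothetical data): one continuous regressor, one dummy.
rng = np.random.default_rng(2)
n = 400
x1 = rng.uniform(1.0, 10.0, size=n)
dummy = (rng.uniform(size=n) < 0.5).astype(float)
y = 1.0 + 2.0 * x1 + 0.5 * dummy + rng.normal(size=n) * x1
w, df = white_test(y, np.column_stack([x1, dummy]))
print(f"W = {w:.3f}, df = {df}")
```

In this example the auxiliary regression keeps the constant, x1, the dummy, x1 squared, and the x1-dummy product, so four degrees of freedom remain once the squared dummy is dropped.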
The sample GAUSS-386 program produces the White heteroskedasticity-consistent covariance matrix. Notice that the square roots of its diagonal elements are quite different from the OLS standard errors; it is expected that a formal test of the differences will find them significant. The test statistic, W, is 192.384 (df = 183). The null hypothesis of no heteroskedasticity is rejected.
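The SAS PC and SPSS/PC+ programs for the reduced explanatory variable set produce a test statistic, W = 24.3031. The auxiliary regression for this reduced set has 14 regressors besides the constant (the four original variables, their four squares, and their six cross-products), so W has 14 degrees of freedom rather than the 120 obtained by evaluating K(K + 1)/2 at K = 15. The 5 percent critical value of $\chi^2$ with 14 degrees of freedom is 23.68, so the null hypothesis of no heteroskedasticity is again rejected.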
Recommended references: Fomby, Hill, and Johnson (1984, 196); Greene (1990, 403-404); Kennedy (1985, 98, 108; 1992, 90, 118, 130-131); Kmenta (1986, 295-296); Maddala (1988, 162); Messer and White (1984, 181-184); White (1980, 817-838).
Figure 16 - Sample program for White test, in GAUSS-386
/*********************************************************************************
FORMAT /M2 /RD 12,4;
Y = DATA[.,IY1];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
@-------- OLS ESTIMATION --------@
K = COLS(X);
B = INV(X'X)*X'Y;                      @ BETAS @
" ";
I = 1;
    FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
" ";
@-------- SQUARE THE RESIDUAL FOR EACH OBSERVATION. --------@
S = E.^2;
@-------- HETEROSKEDASTICITY-CONSISTENT COVARIANCE MATRIX --------@
HETCM = ZEROS(K,K);
I = 1;
DO WHILE I <= NCASE;
    HETCT = S[I,.].*((X[I,.])'(X[I,.]));
ENDO;
HETC = INV(X'X)*HETCM*INV(X'X);
HSE = SQRT(DIAG(HETC));
" ";
" ";
CLEAR HSE HETC HETCM HETCT PRN PT T SE OLSC RSQ SER RSS E B Y;
@-------- CONSTRUCT VARIABLES FOR WHITE-AUGMENTED --------@
X = X[.,1:(K-3)];
K = COLS(X);
AUGX = X;
I = 2;
OUTPUT FILE = WHITE.OUT OFF;
DO WHILE I <= K;
    AUGX = AUGX ~ (X[.,I] .* X[.,I:K]);
ENDO;
OUTPUT FILE = WHITE.OUT ON;
W = AUGX ~ (DATA[.,IRD1] .* X) ~ (DATA[.,IRD2] .* X)
CLEAR AUGX X DATA;
K = COLS(W);
D = INV(W'W)*W'S;
ES = S - W*D;
CLEAR W;
RSSW = ES'ES;
RSQW = 1 - RSSW/((NCASE-1)*(STDC(S))^2);
DF = K - 1;
WTEST = NCASE*RSQW;
PW = CDFCHIC(WTEST,DF);
"f";
OUTPUT FILE = WHITE.OUT OFF;
Figure 17 - Sample program for White test, in SAS PC
******************************************************************
LIBNAME CDRV 'C:DATA';
PROC REG DATA=CDRV.DATA;
  MODEL Y1=X1 X2 X9 X10;
RUN;
DATA XRDATA;
  SET RDATA;
  * EACH VARIABLE SQUARED;
  ZX1 = X1**2;
  * INTERACTION WITH X1;
  X1X2 = X1*X2;
  * INTERACTION WITH X2;
  X2X9 = X2*X9;
  * INTERACTION WITH X9;
  X9X10 = X9*X10;
RUN;
PROC REG DATA=XRDATA;
  MODEL RESSQ=X1 X2 X9 X10 ZX1 ZX2 ZX9 ZX10
RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT;
* FOR THIS EXAMPLE N=1624, K=15, R-SQ= 0.01496, AND W=24.295. K=120. THE
Figure 18 - Sample program for White test, in SPSS/PC+
SET MORE=OFF.
GET FILE = 'DATA.SYS' .
Y1 X1 X2 X9 X10
COMPUTE RESSQ = RES**2.
* EACH VARIABLE SQUARED.
* INTERACTION WITH X1.
* INTERACTION WITH X2.
* INTERACTION WITH X9.
REGRESSION VARIABLES = RESSQ X1 X2 X9 X10
* TEST STATISTIC CALCULATION FROM OUTPUT.
* THE WALD TEST STATISTIC, W, EQUALS R-SQUARED FROM THE SECOND REGRESSION
* (WHICH CONTAINS THE TRANSFORMATIONS OF X1, X2, X9, AND X10)
* MULTIPLIED BY THE NUMBER OF OBSERVATIONS USED IN THE REGRESSION. W IS
* DISTRIBUTED AS CHI-SQUARED WITH K(K+1)/2 DEGREES OF FREEDOM. IF W
* IS GREATER THAN THE CRITICAL CHI-SQUARED VALUE, THEN THE NULL HYPOTHESIS
* OF HOMOSKEDASTICITY IS REJECTED.
* FOR THIS EXAMPLE N=1624, R-SQ= 0.01496, K=15, AND W=24.295. K=120. THE
NORMALITY OF RESIDUALS: THE JARQUE-BERA TEST
If the elements of the disturbance vector are not normally distributed, the OLS estimators for b are still best linear unbiased, but the usual t-tests and F-tests are no longer appropriate, and appropriate asymptotically justified tests should be used.
The Jarque-Bera test checks whether the skewness (symmetry) and kurtosis (fatness of tails) of the distribution of residuals match the skewness and kurtosis expected under the null hypothesis that the disturbances are normally distributed. Skewness is measured by $\sqrt{b_1} = \mu_3/\mu_2^{3/2}$ and kurtosis is measured by $b_2 = \mu_4/\mu_2^2$, where estimates of the moments $\mu_r$ are given by $(1/N)\sum_i e_i^r$ (r = 2, 3, 4).
Under the null hypothesis that the disturbances are normally distributed,
b1 = 0 and b2 = 3. Thus, the null hypothesis is
H0: b1 = 0 and b2 = 3.
The alternative hypothesis is that the disturbances are not normal and belong to a class of distributions called the Pearson family. The test statistic is
$h = N[(z_1/6) + (z_2 - 3)^2/24]$,
where z1 and z2 are the estimates of b1 and b2, and N is the number of observations. h has (asymptotically) a $\chi^2$ distribution with 2 degrees of freedom. Note that h = 0 if z1 = 0 and z2 = 3.
Construction of the test proceeds by the following steps:
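Step 1. Estimate the model by OLS and save the residual vector, e.
Step 2. Calculate the sample estimates of the second, third, and fourth moments of the residuals about their mean (which is zero by construction): $\hat{\mu}_r = (1/N)\sum_{i=1}^{N} e_i^r$ for r = 2, 3, 4, where $\mu_r$ is the rth moment about the mean and the $e_i$ are the OLS residuals. Denote these as $\hat{\mu}_2$, $\hat{\mu}_3$, and $\hat{\mu}_4$.
Step 3. Calculate $z_1 = \hat{\mu}_3^2/\hat{\mu}_2^3$ and $z_2 = \hat{\mu}_4/\hat{\mu}_2^2$ (as in the sample programs).
Step 4. Calculate h and compare to the critical $\chi^2$ value at desired level of significance with two degrees of freedom. If h exceeds the critical value, reject the null hypothesis of normality.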
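The calculation can be sketched in plain Python, mirroring the moment formulas in the sample programs. The function name `jarque_bera` is hypothetical; the P-value uses the fact that the upper tail of a chi-squared distribution with two degrees of freedom is exp(-h/2).

```python
import math

def jarque_bera(resid):
    """Return (h, p_value) for a sequence of OLS residuals."""
    n = len(resid)
    mean = sum(resid) / n                  # zero for OLS residuals with a constant
    dev = [r - mean for r in resid]
    mu2 = sum(d ** 2 for d in dev) / n     # Step 2: second, third, fourth moments
    mu3 = sum(d ** 3 for d in dev) / n
    mu4 = sum(d ** 4 for d in dev) / n
    z1 = mu3 ** 2 / mu2 ** 3               # Step 3: squared skewness
    z2 = mu4 / mu2 ** 2                    #         kurtosis
    h = n * (z1 / 6 + (z2 - 3) ** 2 / 24)  # Step 4: test statistic
    p_value = math.exp(-h / 2)             # upper tail of chi-squared, 2 df
    return h, p_value

# Five symmetric values: skewness is zero, so only kurtosis contributes.
h, p = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(round(h, 4), round(p, 4))
```

With so few observations the asymptotic distribution is only a rough guide; the example is meant to show the arithmetic, not a realistic application.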
For this model, the Jarque-Bera test statistic is 274.2360 (P-value = 0.0000) and the null hypothesis of normality of disturbance terms is rejected.
Figures 19, 20, and 21 are sample programs for the Jarque-Bera test, in GAUSS-386, SAS PC, and SPSS/PC+, respectively.
NOTE: For an additional normality test, see Shapiro and Wilk (1965) and Shapiro, Wilk, and Chen (1968).
Recommended references: Bowman and Shenton (1975, 243-250); Jarque and Bera (1981); Kennedy (1992, 79); Kmenta (1986, 260-267).
Figure 19 - Sample program for Jarque-Bera test, in GAUSS-386
/***************************************************************************
FORMAT /M2 /RD 12,4;
Y = DATA[.,IY1];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
@-------- OLS ESTIMATION --------@
K = COLS(X);
B = INV(X'X)*X'Y;                      @ BETAS @
" ";
I = 1;
    I = I + 1;
ENDO;
" ";
@-------- COMPUTATION OF SECOND, THIRD, AND FOURTH MOMENTS --------@
E2 = E^2;
U2 = (SUMC(E2))/NCASE;
Z1 = (U3/(U2^(3/2)))^2;
ETA = NCASE*((Z1/6) + (((Z2-3)^2)/24));
PCHI = CDFCHIC(ETA,2);
" ";
"f";
OUTPUT FILE = JBTEST.OUT OFF;
Figure 20 - Sample program for Jarque-Bera test, in SAS PC
****************************************************************************
LIBNAME CDRV 'C:DATA';
  MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
DATA JARQUE2;
  SET JARQUE;
RUN;
PROC SUMMARY DATA=JARQUE2;
  VAR E2 E3 E4 CONST;
RUN;
DATA CALC;
  SET RESSUM;
RUN;
PROC PRINT DATA=CALC;
  VAR ETA;
RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT.
* ETA IS THE TEST STATISTIC AND IS DISTRIBUTED AS CHI-SQUARED WITH TWO
* DEGREES OF FREEDOM. IF ETA IS GREATER THAN THE CRITICAL CHI-SQUARED
* THEN REJECT THE NULL HYPOTHESIS OF NORMALLY DISTRIBUTED RESIDUALS.
* ETA IN THIS EXAMPLE IS 274.24, WHICH IS LARGER THAN THE CRITICAL CHI-
* SQUARED VALUE. NORMALITY IS REJECTED.;
Figure 21 - Sample program for Jarque-Bera test, in SPSS/PC+
SET MORE = OFF.
GET FILE = 'DATA.SYS' .
D1 D2 D3 D5 D6 D7 D8, RD1 RD2 RD3
COMPUTE E2 = RES**2.
COMPUTE E3 = RES**3.
COMPUTE E4 = RES**4.
COMPUTE CONST = 1.
AGGREGATE OUTFILE = * /BREAK=CONST
COMPUTE MU2 = SUME2/NCASE.
COMPUTE MU3 = SUME3/NCASE.
COMPUTE MU4 = SUME4/NCASE.
COMPUTE Z1 = ((MU3)/(MU2**(3/2)))**2.
COMPUTE Z2 = MU4/MU2**2.
COMPUTE ETA = NCASE*((Z1/6)+(((Z2-3)**2)/24)).
LIST ETA.
* TEST STATISTIC CALCULATION FROM OUTPUT.
* ETA IS THE TEST STATISTIC AND IS DISTRIBUTED AS CHI-SQUARED WITH TWO
* DEGREES OF FREEDOM. IF ETA IS GREATER THAN THE CRITICAL CHI-SQUARED
* THEN REJECT THE NULL HYPOTHESIS OF NORMALLY DISTRIBUTED RESIDUALS.
* ETA IN THIS EXAMPLE IS 274.24, WHICH IS LARGER THAN THE CRITICAL CHI-
* SQUARED VALUE. NORMALITY IS REJECTED.
FINISH.
ERRORS IN VARIABLES
A crucial assumption of the classical linear regression model is that the elements of the X matrix of regressors are nonstochastic. If any of the regressors are stochastic, then the problem of simultaneity bias or endogeneity may be faced. One common source of endogeneity is measurement error in the regressors.
There is little doubt that almost all observed variables are measured with error. While the emergence of extensive household surveys represents a wealth of information at the level of the household and individual, the possibility and consequences of measurement error in those data should be considered.
This discussion focuses on the simple linear regression model, that is, the model with a single regressor. The extension to the multiple regression context is straightforward and is illustrated in the sample programs (Figures 22 through 24).
Assume that
$y = bx + e$ (1)
denotes the true model and that both x and y are measured with error. Let the errors be µ and v, respectively. Assume that the errors are normally distributed, with mean zero, and with constant variances so that
$\mu \sim N(0, \sigma_\mu^2)$,
and
$v \sim N(0, \sigma_v^2)$.
Moreover, assume that v and µ are uncorrelated with each other and are uncorrelated with all elements of x. Now write
$x^* = x + \mu$
and
$y^* = y + v$,
where an asterisk denotes an observed as opposed to a true value. Rewriting equation 1 gives
$y^* - v = b(x^* - \mu) + e$
or
$y^* = bx^* + w$, (2)
where
$w = e + v - b\mu$.
If x is measured with error, then the OLS assumption, cov(w, x*) = 0, is violated because x* and w both contain µ. In fact, the covariance between the stochastic regressor, x*, and the error term is $-b\sigma_\mu^2$ (see Maddala 1988, 381 for details), and the OLS estimate of b is biased toward zero.
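The size of the bias can be made explicit. The following is a sketch for the single-regressor case, assuming the observed model is $y^* = bx^* + w$ with $\operatorname{cov}(x^*, w) = -b\sigma_\mu^2$, that the variables are measured as deviations from their means, and writing $\sigma_x^2$ (a symbol introduced here for illustration) for the variance of the true regressor:

```latex
\begin{aligned}
\operatorname{plim}\hat{b}
  &= \frac{\operatorname{cov}(x^*, y^*)}{\operatorname{var}(x^*)}
   = b + \frac{\operatorname{cov}(x^*, w)}{\operatorname{var}(x^*)} \\
  &= b - \frac{b\,\sigma_\mu^2}{\sigma_x^2 + \sigma_\mu^2}
   = b\,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_\mu^2}.
\end{aligned}
```

Because the ratio $\sigma_x^2/(\sigma_x^2 + \sigma_\mu^2)$ lies between zero and one, the OLS estimate is shrunk toward zero, and more so the larger the measurement-error variance is relative to the variance of the true regressor.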
In the multiple regression framework, the coefficient of the erroneously measured regressor is also biased toward zero. In addition, the coefficients on the remaining regressors are biased, but establishing the signs of the biases is more complicated.
The consequences of measurement error on y as opposed to x are very different. For example, if x is not measured with error, then measurement error in the dependent variable, y, is merely absorbed into the additive error term (e+ v), which does not violate any of the assumptions of the classical OLS model.
Below, two tests that examine the importance of measurement error in regressors are discussed.
The Hausman Test
The Hausman test takes advantage of the instrumental variables (IV) estimator, which (with appropriate instruments) is consistent in the presence of measurement error. Under the null hypothesis of no measurement error, the IV estimator is consistent but inefficient, while OLS is consistent and efficient. The essence of the Hausman test is to determine whether the difference between the OLS and IV estimators is statistically significant.
Now return to the multiple linear regression model,
y=Xb+ e,
and assume that the kth variable in X is measured with error. As a consequence, all elements of the OLS estimator of b are biased.
The Hausman test is implemented by first constructing an IV estimator for the model. The existence of a matrix of L additional regressors that are highly correlated with Xk but uncorrelated with e is assumed. A common method for constructing instruments is to regress the matrix X on a set of regressors Z that includes all variables in X except Xk and all L additional regressors, so that Z has (K + L - 1) regressors. The fitted value of Xk is then used as an instrument for Xk. The columns of X excluding Xk are simply replicated, but the kth column is replaced by fitted values. Call this matrix $\hat{X}$. The instrumental variables estimator is then
$b_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$.
Let $V_{IV} = (\hat{X}'\hat{X})^{-1}$. Then a consistent estimator for the asymptotic variance-covariance matrix is $\hat{\sigma}^2 V_{IV}$, where
$\hat{\sigma}^2 = e'e/(N - K)$,
with
$e = y - Xb_{IV}$.
Notice that X is used here rather than $\hat{X}$. By comparison, the OLS estimator is $b_0 = (X'X)^{-1}X'y$ and $V_0$ is defined as $(X'X)^{-1}$.
The difference between the OLS and IV estimators is defined as
$q = b_0 - b_{IV}$.
Finally, the Hausman statistic is defined:
$W = q_k \, [V_{IV} - V_0]_k^{-1} \, q_k / \hat{\sigma}^2$,
where $\hat{\sigma}^2$ may be estimated either from the OLS residuals or from the IV residuals, and where $q_k$ is the kth element of q and $[V_{IV} - V_0]_k^{-1}$ is the inverse of the kth diagonal element of $[V_{IV} - V_0]$, which is how the sample GAUSS-386 program computes it. Many presentations of this test statistic do not indicate that it is constructed with the subvectors and submatrices designated by k. As Griffiths, Hill, and Judge (1993, 476) point out, those presentations assume that Z and X have no columns in common. When they do have columns in common, then the test statistic is constructed with the subvectors and submatrices that correspond with the columns of X not also in Z, namely the kth column that has been replaced by fitted values.
W is asymptotically chi-square, with one degree of freedom. Sample values of W that exceed the selected critical value indicate significant differences between the OLS and IV estimators, hence indicate the presence of measurement error (or other source of endogeneity). Please refer to the references for cases in which more than one regressor is measured with error.
The Hausman test may be implemented in the following steps:
Step 1. Regress X on the set of instrumental variables Z and retain the fitted values: $\hat{X} = Z(Z'Z)^{-1}Z'X$.
Step 2. Regress y on the set of instruments, $\hat{X}$, to give the IV estimator, $b_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$.
Step 3. Calculate the Hausman statistic as described above.
If the Hausman statistic is statistically significant, then reject the hypothesis of no endogeneity and use the instrumental variables estimates. Otherwise, the OLS estimates are suitable.
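The three steps can be sketched in Python with NumPy for the case of a single suspect regressor. This is an illustration under stated assumptions: the function name `hausman` and the simulated data are hypothetical, the suspect regressor is placed in the last column, the error variance is estimated from the IV residuals with N - K degrees of freedom, and the statistic is formed from the last diagonal elements of the two covariance matrices, as in HAUSMAN.G.

```python
import math
import numpy as np

def hausman(y, X0, xk, Z0):
    """Return (W, p_value, b_ols, b_iv).

    X0: regressors measured without error (including the constant),
    xk: the suspect regressor, Z0: the additional instruments.
    """
    n = len(y)
    X = np.column_stack([X0, xk])          # suspect regressor placed last
    Z = np.column_stack([X0, Z0])
    k = X.shape[1]
    v_0 = np.linalg.inv(X.T @ X)
    b_ols = v_0 @ X.T @ y
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # Step 1: fitted values
    v_iv = np.linalg.inv(X_hat.T @ X_hat)
    b_iv = v_iv @ X_hat.T @ y                       # Step 2: IV estimator
    e = y - X @ b_iv                                # residuals use X, not X_hat
    s2 = e @ e / (n - k)
    q = b_ols[-1] - b_iv[-1]                        # Step 3: Hausman statistic
    w = q ** 2 / (s2 * (v_iv[-1, -1] - v_0[-1, -1]))
    p_value = math.erfc(math.sqrt(w / 2))           # chi-squared, 1 df
    return w, p_value, b_ols, b_iv

# Simulated measurement error (hypothetical data); the true slope is 2.
rng = np.random.default_rng(3)
n = 2000
z = rng.normal(size=n)                     # instrument
x_true = z + rng.normal(size=n)
x_obs = x_true + rng.normal(size=n)        # regressor observed with error
y = 1.0 + 2.0 * x_true + rng.normal(size=n)
w, p, b_ols, b_iv = hausman(y, np.ones((n, 1)), x_obs, z)
print(f"OLS slope {b_ols[-1]:.3f}, IV slope {b_iv[-1]:.3f}, W = {w:.2f}")
```

In this simulated setting the OLS slope is attenuated toward zero while the IV slope is close to the true value, so the statistic is large and the hypothesis of no measurement error is rejected.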
The Hausman-Wu Test
An alternative approach to testing for endogeneity of a single variable in X is provided by the Hausman-Wu test:
Step 1. Regress Xk on the set of instrumental variables Z and retain the first-stage residuals: $u = X_k - Z(Z'Z)^{-1}Z'X_k$.
Step 2. Add the vector of first-stage residuals to the original regression specification, $y = Xb + ug + e = Wd + e$, where $W = [X \; u]$ and $d = (b', g)'$.
Step 3. Estimate this equation by OLS and check whether the estimated coefficient on u is zero. If it is statistically significantly different from zero, then reject the hypothesis that Xk is exogenous.
Notice that the b estimators obtained here are identical to the IV estimators obtained above. Notice also that, to obtain correct IV residuals and covariance matrix, the influence of the first-stage residuals u must be omitted from the calculation of $s^2$. The correct covariance matrix is given by $s^2(W'W)^{-1}$.
Note that the classical distribution theory does not yield the result that the t-ratio on the coefficient of interest for the Wu test follows the t-distribution with the usual degrees of freedom. The t-ratio in this case is asymptotically normally distributed: a z-test (with a statement of asymptotic justification) is appropriate. If the same estimators of the error variance have been used to construct the Hausman statistic and the Hausman-Wu test, then the square of the t-ratio on the residual u identically equals the Hausman statistic.
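The equivalence between the two tests can be checked numerically. The following Python sketch (NumPy; the function name `hausman_wu` and the simulated data are hypothetical) runs the augmented regression, computes the error variance from residuals that omit the effect of u, and also forms the Hausman statistic with the same variance estimate for comparison.

```python
import numpy as np

def hausman_wu(y, X0, xk, Z0):
    """Return (d, t_u, w, b_iv): augmented coefficients, t-ratio on u,
    Hausman statistic, and IV coefficients."""
    n = len(y)
    X = np.column_stack([X0, xk])
    Z = np.column_stack([X0, Z0])
    k = X.shape[1]
    # Step 1: first-stage residuals of X_k on Z.
    u = xk - Z @ np.linalg.solve(Z.T @ Z, Z.T @ xk)
    # Steps 2 and 3: OLS on the augmented specification [X u].
    Wm = np.column_stack([X, u])
    wtw_inv = np.linalg.inv(Wm.T @ Wm)
    d = wtw_inv @ Wm.T @ y
    e = y - X @ d[:k]                      # omit the effect of u
    s2 = e @ e / (n - k)
    t_u = d[-1] / np.sqrt(s2 * wtw_inv[-1, -1])
    # Hausman statistic built with the same s2, for comparison.
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    v_iv = np.linalg.inv(X_hat.T @ X_hat)
    v_0 = np.linalg.inv(X.T @ X)
    b_ols = v_0 @ X.T @ y
    b_iv = v_iv @ X_hat.T @ y
    w = (b_ols[-1] - b_iv[-1]) ** 2 / (s2 * (v_iv[-1, -1] - v_0[-1, -1]))
    return d, t_u, w, b_iv

# Simulated measurement error (hypothetical data).
rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=n)
x_true = z + rng.normal(size=n)
x_obs = x_true + rng.normal(size=n)
y = 1.0 + 2.0 * x_true + rng.normal(size=n)
d, t_u, w, b_iv = hausman_wu(y, np.ones((n, 1)), x_obs, z)
print(f"t on u = {t_u:.4f}, t squared = {t_u**2:.4f}, Hausman W = {w:.4f}")
```

The first K elements of `d` should coincide with the IV estimates, and the squared t-ratio on u should match W up to rounding error.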
Note that using SAS PC or SPSS/PC+ to perform a manual two-stage IV or Hausman-Wu estimation does not automatically produce the correct variance estimator.
In GAUSS-386, two sample programs, HAUSMAN.G and HAUSMNWU.G (Figure 22), illustrate the procedures described above. In both cases, the programs test whether variable X10 is correlated with the stochastic disturbance terms. For SAS PC and SPSS/PC+, it is simpler to use the procedures as indicated in the sample programs, HAUSMNWU.SAS and HAUSMNWU.SPS. Notice that the coefficient estimates, standard errors, and t-ratios are identical for both types of programs (t = 1.6238) and that the Hausman statistic is equal to the square of the t-ratio on the residual u of the Hausman-Wu technique (W = 2.637). The null hypothesis of no endogeneity of X10 cannot be rejected at the 5 percent level.
Recommended references: Berndt (1991, 379-380); Greene (1990, 303); Griffiths, Hill, and Judge (1993, 458-476); Hausman (1978); Kennedy (1985, 71, 80, 119, 138, 187; 1992, 135, 148, 169-170); Kmenta (1986, 365); Maddala (1988, 435-441).
Figure 22 - Sample programs for Hausman test and Hausman-Wu test, in GAUSS-386
Figure 22a - HAUSMAN.G program
/*****************************************************************
FORMAT /M2 /RD 12,4;
XO = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMESX = NAMES[IX1 IX2 IX8 IX9 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
X10 = DATA[.,IX10];
ZO = DATA[.,IX4 IX5 IX6 IX7 IX11 IX12 ];
NAMESZ = NAMES[IX4 IX5 IX6 IX7 IX11 IX12,.];
@-------- OLS ESTIMATION ---------@
X = XO ~ X10;
K = COLS(X);
B = INV(X'X)*X'Y;                      @ BETAS @
BOLS = B;
" ";
I = 1;
    FORMAT /M1 /RD 12,8; $NAMESX[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
"F";
@-------- INSTRUMENTAL VARIABLES ESTIMATION -------- @
Z = XO ~ ZO;                           @ NOTE THAT Z HAS ZO @
                                       @ AND ALL X EXCEPT X10 @
K = COLS(X);
PZX = INV(X'Z*INV(Z'Z)*Z'X);           @ X,Z PROJECTION INV @
" ";
" NUMBER OF OBSERVATIONS = ";; NCASE;
I = 1;
    FORMAT /M1 /RD 12,8; $NAMESX[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
@-------- CALCULATION OF HAUSMAN TEST STATISTIC --------@
XXI = INV(X'X);
Q = BOLS[K,.] - BIV[K,.];
V = SIV*( PZX[K,K] - XXI[K,K] );
W = Q'INV(V)*Q;
DF = 1;
PW = CDFCHIC(W,DF);
OUTPUT FILE = HAUSMAN.OUT OFF;
Figure 22b - HAUSMNWU.G program
/**********************************************************************
FORMAT /M2 /RD 12,4;
NAMES = GETNAME("DATA");
Y = DATA[.,IY1];
XO = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMESXO = NAMES[IX1 IX2 IX8 IX9 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
X10 = DATA[.,IX10];
ZO = DATA[.,IX4 IX5 IX6 IX7 IX11 IX12 ];
NAMESZO = NAMES[IX4 IX5 IX6 IX7 IX11 IX12,.];
@-------- OLS ESTIMATION ---------@
X = XO ~ X10;
B = INV(X'X)*X'Y;                      @ BETAS @
BOLS = B;
" ";
I = 1;
    FORMAT /M1 /RD 12,8; $NAMESX[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
"f";
@-------- TWO-STAGE LEAST-SQUARES CALCULATION OF --------@
@-------- FIRST STAGE ---------@
Z = XO ~ ZO;                           @ NOTE THAT Z HAS ZO @
                                       @ AND ALL X EXCEPT X10 @
K = COLS(Z);
NAMESZ = NAMESXO | NAMESZO;
G = INV(Z'Z)*Z'X10;                    @ OLS OF X10 ON Z @
" ";                                   @ FIRST STAGE. @
" NUMBER OF OBSERVATIONS = ";; NCASE;
I = 1;
    FORMAT /M1 /RD 12,8; $NAMESZ[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
"f";
@-------- SECOND STAGE ESTIMATION --------@
XH = XO ~ X10FIT;
K = COLS(XH);
B = INV(XH'XH)*XH'Y;                   @ IV BETAS @
                                       @ USE X NOT XH! @
RSS = E'E;                             @ RESIDUAL SUM OF SQ @
PRN = B ~ SE ~ T ~ PT;                 @ FOR PRINTING @
BIV = B;
" ";
I = 1;
    FORMAT /M1 /RD 12,8; $NAMESX[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
"f";
@-------- THE WU TEST --------@
XW = XO ~ X10 ~ U;
B = INV(XW'XW)*XW'Y;                   @ WU BETAS @
K = COLS(XW) - 1;
E = Y - XW[.,1:K]*B[1:K,.];            @ RESIDUALS @
                                       @ OMIT EFFECT OF U @
RSS = E'E;                             @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS);          @ STD ERR OF REGRESSION @
PRN = B ~ SE ~ T ~ PT;                 @ FOR PRINTING @
BIV = B;
" ";
I = 1;
    FORMAT /M1 /RD 12,8; $NAMESW[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
" ";
"f";
OUTPUT FILE = HAUSMNWU.OUT OFF;
Figure 23 - Sample program for Hausman-Wu test, in SAS PC
***********************************************************************
* HAUSMAN TEST WHERE VARIABLE X10 IS SUSPECTED OF BEING ENDOGENOUS IN THE
* VARIABLES X4, X5, X6, X7, X11, AND X12 ARE USED AS
* STEP 1: REGRESS X10 AGAINST EXOGENOUS EXPLANATORY VARIABLES (X1, X2, X8, X9,
PROC REG DATA=CDRV.DATA;
  MODEL X10=X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 X4 X5 X6 X7 X11 X12;
  OUTPUT OUT=HDATA1 R = RX10;
RUN;
* STEP 2: RUN ORIGINAL REGRESSION MODEL WITH BOTH X10 AND RX10 AS EXPLANATORY
PROC REG DATA=HDATA1 OUTEST=HBETA;
  MODEL Y1=X1 X2 X8 X9 X10 RX10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
* NEED TO ADD CONSTANT TO BOTH DATA SETS TO MERGE BY;
DATA HDATA3;
  SET HDATA2;
DATA HBETA2;
  SET HBETA(RENAME=(X1=CX1 X2=CX2 X8=CX8 X9=CX9 X10=CX10 RX10=CRX10 X13=CX13 X14=CX14 X15=CX15 D1=CD1 D2=CD2 D3=CD3
  CONSTANT = 1;
* STEP 3: STEP 2 PRODUCES THE CORRECT INSTRUMENTAL VARIABLE (IV) ESTIMATES, BUT
DATA HWRESOK;
  MERGE HDATA3 HBETA2;
  BY CONSTANT;
  RESOK = Y1 - (INTERCEP + CX1*X1 + CX2*X2 + CX8*X8 + CX9*X9 + CX10*X10 + CX13*X13 + CX14*X14 + CX15*X15 +
  RESOKSQ = RESOK ** 2;
PROC SUMMARY DATA=HWRESOK;
  VAR RESOKSQ RY1SQ CONSTANT;
PROC PRINT DATA=SUMRES;
DATA RESULTS;
  SET SUMRES;
PROC PRINT DATA=RESULTS;
  VAR CORFACT;
* STEP 4: MULTIPLY T'S FROM STEP 2 BY CORFACT TO GET APPROPRIATE T'S.;
* TEST STATISTIC CALCULATION FROM OUTPUT.
Figure 24 - Sample program for Hausman-Wu test, in SPSS/PC+
|
SET MORE OFF. * HAUSMAN TEST WHERE VARIABLE X10 IS SUSPECTED OF BEING
ENDOGENOUS IN THE * VARIABLES X4, X5, X6, X7, X11, AND X12 ARE USED AS * STEP 1: REGRESS X10 AGAINST EXOGENOUS EXPLANATORY VARIABLES
(X1, X2, X8, X9, REGRESSION VARIABLES = X10 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5
D6 D7 D8 RD1 RD2 RD3 * STEP 2: RUN ORIGINAL REGRESSION MODEL WITH BOTH X10 AND RX10
AS EXPLANATORY REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 RX10 X13 X14 X15 D1 D2
D3 D5 D6 D7 D8 RD1 RD2 RD3 SAVE OUT='HDATA.SYS'. *************************************************************************************************** GET FILE = 'HDATA.SYS'. COMPUTE RESOK = Y1-(X1 * 35.595780 + X2 * 42.396332
+ X8 * -.133989 + X9 * -26.758650 + COMPUTE RESOKSQ = RESOK ** 2. AGGREGATE
OUTFILE=* /BREAK=CONSTANT COMPUTE SIGMAOK = SRESOKSQ / (COUNT - 19). FORMATS ALL (F9.5). * STEP 4: MULTIPLY T'S FROM STEP 2 BY CORFACT TO GET APPROPRIATE T'S. * TEST STATISTIC CALCULATION FROM OUTPUT. FINISH. |
The Levi Bounds (for Assessing the Presence of Measurement Error)
The Levi bounds may be calculated to indicate the presence of measurement error. It is well known that if only one regressor is measured with error, the OLS coefficient of that regressor is biased toward zero. If the roles of this regressor and the dependent variable are reversed in the regression, the coefficient on the artificial regressor is an estimator of the inverse of the coefficient on the original regressor. This estimator is also biased toward zero, but its inverse is biased away from zero. If the coefficient on the original regressor is taken as a lower bound for a consistent estimator and the inverse of the coefficient on the artificial regressor is taken as an upper bound for a consistent estimator, then it is expected that the size of this interval reflects the severity of the measurement error problem.
Levi's procedure is very simple to execute, but no formal statistical test is performed. Whether the interval between the lower and upper bounds is large is a matter of judgment for the investigator. The steps below are presented in terms of a simple regression model; extension to the multiple regression model is straightforward.
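|
Step 1 |
Estimate the regression, $Y_i = a_1 + b_1 X_i + e_{1i}$, and get $\hat{b}_1$. |
|
Step 2 |
Run the reverse regression, $X_i = a_2 + b_2 Y_i + e_{2i}$, and get $\hat{b}_2$. Now, examine the interval $[\hat{b}_1,\ 1/\hat{b}_2]$. |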
As Kmenta notes, if this interval is small, the effect of measurement error is likely to be bearable and OLS results are unlikely to be severely biased. Note that the above discussion assumes that b1 is positive. If b1 is negative, then the lower and upper bounds are reversed.
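The two steps can be sketched in a few lines. The example below is a minimal illustration in Python with NumPy rather than in the packages used in this manual; the function names `ols_slope` and `levi_bounds` and the simulated data are assumptions made for the example and do not come from the sample programs.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

def levi_bounds(x, y):
    """Return (lower, upper) bounds for the slope of y on x."""
    b1 = ols_slope(x, y)          # Step 1: direct regression of y on x
    b2 = ols_slope(y, x)          # Step 2: reverse regression of x on y
    # Ordering the pair covers both the positive and the negative slope case.
    lower, upper = sorted([b1, 1.0 / b2])
    return lower, upper

# Simulated data (an assumption for illustration): true slope 2, noisy regressor.
rng = np.random.default_rng(0)
n = 2000
x_true = rng.normal(size=n)
y = 1.0 + 2.0 * x_true + rng.normal(size=n)
x_obs = x_true + rng.normal(size=n)   # x observed with measurement error

lower, upper = levi_bounds(x_obs, y)
print(f"lower bound = {lower:.3f}, upper bound = {upper:.3f}")
```

In this simulated setting the direct slope is pulled toward zero and the inverse of the reverse slope is pushed away from zero, so the true slope should lie inside the printed interval.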
The sample programs (Figures 25 through 27) treat variable X10 as possibly susceptible to measurement error. The results are striking: $\hat{b}_1$ = 217 and $1/\hat{b}_2$ = 5,732 (5,747 in SAS PC and SPSS/PC+, due to rounding). This appears to be a very large interval, particularly in view of the statistical significance of this regressor. It is concluded that measurement error is a problem for X10. These estimated coefficients translate into calorie-income elasticities of 0.1 and 2.0, respectively. From an economic viewpoint, this is a very large interval.
Recommended references: Kmenta (1986, 346-366); Levi (1977).
Figure 25 - Sample program for Levi bounds test, in GAUSS-386
|
/************************************************************* FORMAT /M2 /RD 12,4; NAMES = GETNAME("DATA"); X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3]; NAME1 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.]; @-------- OLS ESTIMATION --------@ K = COLS(X); B = INV(X'X)*X'Y; @ BETAS @ B1 = B[6,1]; @ COEFF OF INTEREST @ " "; I = 1; FORMAT /M1 /RD 12,8; $NAME1[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " "; "f"; @-------- OLS ESTIMATION --------@ X10 = DATA[.,IX10]; X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IY1 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3]; NAME2 = NAMES[IX1 IX2 IX8 IX9 IY1 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.]; K = COLS(X); B2 = B[6,.]; @ COEFF OF INTEREST @ " "; I = 1; FORMAT /M1 /RD 12,8; $NAME2[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " "; " "; " "; " BOUNDS FOR THE COEFFICIENT ON X10"; " "; " LOWER BOUND: B =";; B1; " "; B2 = 1/B2; " UPPER BOUND: B =";; B2; "f"; OUTPUT FILE = LEVI.OUT OFF; |
Figure 26 - Sample program for Levi bounds test, in SAS PC
|
*************************************************************
* X10 IS THE VARIABLE WE SUSPECT IS MEASURED WITH ERROR.
PROC REG DATA=CDRV.DATA;
MODEL Y1=X10 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
* STEP 2: REVERSE Y1 AND X10 AND RUN THE MODEL IN OLS.;
PROC REG DATA=CDRV.DATA;
MODEL X10=Y1 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
* INTERPRETATION OF OUTPUT.
* FOR THIS EXAMPLE, THE LEVI BOUNDS ON X10'S OLS ESTIMATE ARE 216.97 AND |
Figure 27 - Sample program for Levi bounds test, in SPSS/PC+
|
SET MORE OFF.
GET FILE = 'DATA.SYS'.
* X10 IS THE VARIABLE WE SUSPECT IS MEASURED WITH ERROR.
REGRESSION VARIABLES = Y1 X10 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
 /DEPENDENT = Y1
* STEP 2: REVERSE Y1 AND X10 AND RUN THE MODEL IN OLS.
REGRESSION VARIABLES = Y1 X10 X1 X2 X8 X9 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
 /DEPENDENT = X10
* INTERPRETATION OF OUTPUT. |
TESTS FOR NONNESTED HYPOTHESES
This class of tests is used to test the validity of one model for explaining y versus another model for explaining y when neither model can be obtained by imposing linear restrictions on the other model. These model validity tests are popular because they allow all competing models to be rejected if all are deficient (unlike model selection methods, such as high R² criteria, backward elimination, or stepwise regression, in which one model will always be chosen).
The following models are nonnested models, because Z is not a subset of W, nor is W a subset of Z:
y=Xb+Zg+e1 (2)
y=Xb+Wd+e2 (3)
In these competing models that explain y, the explanatory variables are contained in X, Z, and W, which are of the dimension N × K1, N × K2, and N × K3 respectively. The coefficient vectors are conformable. It is important to note that tests of these models all assume that the stochastic disturbance terms satisfy the classical assumptions.
Two popular tests for nonnested models, the nonnested F-test and the nonnested J-test, are explained below.
Nonnested F-Test
The strategy of this test is to artificially nest the two competing models in a more general model and then to test whether the restrictions that produce either original model (or both) are valid.
|
Step 1 |
Form the general model: y=Xb+Zg+Wd+e (4) |
|
Step 2 |
Estimate the general model (4) using OLS. |
|
Step 3 |
Use F-tests for incremental explanatory power to test the following three null hypotheses: H0: g = 0; H0: d = 0; and H0: g = d = 0. Note that the last hypothesis cannot be addressed using the separate F-tests for the coefficients on Z and W: it requires an F-test for the joint incremental explanatory power of Z and W. |
|
Step 4 |
If the estimates of g or d are not significantly different from zero, the model that includes the corresponding set of variables is rejected. If both sets of coefficients are significantly different from zero, then the general model (4) is preferred; if neither is significantly different from zero, then the restricted model, y=Xb+e, (5) may be adequate. |
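The three F-statistics in step 3 all have the same form: the drop in the residual sum of squares (RSS) per restriction, divided by the unrestricted RSS per degree of freedom. The sketch below, written in Python with NumPy for illustration, computes them on simulated data in which only Z matters. The helper names `rss` and `incremental_f` and the data are assumptions for this example; P-values would additionally require an F-distribution routine, which is omitted here.

```python
import numpy as np

def rss(y, *blocks):
    """RSS and number of columns from OLS of y on the stacked regressor blocks."""
    X = np.column_stack(blocks)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid), X.shape[1]

def incremental_f(y, restricted, unrestricted):
    """F-statistic for the restrictions that reduce `unrestricted` to `restricted`."""
    rss_r, k_r = rss(y, *restricted)
    rss_u, k_u = rss(y, *unrestricted)
    n = len(y)
    return ((rss_r - rss_u) / (k_u - k_r)) / (rss_u / (n - k_u))

# Simulated data (an assumption for illustration): y depends on X and Z, not on W.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 2))
W = rng.normal(size=(n, 2))
y = X @ np.array([1.0, 0.5]) + Z @ np.array([0.8, -0.6]) + rng.normal(size=n)

general = [X, Z, W]                        # general model (4)
f_g = incremental_f(y, [X, W], general)    # H0: g = 0 (Z adds nothing)
f_d = incremental_f(y, [X, Z], general)    # H0: d = 0 (W adds nothing)
f_gd = incremental_f(y, [X], general)      # H0: g = d = 0 (neither adds anything)
print(f"F(g=0) = {f_g:.2f}, F(d=0) = {f_d:.2f}, F(g=d=0) = {f_gd:.2f}")
```

Each statistic is compared with an F critical value whose numerator degrees of freedom equal the number of excluded columns and whose denominator degrees of freedom equal N minus the number of columns in the general model.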
In the sample programs (Figures 28 through 30), X is taken to include a constant and variables X1, X2, X8, X9, X10, X13, X14, X15, D1, D2, D3, D5, D6, D7, D8, RD1, RD2, and RD3. Then Z = [X3, X7] and W = [X6, X12].
For the sample data set, the F-statistic for the hypothesis that g = 0 is 2.5261 (P-value = 0.0803): at the 10 percent significance level, the variables Z should be retained in the model. The F-statistic for the hypothesis that d = 0 is 1.9970 (P-value = 0.1361): the variables W have significant explanatory power only at a significance level of, say, 15 percent. Investigators who prefer smaller significance levels, say 10 percent or 5 percent, would fail to reject this null hypothesis and would choose model 2 over model 3 at this point (that is, include Z but not W). Finally, the F-statistic for the hypothesis g = d = 0 is 2.2438 (P-value = 0.0622), so, at the 7 percent significance level, it is concluded that Z and W are jointly significant. On this evidence, the completely unrestricted model is most appropriate.
This test and several others in this manual are F-tests for linear restrictions on coefficients. Good general expositions of F-tests are given in Greene (1990, Chapter 7) and Kmenta (1986, Section 10-2). See Testing for Structural Change in this manual for a fuller exposition of an F-test.
Recommended references: Davidson and MacKinnon (1981, 781-793); Greene (1990, 231-234); Kennedy (1985, 70, 79-80, 85-87; 1992, 81, 87-88); Kmenta (1986, 595-600); MacKinnon (1983, 85-158); Maddala (1988, 443-446); McAleer and Pesaran (1986, 217-371).
Figure 28 - Sample program for nonnested F-test, in GAUSS-386
|
/**************************************************************** OUTPUT FILE = NNESTF.OUT RESET; @-------- SELECT VARIABLES THAT WILL BE USED --------@ Y1 = DATA[.,IY1]; ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3]; Z = DATA[.,IX3 IX7]; W = DATA[.,IX6 IX12]; @-------- SELECT VARIABLE NAMES CORRESPONDING WITH VARIABLES
--------@ NAMESU = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 NAMES1 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 NAMES2 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 NAMES3 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.]; @ -------- MODEL U ----------@ @ -------- UNRESTRICTED MODEL THAT INCLUDES BOTH Z AND W ----------@ X = X0 ~ Z ~ W; K0 = COLS(X); B = INV(X'X)*X'Y1; @ OLS ESTIMATION @ @ --------- PRINT RESULTS -----------@ " OLS RESULTS FOR UNRESTRICTED MODEL "; I = 1; FORMAT /M1 /RD 12,8; $NAMESU[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " "; "f"; @ --------- MODEL 1 -----------@ @ --------- RESTRICTED MODEL THAT EXCLUDES W -----------@ X = X0 ~ Z; B = INV(X'X)*X'Y1; @ OLS ESTIMATION @ @ --------- PRINT RESULTS -----------@ " OLS RESULTS FOR RESTRICTED MODEL THAT EXCLUDES W"; I = 1; FORMAT /M1 /RD 12,8; $NAMES1[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " "; @ --------- MODEL 2 -----------@ X = X0 ~ W; B = INV(X'X)*X'Y1; @ OLS ESTIMATION @ @ --------- PRINT RESULTS -----------@ " OLS RESULTS FOR RESTRICTED MODEL THAT EXCLUDES Z"; I = 1; FORMAT /M1 /RD 12,8; $NAMES2[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " "; @ --------- MODEL 3 -----------@ X = X0; K3 = COLS(X); B = INV(X'X)*X'Y1; @ OLS ESTIMATION @ E = Y1 - X*B; @ RESIDUALS @ @ --------- PRINT RESULTS -----------@ " OLS RESULTS FOR RESTRICTED MODEL THAT EXCLUDES Z AND W"; I = 1; FORMAT /M1 /RD 12,8; $NAMES3[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " "; @--------- F-TESTS FOR INCREMENTAL EXPLANATORY POWER ----------@ F1 = ( (RSSR1 - RSSU)/(K0 - K1) ) / ( RSSU/(NCASE - K0)
); F2 = ( (RSSR2 - RSSU)/(K0 - K2) ) / ( RSSU/(NCASE - K0)
); F3 = ( (RSSR3 - RSSU)/(K0 - K3) ) / ( RSSU/(NCASE - K0)
); " F-TESTS FOR INCREMENTAL EXPLANATORY POWER "; " "; "f"; OUTPUT FILE = NNESTF.OUT OFF; |
Figure 29 - Sample program for nonnested F-test, in SAS PC
|
**************************************************************
LIBNAME CDRV 'C:DATA';
* ALL VARIABLES EXCEPT X3, X7, X6 AND X12 ARE COMMON TO ALL
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 X3 X7 X6 X12;
B1 : TEST X6=X12=0;
RUN;
* THE 'TEST' COMMANDS PRODUCE THE 3 F-STATISTICS DESCRIBED IN THE TEXT; |
Figure 30 - Sample program for nonnested F-test, in SPSS/PC+
|
SET MORE OFF.
GET FILE = 'DATA.SYS'.
* ALL VARIABLES EXCEPT X3, X7, X6 AND X12 ARE COMMON TO ALL MODELS.
* STEP 1: ESTIMATE SPECIFICATION 1.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3
 D5 D6 D7 D8 RD1 RD2 RD3 X3 X7
* STEP 2: ESTIMATE SPECIFICATION 2.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3
 D5 D6 D7 D8 RD1 RD2 RD3 X6 X12
* STEP 3: ESTIMATE SPECIFICATION 3 (COMPLETELY RESTRICTED MODEL).
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3
 D5 D6 D7 D8 RD1 RD2 RD3
* STEP 4: ESTIMATE SPECIFICATION 4 (COMPLETELY UNRESTRICTED MODEL).
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3
 D5 D6 D7 D8 RD1 RD2 RD3 X3 X7 X6 X12
* TEST STATISTIC CALCULATION FROM OUTPUT. |
Nonnested J-Test
The J-test, developed by R. Davidson and J. G. MacKinnon, can be used to test whether one of two models having different (but possibly overlapping) sets of regressors has greater explanatory power than the other. Once again, it is assumed that the stochastic disturbance terms satisfy the classical assumptions. Let the competing models be
y=Xb+e1, and (6)
y=Zd+e2 (7)
The J-test proceeds in the following steps:
|
Step 1 |
Estimate the second equation by OLS and calculate the fitted values of y, $\hat{y}_Z = Z\hat{d}$. |
|
Step 2 |
Specify the augmented regression model, $y = Xb + l\hat{y}_Z + e$, where l is a scalar coefficient. Estimate this augmented model by OLS. If some of the explanatory variables in Z have significant explanatory power for y that is not captured by the regressors in X, then the estimate for l will be statistically significant. |
|
Step 3 |
The standard t-ratio produced by statistical packages is asymptotically distributed as standard normal and may be compared to standard normal critical values to test the following hypothesis (see Greene 1990, 231-233): H0: l = 0 versus H1: l ≠ 0. If H0 is rejected in favor of H1, then the second model has some explanatory power that is lacking in the first model. |
|
Step 4 |
Reverse the roles of the two models and repeat the exercise. |
Note that it is possible for the null hypothesis to be rejected in both cases. If so, each model explains some variation that the other fails to explain, and the investigator may consider an augmented model that includes regressors from both X and Z. If the null hypothesis is rejected in neither case, then neither model is preferred on the basis of this test, and the investigator must use economic theory and/or other statistical results to choose.
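The four steps can be illustrated with a short Python/NumPy sketch. The function names `ols` and `j_test_t` and the simulated data are assumptions made for this example; here the first model is the correct one, so only the reversed test is expected to produce a large t-ratio.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients and conventional standard errors."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (n - k)
    cov = s2 * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))

def j_test_t(y, X, Z):
    """t-ratio on l in y = Xb + l*yhat_Z + e, with yhat_Z fitted from y = Zd + e."""
    d_hat, _ = ols(y, Z)                              # Step 1
    yhat_z = Z @ d_hat
    coef, se = ols(y, np.column_stack([X, yhat_z]))   # Step 2
    return coef[-1] / se[-1]                          # Step 3: compare with N(0, 1)

# Simulated data (an assumption for illustration): y is generated by the first model.
rng = np.random.default_rng(2)
n = 500
a = rng.normal(size=n)
c = 0.5 * a + rng.normal(size=n)       # a regressor correlated with a
y = 1.0 + 2.0 * a + rng.normal(size=n)
X = np.column_stack([np.ones(n), a])   # regressors of the first model
Z = np.column_stack([np.ones(n), c])   # regressors of the second model

t_first = j_test_t(y, X, Z)    # tests the first model against the second
t_second = j_test_t(y, Z, X)   # Step 4: roles reversed
print(f"t (first model) = {t_first:.2f}, t (second model) = {t_second:.2f}")
```

The shared constant in X and Z does not cause perfect collinearity in the augmented regression, because the fitted values also depend on the regressor that the other model lacks.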
The sample programs that illustrate this section (Figures 31 through 33) specify and test the following models:
y=Xb+Zg+e1 (8)
y=Xb+Wd+e2 (9)
These models are exactly the ones described in the preceding section, on the nonnested F-test.
In these results, the coefficient on YHAT2 (the fitted y values from model 9) in augmented specification 1 is 1.0372 with t-statistic = 2.2322 (P-value = 0.0258). This indicates that the variables contained in W would contribute significant incremental explanatory power if included in model 8. By the same token, the coefficient on YHAT1 (the fitted y values from model 8) in augmented specification 2 is 0.9560 with t-statistic = 1.8920 (P-value = 0.0586). This indicates that the variables contained in Z would contribute significant incremental explanatory power, at the 10 percent level, if included in model 9. As expected, these results are qualitatively similar to those in the section on the nonnested F-test. Neither model dominates, and it appears that a model that includes variables from both specifications is called for. Notice that the t-statistics of the coefficients not associated with the fitted values in the augmented regressions are all quite small. This is because much of their explanatory power has been captured by the fitted y values, and the fitted y values are collinear with the remaining variables. Figures 31 through 33 are sample programs for the nonnested J-test.
Recommended references: Davidson and MacKinnon (1981, 781-793); Greene (1990, 231-234); Judge et al. (1984, 884-885); Kennedy (1985, 70, 79-80, 85-87; 1992, 81, 87-88); Kmenta (1986, 595-600); Maddala (1988, 443-447); McAleer and Pesaran (1986).
Figure 31 - Sample program for nonnested J-test, in GAUSS-386
|
/***************************************************************** OUTPUT FILE = NNESTJ.OUT RESET; ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3]; NAMES1 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 IX3 IX7,.]; NAMES2 = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 IX6 IX12,.]; Z = DATA[.,IX3 IX7]; @-------- CALCULATE FITTED YS FROM THE ALTERNATIVE MODELS --------@ X1 = X0 ~ Z; @-------- AUGMENTED REGRESSION 1 --------@ X1 = X1 ~ YHAT2; " REGRESSION RESULTS FOR AUGMENTED SPECIFICATION 1 "; I = 1; FORMAT /M1 /RD 12,8; $NAMES1[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " YHAT2 ";; PRN[K1,.]; "f"; @-------- AUGMENTED REGRESSION 2 --------@ X2 = X2 ~ YHAT1; E2 = Y - X2*B2; @ RESIDUALS @ " REGRESSION RESULTS FOR AUGMENTED SPECIFICATION 2 "; I = 1; FORMAT /M1 /RD 12,8; $NAMES2[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " YHAT1 ";; PRN[K2,.]; "f"; OUTPUT FILE = NNESTJ.OUT OFF; |
Figure 32 - Sample program for nonnested J-test, in SAS PC
|
******************************************************************
LIBNAME CDRV 'C:DATA';
* ALL VARIABLES EXCEPT X3, X7, X6 AND X12 ARE COMMON TO ALL
* TO TEST SPECIFICATION 1 : FIRST ESTIMATE SPECIFICATION 2.;
PROC REG DATA=CDRV.DATA;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 X6 X12;
OUTPUT OUT=HAT2 P=YHAT2;
RUN;
* TO TEST SPECIFICATION 1 : NEXT FORCE PREDICTED VALUE FROM SPECIFICATION 2
PROC REG DATA=HAT2;
MODEL Y1=YHAT2 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 X3 X7;
RUN;
* TO TEST SPECIFICATION 2 : FIRST ESTIMATE SPECIFICATION 1;
PROC REG;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 X3 X7;
OUTPUT OUT=HAT1 P=YHAT1;
RUN;
* TO TEST SPECIFICATION 2 : NEXT FORCE PREDICTED VALUE FROM SPECIFICATION 1
PROC REG DATA=HAT1;
MODEL Y1=YHAT1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 X6 X12;
RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT; |
Figure 33 - Sample program for nonnested J-test, in SPSS/PC+
|
SET MORE OFF.
GET FILE = 'DATA.SYS'.
* ALL VARIABLES EXCEPT X3, X7, X6 AND X12 ARE COMMON TO ALL
* TO TEST SPECIFICATION 1 : FIRST ESTIMATE SPECIFICATION 2.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3
 D5 D6 D7 D8 RD1 RD2 RD3 X6 X12
* TO TEST SPECIFICATION 1 : NEXT FORCE PREDICTED VALUE FROM SPECIFICATION 2
REGRESSION VARIABLES = Y1 YHAT2 X1 X2 X8 X9 X10 X13 X14 X15 D1
 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 X3 X7
* TO TEST SPECIFICATION 2 : FIRST ESTIMATE SPECIFICATION 1.
REGRESSION VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3
 D5 D6 D7 D8 RD1 RD2 RD3 X3 X7
* TO TEST SPECIFICATION 2 : NEXT FORCE PREDICTED VALUE FROM SPECIFICATION 1
REGRESSION VARIABLES = Y1 YHAT1 X1 X2 X8 X9 X10 X13 X14 X15 D1
 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 X6 X12
* TEST STATISTIC CALCULATION FROM OUTPUT. |
OMISSION OF VARIABLES: THE RAMSEY RESET TEST
This version of the Regression Specification Error Test (RESET) may be used to test for omission of relevant explanatory variables. When one or more relevant variables (either unobserved or unobservable) are omitted from a model, the error term of the incorrect model includes the influence of the omitted variables. If proxy variable(s), Z, can be constructed to stand in for the omitted variable(s), a specification error test may be formed by testing if Z has significant incremental explanatory power for y.
In this version of RESET, a proxy variable matrix Z is constructed from the second, third, and fourth powers of the fitted values of y from the original model.
Let the model of interest be
y=Xb+e (10)
This model is restricted in the sense that it does not contain the proxy variables in matrix Z. The augmented model does contain them.
The RESET test is then conducted following the steps described below.
|
Step 1 |
Using OLS, estimate the restricted model (10). |
|
Step 2 |
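Calculate fitted values: $\hat{y} = X\hat{b}$. |
|
Step 3 |
Form the proxy variables as powers of the fitted values: $Z = [\hat{y}^2, \hat{y}^3, \hat{y}^4]$, with each power taken element by element. |
|
Step 4 |
Estimate the augmented model by OLS: regress y on $[X, Z]$. |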
|
Step 5 |
Using an F-test, check if the coefficients on the columns of the Z matrix are jointly significant. If so, the null hypothesis of no specification error is rejected. |
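The five steps translate directly into a short routine. The Python/NumPy sketch below is illustrative only: the names `ols_fit` and `reset_f` and the simulated data are assumptions for this example. The misspecification simulated here, an omitted squared term, is a case that powers of the fitted values can pick up; omitted variables unrelated to the included regressors may go undetected.

```python
import numpy as np

def ols_fit(y, X):
    """Fitted values and residual sum of squares from OLS of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    return fitted, float(resid @ resid)

def reset_f(y, X, max_power=4):
    """RESET F-statistic using powers 2..max_power of the fitted values."""
    n, k_r = X.shape
    yhat, rss_r = ols_fit(y, X)                                        # Steps 1-2
    Z = np.column_stack([yhat ** p for p in range(2, max_power + 1)])  # Step 3
    XZ = np.column_stack([X, Z])
    _, rss_u = ols_fit(y, XZ)                                          # Step 4
    k_u = XZ.shape[1]
    return ((rss_r - rss_u) / (k_u - k_r)) / (rss_u / (n - k_u))       # Step 5

# Simulated data (an assumption for illustration).
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
e = rng.normal(size=n)
y_ok = 1.0 + x + e                      # linear model is correctly specified
y_curved = 1.0 + x + 0.5 * x**2 + e     # squared term omitted from X

f_ok = reset_f(y_ok, X)
f_curved = reset_f(y_curved, X)
print(f"F (correct model) = {f_ok:.2f}, F (omitted x^2) = {f_curved:.2f}")
```

The statistic is compared with an F critical value with 3 and N - K - 3 degrees of freedom, where K is the number of columns of X.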
In the sample programs for the nonnested F-test and the nonnested J-test (previously discussed), we examined whether a model that contained variables X3 and X7 or variables X6 and X12 was to be preferred. Evidence was found that the preferred model would contain all four variables. In illustrating the RESET test, all of these variables will be omitted in forming the restricted model to check whether the RESET test detects this omission.
In fact, the F-test for incremental explanatory power yields an F-value of 0.6024 (P-value = 0.6135), and it is concluded that specification error is absent. The previous tests used X3 and X7, and X6 and X12, directly, but the RESET test uses no specific information about these variables. Thus, it is illustrated that the RESET test may not be powerful for detecting misspecification. If specific variables are to be tested to determine whether they should be included in a regression model, they should be tested explicitly rather than through a nonspecific test like RESET.
Figures 34 through 36 are sample programs for the Ramsey RESET Test.
|
NOTES: 1. Thursby (1979, 1981, 1982) discusses using RESET in conjunction with tests for other types of specification error. 2. A method that has been shown by Monte Carlo studies to be preferable to using powers of $\hat{y}$ is to use powers of the explanatory variables (Thursby and Schmidt 1977). |
Recommended references: Griffiths, Hill, and Judge (1993, 498-499); Judge et al. (1984, 364); Kennedy (1985, 71, 81; 1992, 95, 102, 104); Kmenta (1986, 452-455); Maddala (1988, 162, 407); Ramsey (1969, 350-371); Thursby (1979, 222-225; 1981, 117-123; 1982, 314-321); Thursby and Schmidt (1977, 635-641).
Figure 34 - Sample program for the Ramsey RESET Test, in GAUSS-386
|
/************************************************************** FORMAT /M2 /RD 12,4; X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.]; @-------- OLS ESTIMATION OF "RESTRICTED" MODEL --------@ KR = COLS(X); B = INV(X'X)*X'Y; @ BETAS @ " "; I = 1; FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " "; @-------- RESET VARIABLES --------@ Y2 = YHAT^2; @-------- OLS ESTIMATION OF "UNRESTRICTED" REGRESSION --------@ X = X ~ Y2 ~ Y3 ~ Y4; KU = COLS(X); B = INV(X'X)*X'Y; @ BETAS @ " "; I = 1; FORMAT /M1 /RD 12,8; $NAMES[I,.];; FORMAT /M1 /RD 12,4; PRN[I+1,.]; I = I + 1; ENDO; " "; " Y2 ";; PRN[20,.]; " Y3 ";; PRN[21,.]; " Y4 ";; PRN[22,.]; " "; " "; @--- F-STAT FOR INCREMENTAL EXPLANATORY POWER OF RESET VARIABLES ---@ F = ( (RSSR-RSSU) / (KU - KR) )/( RSSU / (NCASE-KU) ); PROB = CDFFC(F,(KU-KR),(NCASE-KU)); " "; "f"; OUTPUT FILE = RESET.OUT OFF; |
Figure 35 - Sample program for the Ramsey RESET Test, in SAS PC
|
**************************************************************
LIBNAME CDRV 'C:DATA';
PROC REG DATA=CDRV.DATA;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
DATA YHATX;
SET HAT;
RUN;
PROC REG DATA=YHATX;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 YHAT2 YHAT3 YHAT4;
FTEST : TEST YHAT2, YHAT3, YHAT4;
RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT. |
Figure 36 - Sample program for the Ramsey RESET Test, in SPSS/PC+
|
SET MORE = OFF. GET FILE = 'DATA.SYS' . * RESTRICTED REGRESSION. D7 D8 RD1 RD2 RD3 COMPUTE YHAT2 = YHAT**2. * UNRESTRICTED REGRESSION. RD1 RD2 RD3 YHAT2 YHAT3 YHAT4 * THE LOW TOLERANCE CRITERIA IS EMPLOYED TO FORCE YHAT2 AND
YHAT3 INTO * TEST STATISTIC CALCULATION FROM OUTPUT. |
MULTI-COLLINEARITY DIAGNOSTICS
Multicollinearity exists when there is a linear relationship among some subset of regressors in a model. Multicollinearity exists in virtually every data set but is a problem only when the linear relationship among regressors is very strong. The main effects of high multicollinearity are that the variances of the estimated coefficients are inflated and the t-statistics are consequently small; and, in extreme cases, the coefficients may be very sensitive and unstable with respect to minor changes in model specification and data.
Since multicollinearity is essentially a matter of degree, attention has focused on descriptions of its extent and on assessments of the extent to which it inflates the variances of the coefficients. Two popular methods for assessing the strength of multicollinearity are discussed below.
Auxiliary Regressions
The auxiliary regression method is more useful than the popular method of simply looking at the correlation matrix of regressors, since the latter reveals only pairwise relationships between variables. The auxiliary regression method makes use of the fact that the R² statistic measures the extent to which one variable is a linear combination of a set of other variables. The strategy is to regress each continuous regressor, in turn, on all remaining regressors and to check the R² of each auxiliary regression. High R² values indicate the existence of strong linear dependencies. If only one linear relationship is very strong, then it indicates which variable is suspect. However, if more than one linear dependency is strong, then the multicollinearity is more generally distributed among the regressors.
The steps for performing auxiliary regressions and interpreting their results are described below.
|
Step 1 |
Specify the first explanatory variable as the dependent variable and perform OLS, using the remainder of the explanatory variables (including a constant) as regressors. |
|
Step 2 |
Calculate R2 for this regression. A high R2 (one rule of thumb might be approximately 0.90 or above) indicates that the first explanatory variable is a strong linear function of the remaining explanatory variables. This general rule of thumb should be used as a benchmark, not as a strict bound. |
|
Step 3 |
Repeat steps 1 and 2 for each of the continuous explanatory variables in turn. |
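The loop over regressors described in these steps can be sketched as follows in Python with NumPy. The function name `auxiliary_r2` and the simulated regressors are assumptions for this example; the matrix passed in is assumed to hold the explanatory variables without the constant, which the function adds itself.

```python
import numpy as np

def auxiliary_r2(X):
    """R-squared from regressing each column of X on a constant and the other columns."""
    n, k = X.shape
    r2 = []
    for j in range(k):
        target = X[:, j]                                           # Step 1
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        tss = np.sum((target - target.mean()) ** 2)
        r2.append(1.0 - resid @ resid / tss)                       # Step 2
    return np.array(r2)                                            # Step 3: one value per column

# Simulated regressors (an assumption for illustration): x3 is nearly x1 + x2.
rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)
x4 = rng.normal(size=n)                  # unrelated to the others
r2 = auxiliary_r2(np.column_stack([x1, x2, x3, x4]))
print(np.round(r2, 3))
```

In this example the first three R² values should exceed the 0.90 benchmark, showing that the dependency is shared among x1, x2, and x3, while the value for x4 should be close to zero.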
For the eight continuous regressors in the standard model in the sample programs (Figures 37 through 39), the R2 values for the auxiliary regressions range from 0.0895 to 0.8241. Therefore, it is concluded that multicollinearity is not severe.
Recommended references: Fomby, Hill, and Johnson (1984, 293-294); Greene (1990, 277-281); Griffiths, Hill, and Judge (1993, 436-437); Judge et al. (1984, 902-904); Kennedy (1985, 150, 153; 1992, 179-180, 183-184).
Figure 37 - Sample program for performing auxiliary regressions, in GAUSS-386
|
/*************************************************************************** FORMAT /M2 /RD 12,4; X = DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3]; K = COLS(X); NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14
IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.]; " "; I = 1; XA = X[.,I]; XX = X[.,2:K]; ENDIF; XX = X[.,1:(I-1)] ~ X[.,(I+1):K]; ENDIF; XX = X[.,1:(K-1)]; ENDIF; ENDO; "f"; OUTPUT FILE = AUXREG.OUT OFF; |
Figure 38 - Sample program for performing auxiliary regressions, in SAS PC
|
**************************************************************************** LIBNAME CDRV 'C:DATA'; * THIS IS THE MODEL TO BE ESTIMATED; * BELOW ARE THE 8 AUXILIARY REGRESSIONS FOR THE CONTINUOUS VARIABLES; M1: MODEL X1=X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1
RD2 RD3; |
Figure 39 - Sample program for performing auxiliary regressions, in SPSS/PC+
|
SET MORE = OFF.
GET FILE = 'DATA.SYS'.
RD1 RD2 RD3
* BELOW ARE THE 8 AUXILIARY REGRESSIONS FOR THE CONTINUOUS VARIABLES.
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
REGRESSION VARIABLES = X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
* TEST STATISTIC CALCULATION FROM OUTPUT.
* OBSERVE R-SQUARED IN EACH REGRESSION. ONE RULE OF THUMB IS THAT AN
* R-SQUARED VALUE OF 0.9 OR HIGHER INDICATES SERIOUS COLLINEARITY. THIS
* IS A GENERAL RULE OF THUMB, NOT A STRICT BOUND. NONE OF THE 8
* AUXILIARY REGRESSIONS IN THIS EXAMPLE HAS AN R-SQ. ABOVE 0.9.
FINISH. |
Condition Indices and the Condition Number
Strong multicollinearity among the regressors implies that at least one eigenvalue (characteristic root) of the (X'X) matrix is small. Condition indices are the square roots of the ratios of the largest eigenvalue of the standardized (X'X) matrix to the remaining eigenvalues. The condition number is the largest of these values, that is, the square root of the ratio of the largest to the smallest eigenvalue. SAS PC and SPSS/PC+ both produce multicollinearity diagnostics based on condition indices as options of their regression routines. It is also easy to produce them in GAUSS-386. The steps described below may be followed in GAUSS-386.
|
Step 1 |
Compute the square roots of the diagonal elements of (X'X). Use these to form a diagonal matrix (zeros except on the diagonal), then invert the diagonal matrix and call the result S. |
|
Step 2 |
Form the K × K matrix Z = SX'XS. |
|
Step 3 |
Calculate the vector l containing the K eigenvalues of Z; identify the smallest one as lmin and the largest one as lmax. |
|
Step 4 |
Compute the vector of condition indices C as follows: $C = (l_{max}/l)^{1/2}$, with the division taken element by element. The largest of these indices is the condition number, $(l_{max}/l_{min})^{1/2}$. |
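For readers working outside GAUSS-386, the same four steps can be sketched in Python with NumPy. The function name `condition_indices` and the simulated regressors are assumptions for this example; `numpy.linalg.eigvalsh` is used because the scaled matrix is symmetric.

```python
import numpy as np

def condition_indices(X):
    """Condition indices of X after scaling each column to unit length."""
    xtx = X.T @ X
    S = np.diag(1.0 / np.sqrt(np.diag(xtx)))   # Step 1: inverse of the diagonal matrix
    Z = S @ xtx @ S                            # Step 2: scaled X'X, ones on the diagonal
    eigvals = np.linalg.eigvalsh(Z)            # Step 3: eigenvalues of the symmetric Z
    return np.sqrt(eigvals.max() / eigvals)    # Step 4: largest entry is the condition number

# Simulated regressors (an assumption for illustration).
rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2_near = x1 + 0.01 * rng.normal(size=n)   # almost a copy of x1
x2_free = rng.normal(size=n)               # unrelated to x1

ci_bad = condition_indices(np.column_stack([np.ones(n), x1, x2_near]))
ci_ok = condition_indices(np.column_stack([np.ones(n), x1, x2_free]))
print("condition number, near-collinear:", round(float(ci_bad.max()), 1))
print("condition number, unrelated:", round(float(ci_ok.max()), 1))
```

The constant is included as a column of X, as in the sample programs, so collinearity involving the intercept is also reflected in the indices.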
Extensive experimentation conducted by Belsley, Kuh, and Welsch (1980) suggests that condition indices in excess of 30 indicate the presence of multicollinearity; condition indices in excess of a few hundred indicate severe multicollinearity. In the sample programs, three condition indices are larger than 30 and one is greater than 100, which is consistent with the results of the auxiliary regressions: multicollinearity is moderate. Figures 40 through 42 are sample programs for determining the condition number.
|
NOTE: Belsley, Kuh, and Welsch (1980) present measures that describe the extent to which variances of estimated coefficients may be inflated because of the presence of multicollinearity; they also present measures to identify which regressors are most problematic. SPSS/PC+ and SAS PC have a preprogrammed option called Variance Decomposition Proportion, which helps to identify the variables that are involved in multicollinearity. |
Recommended references: Belsley, Kuh, and Welsch (1980, chapter 3); Corlett (1990, 158-159); Greene (1990, 281); Johnston (1984, 249-250); Judge et al. (1984, 902, 914, 920); Kennedy (1985, 150, 153; 1992, 180, 183); Kmenta (1986, 439); Maddala (1988, 228).
Figure 40 - Sample program for determining the condition number, in GAUSS-386
|
/***************************************************************************
FORMAT /M2 /RD 12,4;
OUTPUT FILE = CONDNUM.OUT ON;
NAMES = GETNAME("DATA");
Y = DATA[.,IY1];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14
    IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14
    IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
@-------- OLS ESTIMATION --------@
K = COLS(X);
B = INV(X'X)*X'Y;          @ BETAS @
PRN = B ~ SE ~ T ~ PT;     @ FOR PRINTING @
" ";
I = 1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
@------- FORM SCALED VERSION OF (X'X) --------@
D = SQRT(DIAG(X'X));
@-------- COMPUTE EIGENVALUES OF Z --------@
L = EIGRS(Z);
LMIN = MINC(L);
CONDINDX = SQRT(LMAX./L);
" ";
"f";
OUTPUT FILE = CONDNUM.OUT OFF; |
Figure 41 - Sample program for determining the condition number, in SAS PC
****************************************************************************
LIBNAME CDRV 'C:DATA';
* THIS IS THE MODEL TO BE ESTIMATED;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 / COLLIN;
RUN;
* TEST STATISTIC CALCULATION FROM OUTPUT.
Figure 42 - Sample program for determining the condition number, in SPSS/PC+
SET MORE OFF.
GET FILE = 'DATA.SYS' .
RD1 RD2 RD3
* TEST STATISTIC CALCULATION FROM OUTPUT.
TESTING FOR STRUCTURAL CHANGE
The Chow F-Test
The Chow F-test, more commonly known as the Chow test, is a simple way to test whether the underlying parameter values for a data set change across specified subsets of that data, for example, across different time periods or household types. The Chow test compares the RSS from a restricted model (which assumes that the parameters are constant across data subsets) with the RSS from an unrestricted model (which allows the parameters to vary across data subsets). The unrestricted RSS may be obtained by running separate regressions for the data subsets and summing the resulting RSSs or, alternatively, by running a single regression that includes a set of dummy and dummy-interaction variables that distinguish among the subsets of the data. Both methods are simple and give identical results. Both are presented below in GAUSS-386; in SAS PC and SPSS/PC+, only the second approach is presented. The sample programs investigate whether the data from the round 1 survey are distinct from the data drawn from the other three rounds.
This example is slightly more complicated to program than typical examples of the Chow test because of the presence of two dummies to distinguish among the three rounds in the second data subset. In effect, distinct intercepts for all survey rounds are permitted, and this example only tests whether slope coefficients are distinct between round 1 and the other three rounds. The models used in this example are as follows:
· Round 1 model (RD2 = RD3 = RD4 = 0):
where X contains neither an intercept nor any round dummies.
· Rounds 2 through 4 model (RD1 = 0):
where X is as described in the round 1 model, and RD3 and RD4 introduce intercept differentials for the third and fourth rounds.
Note that RD4 is not contained in the data set, but can be constructed from knowledge of RD1, RD2, and RD3.
· Restricted model (only intercepts allowed to vary):
y=b0+Xb+d2RD2+d3RD3+d4RD4+e
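The note above that RD4 can be constructed from RD1, RD2, and RD3 amounts to one line of arithmetic, because every observation belongs to exactly one survey round. A minimal sketch in Python with NumPy, using six made-up observations rather than the survey data:

```python
import numpy as np

# Hypothetical round dummies for six observations (illustration only).
rd1 = np.array([1, 0, 0, 0, 1, 0])
rd2 = np.array([0, 1, 0, 0, 0, 0])
rd3 = np.array([0, 0, 1, 0, 0, 1])

# Each observation falls in exactly one round, so the four dummies sum to 1.
rd4 = 1 - rd1 - rd2 - rd3
print(rd4)
```

The same identity explains why only three of the four round dummies can appear alongside an intercept: the fourth is an exact linear combination of the intercept and the other three.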
First Approach: Estimating Separate Models for Two Data Subsets (GAUSS-386). In the first approach, the data are split into subsets and a separate model is estimated from each:
| Step 1 | Separate the data into two subsets: one from the first round of the survey (RD1 = 1) and one from the other rounds (RD1 = 0). |
| Step 2 | Run three regressions. First: estimate the Round 1 model on the data for which RD1 = 1 and retain the RSS; call it RSS1. Second: estimate the Rounds 2 through 4 model on the data for which RD1 = 0 and retain the RSS; call it RSS2. Third: estimate the restricted model on the full data set and retain the RSS; call it RSSR (restricted RSS). |
| Step 3 | Compute the unrestricted RSS: RSSU = RSS1 + RSS2. |
| Step 4 | Form the test statistic F = [(RSSR - RSSU)/df1] / [RSSU/df2]. Here, the numerator degrees of freedom, df1, is equal to the number of restrictions (the number of slope coefficients that are forced to be equal across the two models, which is 15 in the sample programs), and the denominator degrees of freedom, df2, is equal to the degrees of freedom associated with the unrestricted model (sample size minus the total number of coefficients estimated in the unrestricted model[s]). |
Second Approach: Dummy Variables (GAUSS-386, SAS PC, and SPSS/PC+ programs). In the second approach, dummy variables are used:
| Step 1 | Let RD1 be the dummy variable that identifies the first-round survey observations. Form the matrix of interaction variables DX = RD1.*X, where .* denotes element-by-element multiplication of each column of X by RD1 (DX has 15 columns in the sample programs). |
| Step 2 | Estimate the unrestricted model by OLS: y=b0+Xb+DXd+d2RD2+d3RD3+d4RD4+e. This is the unrestricted model because the dummy-interaction variables allow all slope coefficients to differ across subsamples. |
| Step 3 | Estimate the restricted model by OLS: y=b0+Xb+d2RD2+d3RD3+d4RD4+e. Comparing the restricted and unrestricted models, it is evident that the hypothesis to be tested is H0: d = 0 against H1: d ≠ 0. |
| Step 4 | Compute the test statistic exactly as in step 4 of the first approach. |
Both approaches to the test produce an F-statistic of 1.191 with (df1, df2) = (15, 190); hence, the null hypothesis of equal slope coefficients in round 1 versus rounds 2 through 4 (no structural change) cannot be rejected.
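That the two approaches give the same F-statistic can be checked numerically. The sketch below, in Python with NumPy, is a simplified stand-in for the sample programs and rests on stated assumptions: the data are simulated, there are only two groups (a hypothetical dummy g plays the role of RD1), three slope coefficients instead of 15, and no additional round dummies. The helper names rss and f_stat are introduced here for illustration.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return float(resid @ resid)

def f_stat(rss_r, rss_u, df_num, df_den):
    """F = [(RSSR - RSSU)/df1] / [RSSU/df2]."""
    return ((rss_r - rss_u) / df_num) / (rss_u / df_den)

# Simulated data: 120 observations, 3 slope regressors, 2 groups.
rng = np.random.default_rng(1)
n, k = 120, 3
X = rng.normal(size=(n, k))
g = (np.arange(n) < 40).astype(float)      # hypothetical group dummy (role of RD1)
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + 0.3 * g + rng.normal(size=n)

ones = np.ones((n, 1))
G = g[:, None]

# Restricted model: common slopes, group-specific intercepts.
rss_r = rss(y, np.hstack([ones, G, X]))

# First approach: separate regressions on each subset, then sum the RSSs.
m = g == 1
rss_u_split = rss(y[m], np.hstack([ones[m], X[m]])) + rss(y[~m], np.hstack([ones[~m], X[~m]]))

# Second approach: one regression with dummy-interaction variables DX = g .* X.
rss_u_dummy = rss(y, np.hstack([ones, G, X, G * X]))

df_num = k                    # number of slope restrictions
df_den = n - 2 * (k + 1)      # sample size minus coefficients in the unrestricted model
F_split = f_stat(rss_r, rss_u_split, df_num, df_den)
F_dummy = f_stat(rss_r, rss_u_dummy, df_num, df_den)
print(F_split, F_dummy)
```

The two unrestricted residual sums of squares agree up to rounding error, so the F-statistics coincide; the choice between the approaches is one of programming convenience.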
The Chow test is applicable to a wide variety of hypotheses; this example shows only one case. Refer to the references for additional applications. Figures 43 through 45 are sample programs for the Chow test.
Recommended references: Chow (1960, 591-605); Fomby, Hill, and Johnson (1984, 197-199); Greene (1990, 218-222); Johnston (1984, 207-225); Kennedy (1985, 87-88, 186; 1992, 98, 108-109); Kmenta (1986, 420-422); Maddala (1988, 134).
Figure 43 - Sample program for Chow test, in GAUSS-386
/************************************************************************
FORMAT /M2 /RD 12,4;
NAMES = GETNAME("DATA");
Y = DATA[.,IY1];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
RD1 = DATA[.,IRD1];
NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
@-------- FIRST APPROACH: RESTRICTED REGRESSION --------@
@-------- RESTRICTED REGRESSION --------@
K = COLS(X);
PRN = B ~ SE ~ T ~ PT;     @ FOR PRINTING @
" ";
I = 1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
"f";
@-------- REGRESSION ON FIRST-ROUND (RD1 = 1) SUBSET -------@
Y1 = SELIF(Y,RD1);
B = INV(X1'X1)*X1'Y1;      @ BETAS @
" ";
" INTERCEPT ";; PRN[1,.];
I = 1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
"f";
@-------- REGRESSION ON NON-FIRST-ROUND DATA --------@
Y2 = DELIF(Y,RD1);
X2 = X2[.,1:K K+2 K+3];
NAME2 = NAMES[1:(K-1) K+1 K+2,.];
K = COLS(X2);
" ";
I = 1;
FORMAT /M1 /RD 12,8; $NAME2[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
RSSU = RSS1 + RSS2;
F = ( (RSSR - RSSU)/DFN ) / (RSSU/DFD);
PROBF = CDFFC(F,DFN,DFD);
" ";
" RESULTS FOR SUBSET REGRESSION APPROACH";
"f";
@-------- SECOND APPROACH: RESTRICTED REGRESSION --------@
K = COLS(X);
DX = RD1 .* X[.,2:(K-3)];
NAMES = NAMES | "DX1" | "DX2" | "DX8" | "DX9" | "DX10" | "DX13" |
X = X ~ DX;
K = COLS(X);
@-------- UNRESTRICTED DUMMY-VARIABLE REGRESSION --------@
B = INV(X'X)*X'Y;          @ BETAS @
PRN = B ~ SE ~ T ~ PT;     @ FOR PRINTING @
" ";
I = 1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
DFN = (K-4)/2;
F = ( (RSSR - RSSU)/DFN ) / (RSSU/DFD);
PROBF = CDFFC(F,DFN,DFD);
" ";
" RESULTS FOR DUMMY-VARIABLE APPROACH";
OUTPUT FILE = CHOW.OUT OFF;
Figure 44 - Sample program for Chow test, in SAS PC
********************************************************************************
* THE NULL HYPOTHESIS BEING TESTED IS THAT THE SLOPE COEFFICIENTS ON
LIBNAME CDRV 'C:DATA';
DATA DAT2;
SET CDRV.DATA;
DX1 = RD1*X1;
RUN;
PROC REG DATA=DAT2;
MODEL Y1= X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
B1 : TEST DX1=DX2=DX8=DX9=DX10=DX13=DX14=DX15= DD1=DD2=DD3=DD5=DD6=DD7=DD8=0;
RUN;
* THE F-TEST STATISTIC IS CALCULATED FROM THE "B1: TEST" COMMAND;
Figure 45 - Sample program for Chow test, in SPSS/PC+
SET MORE = OFF.
* THE NULL HYPOTHESIS BEING TESTED IS THAT THE SLOPE COEFFICIENTS ON
GET FILE = 'DATA.SYS' .
COMPUTE DX1 = RD1*X1.
*UNRESTRICTED MODEL.
RD1 RD2 RD3
*RESTRICTED MODEL.
RD1 RD2 RD3
* TEST STATISTIC CALCULATION FROM OUTPUT.
TESTING FOR NONLINEAR VARIABLES
The linearity assumption of the Classical Linear Regression Model refers to the assumption that the parameters enter the equation linearly. No such assumption is required concerning the manner in which the variables enter the equation. However, it is common to specify that the variables enter linearly. If this is inappropriate, then the consequences are similar to other forms of misspecification, such as the omission of relevant explanatory variables. In fact, if the Taylor theorem is used, inappropriate functional forms may be viewed as a special case of the omitted variables problem (Kmenta 1986, 449-451). Because of the similarity of the two problems, test results that indicate inappropriate functional form may actually be revealing an omitted variable problem. One test that is less susceptible to this problem is Utts' Rainbow test.
Utts' Rainbow Test
This test is related to the Chow test for structural stability, with the sample divided into two subsamples according to the observations' influence (or leverage) on the regression results. If observations with high leverage displace the regression results significantly, then it may be concluded that the specification of the regression function is inadequate. The test makes use of a measure of leverage that is also used to detect influential outliers in a regression.
The model is the standard one:
y=Xb+e
The test is based on the difference in the RSS from the restricted regression (same model applies to all observations) and the RSS from the unrestricted regression (on observations that have small leverage). The null hypothesis is that this difference is zero. Keep in mind that this test assumes that the stochastic disturbance terms satisfy the classical assumptions. If they do not, then the test is not valid. Here, proceed under the assumption that the classical assumptions are satisfied.
| Step 1 | Perform OLS on the full data set and retain the residual sum of squares RSSR (restricted RSS). |
| Step 2 | Compute the leverage measure for each observation in X: hi = xi(X'X)^-1 xi', where xi is the ith row of X. Sort the leverage measures into ascending order and select the half that are smallest. Identify the observations in X and Y that correspond with the small leverage measures. |
| Step 3 | Perform OLS on the subsample selected in step 2, and retain the residual sum of squares RSSU (unrestricted). |
| Step 4 | Calculate the statistic U = [(RSSR - RSSU)/(N - NS)] / [RSSU/(NS - K)], where N = the full sample size, NS = the number of observations in the low-leverage subsample, and K = the number of estimated coefficients. U is compared with the F distribution with (N - NS, NS - K) degrees of freedom. |
A rejection of the null hypothesis implies that the functional form is inadequate. For these sample programs, U = 1.195 (F-critical = 1, P-value = 0.0058), so that the null hypothesis is rejected. Recall, however, that this model omits X3, X6, X7, and X12, and that heteroskedasticity afflicts the disturbances. An improved test would be to include the additional variables known to be significant and to correct for heteroskedasticity before conducting the Rainbow test. Figures 46 through 48 are sample programs for Utts' Rainbow test.
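The four steps of the Rainbow test can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the sample programs themselves: the data are simulated, the helper names ols_rss and rainbow_test are introduced here, and the denominator degrees of freedom are taken to be the subsample size minus the number of coefficients.

```python
import numpy as np

def ols_rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return float(resid @ resid)

def rainbow_test(y, X):
    """Return (U, df_num, df_den, leverage) for the Rainbow test."""
    n, k = X.shape
    xxi = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, xxi, X)   # h_i = x_i (X'X)^-1 x_i'
    keep = np.argsort(h)[: n // 2]            # half of the sample with smallest leverage
    rss_r = ols_rss(y, X)                     # restricted: all observations
    rss_u = ols_rss(y[keep], X[keep])         # unrestricted: low-leverage subsample
    df_num = n - keep.size
    df_den = keep.size - k
    u = ((rss_r - rss_u) / df_num) / (rss_u / df_den)
    return u, df_num, df_den, h

# Simulated data: a linear model is fitted to a linear and to a curved relationship.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
noise = 0.1 * rng.normal(size=n)

u_lin, df_num, df_den, h = rainbow_test(1.0 + 0.5 * x + noise, X)
u_curv, _, _, _ = rainbow_test(1.0 + 0.5 * x + x**2 + noise, X)
print(u_lin, u_curv, df_num, df_den)
```

When the true relationship is curved, the high-leverage observations fit the straight line poorly and U becomes large; when the linear specification is correct, U stays near 1. A P-value would come from the upper tail of the F distribution with (df_num, df_den) degrees of freedom.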
Recommended references: Kennedy (1992, 104); Kmenta (1986, 454-455); Krämer et al. (1985, 120-121); Utts (1982, 2801-2815).
Figure 46 - Sample program for Utts' Rainbow test, in GAUSS-386
/***********************************************************************
FORMAT /M2 /RD 12,4;
NAMES = GETNAME("DATA");
Y = DATA[.,IY1];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];
@-------- OLS ESTIMATION --------@
K = COLS(X);
" ";
I = 1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
"f";
@-------- CONSTRUCT VECTOR OF LEVERAGE MEASURES --------@
@-------- THE MATRIX X CONTAINS "OBSERVATION --------@
@-------- NUMBER" IN THE FIRST COLUMN AND THE --------@
@-------- CORRESPONDING LEVERAGE MEASURE IN THE --------@
@-------- SECOND COLUMN. --------@
N = NCASE;
XXI = INV(X'X);
DO WHILE I <= N;           @ LOOP OVER WHOLE SAMPLE @
Z = X[I,.];
I = I + 1;
ENDO;                      @ END OF LOOP @
@-------- SORT H BY THE MAGNITUDE OF THE LEVERAGE --------@
H = SORTC(H,2);
@-------- CHOOSE ELEMENTS OF X AND Y THAT CORRESPOND --------@
YS = Y[M[.,1],.];
@-------- OLS ON SUBSET OF DATA HAVING SMALL --------@
NS = ROWS(XS);
PRN = B ~ SE ~ T ~ PT;     @ FOR PRINTING @
" ";
I = 1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
@-------- CALCULATION OF THE RAINBOW TEST STATISTIC --------@
DFN = N/2;                 @ NUMERATOR D.F. @
U = ( (RSS - RSSS) / DFN ) / ( RSSS / DFD ) ;
PU = CDFFC(U,DFN,DFD);
" ";
"f";
OUTPUT FILE = RAINBOW.OUT OFF;
Figure 47 - Sample program for Utts' Rainbow test, in SAS PC
**************************************************************************
LIBNAME CDRV 'C:DATA';
* MODEL WITH ALL OBSERVATIONS (MODEL 1).;
PROC REG DATA=CDRV.DATA;
MODEL Y1=X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
RUN;
* MODEL WITH HALF OF THE OBSERVATIONS (812) WHICH HAVE
PROC RANK DATA=HDATA OUT=RHDATA GROUP=2;
VAR LEV;
RUN;
PROC REG DATA=RHDATA;
WHERE RLEV=0;
RUN;
* RETAIN THE RESPECTIVE RESIDUAL SUM OF SQUARES (RSS) VALUES.
Figure 48 - Sample program for Utts' Rainbow test, in SPSS/PC+
SET MORE = OFF.
GET FILE = 'DATA.SYS' .
* MODEL WITH ALL OBSERVATIONS (RESTRICTED MODEL).
RD1 RD2 RD3
* THE LEV VARIABLE INDICATES THE INFLUENCE EACH OBSERVATION HAS ON THE
* COEFFICIENT ESTIMATES.
RANK LEV /NTILE (2).
* MODEL WITH HALF OF THE OBSERVATIONS (812) WHICH HAVE
PROCESS IF (NLEV = 1).
RD1 RD2 RD3
* RETAIN THE RESPECTIVE RESIDUAL SUM OF SQUARES (RSS) VALUES.
Linear Splines
This technique is useful for approximating a curvilinear regression without specifying the mathematical form of the curvature. A linear spline is a continuous piecewise-linear function, that is, one in which the adjacent line segments meet at the interval boundaries (or knots). As with other models that incorporate break points, the number and location of the intervals may be difficult to specify a priori. Attention should be paid to theoretical considerations, although a grid data search may also be employed, as in the example below. The linear spline is most appropriately used where the regression model is expected to be linear, but to have structural breaks at specific values of an explanatory variable. In the standard regression model the coefficients of the regression are restricted to be equal across spline segments. The standard version of this model is
y=Xb+Zg+e
However, it is expected that the response of y to changes in Z is distinct for three distinct regions of Z. In the example at hand, y is household calorie intake per day and Z is total weekly household expenditures. X contains all of the remaining regressors. The relationship between caloric intake and total expenditures might be expected to be different for low-expenditure, medium-expenditure, and high-expenditure families, but the precise dividing lines between low, medium, and high may not be known. The spline program will help to determine this. Note that this model has two knots; it is possible to develop models that have more, but the tensions among good fit, theory, and parsimonious parameterization should be kept in mind.
It is useful to begin by considering this model as a dummy-variable model with D1 = 1 for medium-expenditure households, zero otherwise; and D2 = 1 for high-expenditure households, zero otherwise. Then the model is
y=Xb+Zg+D1d1+D1Zg1+D2d2+D2Zg2+e
The dummy variable model does not guarantee that the piecewise segments join at the knots. Let the first knot be at L, so that low-expenditure households have income Z ≤ L. The second knot is at H, so that low- and medium-expenditure households have Z ≤ H. Then continuity at the knots is ensured if the model is specified as
y=Xb+Zg+D1(Z-L)g1+D2(Z-H)g2+e
One way to proceed is to program the computer to do a grid search over L and H, performing OLS for each (L, H) pair and checking for the pair that minimizes the RSS. These sample programs illustrate this approach. Whether the spline function leads to a significant improvement in RSS may be tested with a standard F-test (note that this is a simple application of the Chow test for structural stability). In this version of the F-test, the numerator degrees of freedom is equal to the number of knots specified and the denominator degrees of freedom is equal to the sample size less the total number of coefficients estimated in the spline function model. An alternative approach to spline modeling is given in Johnston (1984, 392-394).
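The grid search and the accompanying F-test can be sketched in Python with NumPy. The sketch is illustrative rather than a translation of the sample programs and makes several assumptions: the data are simulated with true knots at 2.5 and 4.5, the helper names ols_rss, spline_design, and grid_search are introduced here, and the knot dummies are taken to be cumulative (D1 = 1 whenever Z > L and D2 = 1 whenever Z > H), which keeps the fitted line continuous at both knots.

```python
import numpy as np

def ols_rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return float(resid @ resid)

def spline_design(x_other, z, low, high):
    """Regressors [X, Z, D1(Z-L), D2(Z-H)] with cumulative knot dummies."""
    d1 = (z > low).astype(float)
    d2 = (z > high).astype(float)
    return np.column_stack([x_other, z, d1 * (z - low), d2 * (z - high)])

def grid_search(y, x_other, z, grid, min_gap):
    """Return (min RSS, L, H) over all grid pairs with H >= L + min_gap."""
    best_rss, best_low, best_high = np.inf, None, None
    for low in grid:
        for high in grid:
            if high < low + min_gap:
                continue                       # keep the two regions from overlapping
            r = ols_rss(y, spline_design(x_other, z, low, high))
            if r < best_rss:
                best_rss, best_low, best_high = r, low, high
    return best_rss, best_low, best_high

# Simulated data: slope in z is 2.5, then 1.5 above z = 2.5, then 0.5 above z = 4.5.
rng = np.random.default_rng(3)
n = 500
z = rng.uniform(1.0, 6.0, size=n)
w = rng.normal(size=n)
x_other = np.column_stack([np.ones(n), w])
y = (0.5 + 0.3 * w + 2.5 * z
     - 1.0 * np.maximum(z - 2.5, 0.0)
     - 1.0 * np.maximum(z - 4.5, 0.0)
     + 0.01 * rng.normal(size=n))

grid = np.arange(1.5, 5.51, 0.25)
rss_u, low_hat, high_hat = grid_search(y, x_other, z, grid, min_gap=0.5)

# F-test of the linear (restricted) model against the selected spline model.
rss_r = ols_rss(y, np.column_stack([x_other, z]))
df_num = 2                                     # number of knots
df_den = n - (x_other.shape[1] + 3)            # sample size minus spline coefficients
F = ((rss_r - rss_u) / df_num) / (rss_u / df_den)
print(low_hat, high_hat, F)
```

A coarse grid such as this one can be followed by a finer search around the selected pair, in the spirit of the two-stage search suggested in the note below.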
The sample programs determine that the knot dividing low- and medium-expenditure households is at a log-expenditure level of approximately Z = 2.45 and that the knot dividing medium- and high-expenditure households is at a log-expenditure level of approximately Z = 4.45. The F-test (performed only in GAUSS-386) for the restricted (linear) model versus the unrestricted model (spline) yields F = 5.3889 (P-value = 0.0047), and the linear model is rejected in favor of the spline function.
NOTE: Since SPSS/PC+ for DOS does not include looping or macro capabilities (although SPSS/PC+ for Windows does allow loops), the spline program is not feasible. To accomplish the grid-search procedure, the SPSS/PC+ program would include thousands of lines, with the same batch of 15 to 20 lines repeated hundreds of times. The spline program in SAS PC is feasible but a little clumsy. The program relies heavily on the macro facility included in SAS PC, which makes it difficult to understand. Basically, the macro feature allows the user to define his/her own procedure (in this case, SPLINE) and then run this new procedure with user-defined parameters (START1, STOP1, STOP2, INCRM, and DENOM). In GAUSS-386, the spline program is more straightforward. Techniques used in this program are not unusual for GAUSS code; most GAUSS programmers could easily understand the program. Notice that the sample programs (Figures 49 and 50) carry out an extensive grid search over a finely divided grid. This is not necessary: experimentation with large grid steps may enable the investigator to quickly narrow down the regions in which the knots lie; then a finer search may pinpoint them. Note also that the loops begin the grid search for the upper point (H or CUTOFF2) a specific distance above the lower point (L or CUTOFF1) to avoid overlapping regions for low- and high-expenditure households.
Recommended references: Greene (1990, 248-251); Johnston (1984, 392-396); Kmenta (1986, 569); Stewart and Wallis (1981, 202-204); Suits, Mason, and Chan (1978, 132-133).
Figure 49 - Sample spline program, in GAUSS-386
/**********************************************************************
@-NOTE: RUN TIME IS ABOUT 7 MINUTES ON 486DX2-66, -8
FORMAT /M2 /RD 12,4;
OUTPUT FILE = SPLINE.OUT RESET;
NAMES = GETNAME("DATA");
Y = DATA[.,IY1];
Z1 = DATA[.,IX10];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3] ~ Z1;
NAMES = NAMES[IX1 IX2 IX8 IX9 IX13 IX14 IX15 ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3 IX10,.];
@-------- OLS ESTIMATION --------@
K = COLS(X);
B = INV(X'X)*X'Y;          @ BETAS @
PRN = B ~ SE ~ T ~ PT;     @ FOR PRINTING @
" ";
I = 1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
"f";
@-------- LOOPS FOR SPLINE FUNCTION --------@
OUTPUT FILE = SPLINE.OUT OFF;
RSSR = RSS;                @ RSS FOR ORIGINAL LINEAR MODEL @
                           @ THE "RESTRICTED" MODEL @
RSSMIN = RSS;
L = 2.20;                  @ OUTER LOOP TAKES L FROM 2.20 @
H = L + 0.5;               @ INNER LOOP TAKES H FROM L+0.5 @
RSSMIN = RSS;              @ KEEP MINIMUM RSS @
ENDIF;
ENDO;
OUTPUT FILE = SPLINE.OUT ON;
@-------- OLS REGRESSION FOR SELECTED SPLINE FUNCTION --------@
NAMES = NAMES | "Z2" | "Z3";
D1 = DUMMYDN(Z1,LOPT,2);
Z2 = D1.*(Z1 - LOPT*ONES(NCASE,1));
X = X ~ Z2 ~ Z3;
K = COLS(X);
B = INV(X'X)*X'Y;          @ BETAS @
PRN = B ~ SE ~ T ~ PT;     @ FOR PRINTING @
@-------- PRINT RESULTS FOR SELECTED SPLINE FUNCTION --------@
FORMAT /M1 /RD 12,4;
" ";
I = 1;
FORMAT /M1 /RD 12,8; $NAMES[I,.];;
FORMAT /M1 /RD 12,4; PRN[I+1,.];
I = I + 1;
ENDO;
" ";
@-------- F-TEST WHETHER (RSSR - RSSU) IS SIGNIFICANT --------@
DFN = 2;                   @ NUMERATOR DF = # BREAKS @
                           @ IN SPLINE @
DFD = NCASE - K;
F = ( (RSSR - RSSU) / DFN ) / (RSSU / DFD );
" ";
"f";
OUTPUT FILE = SPLINE.OUT OFF;
Figure 50 - Sample spline program, in SAS PC
***********************************************************************
LIBNAME CDRV 'C:DATA';
* NONLINEARITIES ARE SUSPECTED ALONG THE DIMENSION OF THE LOG OF
* THE FOLLOWING PROC SUMMARY AND DATA STEPS MERGE THE MINIMUM AND
DATA DATAX;
SET CDRV.DATA;
PROC SUMMARY DATA=DATAX;
VAR X10;
DATA SDATA;
MERGE DATAX MINMAX(DROP=_TYPE_ _FREQ_);
* THE FOLLOWING DATA STEP WILL CREATE A TEMPORARY BINARY DATA FILE TO STORE
FILENAME OUTPUT 'C:DATASPLINE.BIN';
DATA _NULL_;
_MODEL_ = 'DUMMY';
_MODEL_ $8.
* THE FOLLOWING STATEMENT BEGINS THE DEFINITION OF THE SAS MACRO.;
* START, STOP AND INCRM MUST BE INTEGERS;
%DO PNT1 = &START1 %TO &STOP1 %BY &INCRM;
%DO PNT2 = &PNT1 + &INCRM2 %TO &STOP2 %BY &INCRM;
* X10 IS THE VARIABLE ACROSS WHICH WE SUSPECT NON-LINEARITY OF THE
DATA SPLINE;
SET SDATA (KEEP=Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3 MINX10 MAXX10);
* THE FOLLOWING USES MACRO VARIABLES TO CREATE THE TWO CUTOFFS.;
CUTOFF1 = &PNT1./&DENOM.;
* THE FOLLOWING CREATES Z1, Z2, Z3 AS EXPLAINED IN TEXT.;
IF (X10 LT MINX10) THEN Z1=0;
Z1=X10-MINX10;
IF (X10 GE CUTOFF1) THEN Z1=&PNT1./&DENOM.-MINX10;
IF (X10 LT CUTOFF1) THEN Z2=0;
THEN Z2=X10-CUTOFF1;
IF (X10 GE CUTOFF2) THEN Z2=CUTOFF2-CUTOFF1;
IF (X10 LT CUTOFF2) THEN Z3=0;
THEN Z3=X10-CUTOFF2;
IF (X10 GE MAXX10 ) THEN Z3=MAXX10-CUTOFF2;
* THE FOLLOWING REGRESSION SAVES THE RMSE AND A MODEL LABEL TO THE BINARY ;
OUTEST=SPLEST NOPRINT;
P&PNT1.P&PNT2.: MODEL Y1= X1 X2 X8 X9 Z1 Z2 Z3 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
DATA _NULL_;
SET SPLEST;
_MODEL_ $8.
* THE FOLLOWING PROVIDES OUTPUT TO THE SCREEN TO MONITOR THE PROGRESS OF THE
FILE 'CON';
%END;
%END;
RUN ;
%MEND SPLINE;
* THE USER MUST PROVIDE SEARCH RANGE FOR CUTOFF1 AND CUTOFF2 AND THE
%LET START1 = 220;
* THE FOLLOWING TO DATA AND PROC STATEMENTS READ IN THE RESULTS FROM EACH
DATA CDRV.SPLOUT;
INFILE OUTPUT RECFM=N;
_MODEL_ $8.
PROC UNIVARIATE DATA=CDRV.SPLOUT;
VAR _RMSE_;
* INTERPRETING OUTPUT;
* PROC UNIVARIATE LISTING DISPLAYS THIS MINIMUM AND THE