Strengthening Policy Analysis - Econometric tests using microcomputer software + disk (IFPRI, 1995, 166 p.)

### 5. Deficient Data Problems

INFLUENTIAL OBSERVATIONS

When an observation has an unusually large or small value for the dependent variable or for a regressor, that observation can substantially influence a regression. It is helpful to be able to detect and identify such observations in order to check whether they are erroneous values. The basic idea in detection is to examine how the omission of a suspect observation affects the overall regression fit and the parameter estimates. If the effect is “large,” then the relevant data point is considered to be an “influential observation.”

DFFITS

The DFFITS statistic is a standardized measure of the effect of dropping the ith observation on the fitted value of the dependent variable. DFFITS is calculated as

DFFITSi = (ŷi - ŷi(i)) / (si √hii),

where

ŷi = ith OLS fitted value of the dependent variable, yi;

ŷi(i) = fitted value of yi, after deleting the ith observation and reestimating the parameters;

si = standard error of the residuals, with ith observation deleted;

hii = ith diagonal element of the projection matrix, X(X'X)^-1 X', that is, hii = xi(X'X)^-1 xi';

xi = ith row of X, the N × K matrix of explanatory variables;

N = number of observations; and

K = number of explanatory variables, including the constant term.

Formal critical values for DFFITS statistics have not been developed, but some rules of thumb have been suggested. One such rule states that an absolute value of DFFITSi greater than 2(K/N)^1/2 (DFFITS can be positive or negative) indicates an influential observation (Krasker, Kuh, and Welsch 1983). An alternative rule suggests 0.34 as a useful cutoff (Welsch 1980).

DFFITS statistics are calculated following the steps described below.

Step 1 Perform OLS on the full data set and retain the fitted y-values.

Step 2 Loop through the data, at the ith pass deleting the ith observation, and perform Steps 3 through 5.

Step 3 With the data set reduced by one observation, calculate the OLS coefficients and the standard error of the residuals.

Step 4 Using the Xi values, calculate the fitted value of yi.

Step 5 Calculate the DFFITS statistic according to the formula above.
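The steps above can be sketched in modern terms as follows. This is an illustrative Python/NumPy translation, not one of the book's sample programs; the function and variable names are ours.

```python
import numpy as np

def dffits(y, X):
    """DFFITS for each observation via leave-one-out refits.

    y : (n,) dependent variable; X : (n, k) regressors incl. constant.
    Follows the looped recipe: refit OLS with observation i deleted,
    then scale the change in the ith fitted value by s_(i) * sqrt(h_ii).
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # Step 1: full-sample OLS
    yhat = X @ beta                                   # full-sample fitted values
    XtX_inv = np.linalg.inv(X.T @ X)
    stats = np.empty(n)
    for i in range(n):                                # Step 2: loop over sample
        keep = np.arange(n) != i
        g, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # Step 3: coeffs w/o obs i
        e = y[keep] - X[keep] @ g
        s_i = np.sqrt(e @ e / (n - k - 1))            # std err of residuals, obs i deleted
        h_ii = X[i] @ XtX_inv @ X[i]                  # leverage (ith diagonal of projection matrix)
        stats[i] = (yhat[i] - X[i] @ g) / (s_i * np.sqrt(h_ii))  # Steps 4-5
    return stats
```

Observations with |DFFITS| above 2*(k/n)**0.5 would then be flagged as influential, as in the rule of thumb above.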

The sample programs for DFFITS calculation, Figures 51 through 53, estimate the model that has been used in all other sample programs. Using the 2(K/N)1/2 cutoff, all three sample programs find a total of 100 (out of 1,624) influential observations. The largest 10 are printed.

Note that SPSS/PC+ and SAS PC label these calculations differently in their preprogrammed options. In SAS PC, the option called DFFITS produces what is described in the text as DFFITSi. SPSS/PC+, however, calculates the same statistic for each observation, but the procedure that generated the numbers is called SDFIT.

 NOTE: DFBETAS is another procedure in SPSS/PC+ that assesses the sensitivity of regression estimates to the deletion of the ith data point.

Recommended references: Kennedy (1992, 284, 285); Kmenta (1986, 424-426); Krasker, Kuh, and Welsch (1983); Maddala (1988, 417-418); Welsch (1980).

Figure 51 - Sample program for DFFITS calculation, in GAUSS-386

```
/********************************************************************
* PROGRAM:  DFFITS.G          SOFTWARE: GAUSS-386 V3.0
*           FILENAME          DESCRIPTION
* INPUTS:   DATA.DAT          GAUSS-386 DATA SET
* PURPOSE:  CALCULATE DFFITS STATISTICS FOR ALL
*           OBSERVATIONS AND REPORT THOSE THAT ARE
*           LARGE. THE RUNNING TIME FOR THIS
*           PROGRAM IS ABOUT 3 HOURS WITH THE
*           FULL DATA SET. USE A SUBSET OF THE
*           DATA FOR FASTER TURNAROUND TIME.
********************************************************************/

/* NOTE: RUN TIME IS ABOUT 110 MINUTES ON 486DX2-66 */

FORMAT /M2 /RD 12,4;
OUTPUT FILE = DFFITS.OUT RESET;

NAMES = GETNAME("DATA");
OPEN D = DATA VARINDXI;
NCASE = ROWSF(D);
DATA = READR(D,NCASE);
F = CLOSE(D);

Y = DATA[.,IY1];
X = ONES(NCASE,1) ~ DATA[.,IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15
    ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3];
NAMES = NAMES[IX1 IX2 IX8 IX9 IX10 IX13 IX14 IX15
    ID1 ID2 ID3 ID5 ID6 ID7 ID8 IRD1 IRD2 IRD3,.];

@-------- OLS ESTIMATION --------@
K = COLS(X);
B = INV(X'X)*X'Y;                        @ BETAS @
YHAT = X*B;                              @ FITTED VALUES @
E = Y - YHAT;                            @ RESIDUALS @
RSS = E'E;                               @ RESIDUAL SUM OF SQ @
SER = SQRT(INV(NCASE-K)*RSS);            @ STD ERR OF REGRESSION @
RSQ = 1 - RSS/((NCASE-1)*(STDC(Y))^2);   @ R-SQUARED @
COV = INV(NCASE-K)*RSS*INV(X'X);         @ OLS COVARIANCE MATRIX @
SE = SQRT(DIAG(COV));                    @ STD ERRS OF BETAS @
T = B ./ SE;                             @ T-STATISTICS FOR BETAS @
PT = 2*CDFTC(ABS(T),(NCASE-K));          @ P-VALUES @
PRN = B ~ SE ~ T ~ PT;                   @ FOR PRINTING @

" ";" ";" ";
"              OLS RESULTS";" ";" ";
" NUMBER OF OBSERVATIONS       = ";; NCASE;
" STANDARD ERROR OF REGRESSION = ";; SER;
" RESIDUAL SUM OF SQUARES      = ";; RSS;
" R-SQUARED                    = ";; RSQ;
" ";" ";
" VARIABLE        COEFF   STD ERROR     T-RATIO     P-VALUE";" ";
" INTERCEPT ";; PRN[1,.];
I = 1;
DO WHILE I <= K-1;
    FORMAT /M1 /RD 12,8; $NAMES[I,.];;
    FORMAT /M1 /RD 12,4; PRN[I+1,.];
    I = I + 1;
ENDO;
" ";"\f";

@-------- CONSTRUCT VECTOR OF DFFITS STATISTICS     --------@
@-------- THE MATRIX DFFITS CONTAINS "OBSERVATION   --------@
@-------- NUMBER" IN THE FIRST COLUMN AND THE       --------@
@-------- CORRESPONDING DFFITS STATISTIC IN THE     --------@
@-------- SECOND COLUMN.                            --------@

N = NCASE;
I = 1;
YO = Y;
XO = X;
COUNT = SEQA(1,1,NCASE);
CLEAR DATA COV SE T PT PRN Y X;
XXI = INV(XO'XO);
H = ZEROS(N,2);
OUTPUT FILE = DFFITS.OUT OFF;

DO WHILE I <= N;                 @ LOOP OVER WHOLE SAMPLE @
    YI = YO[I,.];
    XI = XO[I,.];
    Y = SELIF(YO,COUNT[.,1] .NE I);
    X = SELIF(XO,COUNT[.,1] .NE I);
    G = INV(X'X)*X'Y;
    YHATI = XO[I,.]*G;
    E = Y - X*G;
    SERI = SQRT(INV(NCASE - K - 1)*E'E);
    HAT = XO[I,.]*XXI*XO[I,.]';
    DFFITS = (YHAT[I,.] - YHATI) / (SERI*SQRT(HAT));
    H[I,1] = I;
    H[I,2] = DFFITS;
    "LOOP I = ";; I;
    I = I + 1;
ENDO;                            @ END OF LOOP @

OUTPUT FILE = DFFITS.OUT ON;

@-------- SELECT |DFFITS| VALUES GREATER THAN 2*SQRT(K/N) --------@
CUT = 2*SQRT(K/N);
H = ABS(H);
H = SELIF(H,H[.,2] .> CUT);
ND = ROWS(H);
H = REV(SORTC(H,2));

" ";" ";" ";
" TEN LARGEST ABS(DFFITS) GREATER THAN 2*SQRT(K/N)";" ";" ";
FORMAT /M1 /RD 8,0;
ND;; "OBSERVATIONS HAVE ABSOLUTE VALUES > 2*SQRT(K/N);";
FORMAT /M1 /RD 12,4;
" ";" ";
" OBSERVATION    DFFITS STATISTIC";" ";
I = 1;
DO WHILE I <= 10;
    FORMAT /M1 /RD 12,0; H[I,1];;
    FORMAT /M1 /RD 12,4; H[I,2];
    I = I + 1;
ENDO;
"\f";
OUTPUT FILE = DFFITS.OUT OFF;
SYSTEM;
```

Figure 52 - Sample program for DFFITS calculation, in SAS PC

```
****************************************************************
* PROGRAM:  DFFITS.SAS        SOFTWARE: SAS PC 6.04
*           FILENAME          DESCRIPTION
* INPUTS:   DATA.SSD          SAS DATA SET
* PURPOSE:  CALCULATE DFFITS STATISTICS FOR ALL
*           OBSERVATIONS AND REPORT THE 10 LARGEST.
****************************************************************;

LIBNAME CDRV 'C:\DATA';

PROC REG DATA = CDRV.DATA;
    MODEL Y1 = X1 X2 X8 X9 X10 X13 X14 X15
               D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
    OUTPUT OUT = HAT1 DFFITS = ODFFIT;
RUN;

* LIST THE 10 LARGEST VALUES OF ODFFIT.;
DATA HAT2;
    SET HAT1;
    AODFFIT = 1/ABS(ODFFIT);
RUN;

PROC RANK DATA=HAT2 OUT=RHAT;
    VAR AODFFIT;
    RANKS RAODFFIT;
RUN;

PROC SORT DATA=RHAT;
    BY RAODFFIT;
RUN;

PROC PRINT DATA = RHAT (OBS = 10);
    VAR RAODFFIT ODFFIT Y1 X1 X2 X8 X9 X10 RD1 RD2 RD3;
RUN;

* NOTE THAT SAS EMPLOYS A DIFFERENT CALCULATION FOR THE PROCEDURE IT
* LABELS AS DFFITS THAN DOES SPSS.
* SAS DFFITS = SPSS SDFIT (STANDARDIZED VERSION OF WHAT SPSS LABELS AS
* DFFITS).;
```

Figure 53 - Sample program for DFFITS calculation, in SPSS/PC+

```
SET MORE OFF.
SET LIS = 'DFFITS.LIS'.
SET LOG = 'DFFITS.LOG'.
****************************************************************
* PROGRAM:  DFFITS.SPS        SOFTWARE: SPSS/PC+ 4.01
*           FILENAME          DESCRIPTION
* INPUTS:   DATA.SYS          SPSS/PC+ DATA SET
* PURPOSE:  CALCULATE DFFITS STATISTICS FOR ALL
*           OBSERVATIONS AND REPORT THE 10 LARGEST.
****************************************************************.

GET FILE = 'DATA.SYS'.

REG VAR=Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
    /DEP=Y1
    /METHOD=ENTER
    /SAVE SDFIT(ODFFIT).

* LIST THE 10 LARGEST VALUES OF ODFFIT.
COMPUTE ABSDFIT=ABS(ODFFIT).
RANK ABSDFIT.
SORT RABSDFIT (D).
N 10.
LIST RABSDFIT ODFFIT Y1 X1 X2 X8 X9.

* NOTE THAT SPSS EMPLOYS A DIFFERENT CALCULATION FOR THE PROCEDURE IT
* LABELS AS DFFITS THAN DOES SAS.
* SPSS SDFIT (STANDARDIZED VERSION OF WHAT SPSS LABELS AS DFFITS) =
* SAS DFFITS.
FINISH.
```

Bounded Influence Estimation

As noted by Maddala (1988), the conventional approach to outliers based on least squares residuals is to delete observations with large residuals and reestimate the equation. Given that the OLS residuals do not provide any readily useful information as to the importance of a given observation for overall results, a number of alternative procedures for dealing with outliers have been developed. The Bounded Influence Estimation (BIE) of Welsch (1980) is designed to evaluate the influence of individual observations, and to weight influential observations by a weight that is inversely related to the measure of influence. Thus, highly influential observations are not deleted (reducing the degrees of freedom and throwing out potentially useful information), but their influence is reduced. The measure of influence used is the DFFITS measure discussed in the previous section.

The simple one-step BIE developed by Welsch is defined as the value of b that minimizes

Σ (wi[yi - xib])²,

where

wi = 1, if |DFFITSi| ≤ 2(K/N)^1/2,

and

wi = 2(K/N)^1/2 / |DFFITSi|, if |DFFITSi| > 2(K/N)^1/2.

Thus, for noninfluential observations, wi=1. If wi=1 for all i, then this is the OLS estimator. Essentially, this technique places observations into two distinct regimes, on the basis of their DFFITS values, and then observations are weighted accordingly. However, the BIE technique should not be used as a substitute for a careful examination of the data-generating process. It may be the case that the influential observations are only exceptional because the model is inappropriate or because observations are inappropriately pooled.
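This two-regime weighting can be sketched compactly. The following is an illustrative Python/NumPy version of the weighted-least-squares step, not the book's code; the cutoff 2*(k/n)**0.5 is the one used in the sample programs, and all names are ours.

```python
import numpy as np

def bounded_influence_wls(y, X, dffits_vals):
    """One-step bounded influence estimate (Welsch-style sketch).

    Noninfluential observations get weight 1; observations whose
    |DFFITS| exceeds the cutoff are downweighted in inverse
    proportion to their influence, then WLS is run.
    """
    n, k = X.shape
    cutoff = 2 * np.sqrt(k / n)
    w = np.where(np.abs(dffits_vals) <= cutoff,
                 1.0,
                 cutoff / np.abs(dffits_vals))
    # minimize sum_i (w_i * (y_i - x_i b))^2  <=>  OLS on scaled data
    b, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return b, w
```

If no observation exceeds the cutoff, all weights equal 1 and the estimator reduces to OLS, as noted above.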

The sample programs (Figures 54 through 56) are extensions of the DFFITS programs presented in the DFFITS section. The regression results do not change very much when the BIE technique is employed (for example, the OLS coefficient on X10 is 216.97, and the BIE estimator for X10 is 217.994).

Recommended references: Kennedy (1992, 282, 284-285); Maddala (1988, 418); Welsch (1980).

Figure 54 - Sample program for estimating bounded influence, in GAUSS-386

Figure 55 - Sample program for estimating bounded influence, in SAS PC

```
****************************************************************
* PROGRAM:  BIE.SAS           SOFTWARE: SAS PC 6.04
*           FILENAME          DESCRIPTION
* INPUTS:   DATA.SSD          SAS PC DATA SET
* PURPOSE:  BOUNDED INFLUENCE ESTIMATION.
****************************************************************;

LIBNAME CDRV 'C:\DATA';

* STEP 1: RUN REGRESSION USING COMPLETE DATA SET (MATRIX1). SAVE DFFITS
* STATISTIC IN VARIABLE DFT, AND WRITE TO FILE, INFL.;
PROC REG DATA = CDRV.DATA;
    MODEL Y1 = X1 X2 X8 X9 X10 X13 X14 X15
               D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
    OUTPUT OUT = HAT DFFITS = ODFFIT;
RUN;

* STEP 2: MERGE DATA SETS MATRIX1 AND INFL. THE VARIABLE, DFT, IS
* REPEATED FOR EACH OBSERVATION IN MATRIX1. EVALUATE CONDITION GIVEN BY
* EQN ABOVE, AND CREATE NEW (WEIGHT) VARIABLE, W.;
DATA DFDATA;
    MERGE CDRV.DATA HAT;
    CUTOFF = 2 * ((19 / 1624) ** .5);
    IF ABS(ODFFIT) LE CUTOFF THEN W = 1;
    ELSE W = CUTOFF/ABS(ODFFIT);
RUN;

* STEP 3: RUN NEW (WEIGHTED LEAST SQUARES) REGRESSION.;
PROC REG DATA = DFDATA;
    MODEL Y1 = X1 X2 X8 X9 X10 X13 X14 X15
               D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3;
    WEIGHT W;
RUN;

* FOR THIS EXAMPLE, NOTICE THAT THE COEFFICIENTS CHANGE SLIGHTLY
* DUE TO THIS PROCEDURE. FOR EXAMPLE, IN THE OLS MODEL THE COEFFICIENT ON
* X10 IS 216.97, BUT IN THE BOUNDED INFLUENCE ESTIMATES THE COEFFICIENT
* ON X10 IS 217.994.;
```

Figure 56—Sample program for estimating bounded influence, in SPSS/PC+

```
SET MORE OFF.
SET LIS = 'BIE.LIS'.
SET LOG = 'BIE.LOG'.
***************************************************************
* PROGRAM:  BIE.SPS           SOFTWARE: SPSS/PC+ 4.01
*           FILENAME          DESCRIPTION
* INPUTS:   DATA.SYS          TEST DATA SET
* PURPOSE:  BOUNDED INFLUENCE ESTIMATION.
*************************************************************.

GET FILE = 'DATA.SYS'.

* STEP 1: RUN REGRESSION USING COMPLETE DATA SET (MATRIX1).
* SAVE DFFIT IN VARIABLE DFT.
REG VAR=Y1 X1 X2 X8 X9 X10 X13 X14 X15 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
    /DEP=Y1
    /METHOD=ENTER X1 X2 X8 X9 X10 X13 X14 X15
                  D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
    /SAVE SDFIT(ODFFIT).

* STEP 2: UTILIZE A SIMPLE IF STATEMENT TO CREATE NEW VARIABLE, W,
* TO BE USED IN A WEIGHTED LEAST SQUARES.
COMPUTE CUTOFF = 2 * ((19 / 1624) ** .5).
COMPUTE W = CUTOFF/ABS(ODFFIT).
IF (ABS(ODFFIT) LE CUTOFF) W = 1.

* STEP 3: RUN A SECONDARY REGRESSION, UTILIZING THE NEWLY CONSTRUCTED
* VARIABLE, W, AS A WEIGHT.
REGRESSION
    /VARIABLES = Y1 X1 X2 X8 X9 X10 X13 X14 X15
                 D1 D2 D3 D5 D6 D7 D8 RD1 RD2 RD3
    /REGWGT = W
    /DEPENDENT = Y1
    /METHOD=ENTER.

* FOR THIS EXAMPLE, NOTICE THAT THE COEFFICIENTS CHANGE SLIGHTLY
* DUE TO THIS PROCEDURE. FOR EXAMPLE, IN THE OLS MODEL THE COEFFICIENT ON
* X10 IS 216.97, BUT IN THE BOUNDED INFLUENCE ESTIMATES THE COEFFICIENT
* ON X10 IS 217.994.
FINISH.
```

MISSING DATA

An obvious problem for estimation occurs when a data set is incomplete, such as when a survey respondent only partially completes a questionnaire. The easiest solution is to simply drop the observations that are incomplete. If the quality of information for a particular observation is very poor, this may be the only reasonable solution. However, given the often high cost of gathering data and the fact that discarding data reduces the precision of estimators, this solution is often resisted. An alternative, if relatively few pieces of information are missing, is to try to fill in the blanks. As alternatives to dropping observations, the following two procedures are easily implemented:

· Zero-Order Regressions. If both regressors and dependent variables have missing values, these regressions—in which the missing data are replaced by sample means—may be used.

· First-Order Regressions. If only the regressors have missing values, these regressions—in which the missing values are first estimated by considering the relationships among all of the regressors—may be used.

Simple Zero-Order Regressions (Mean Substitution)

Let X denote an N × K matrix of regressors. Assume that a single column of X, Xk, has a number of missing observations.

Let Y denote an N × 1 dependent variable. Assume that Y also has a number of missing observations; they need not be the same observations as those missing from Xk.

The strategy is simply to replace the missing observations of Xk and Y by their mean values over the complete observations. Greene (1990, 285-289) summarizes known results for this strategy and concludes that using mean Y values from complete observations to impute values for missing Ys is a poor strategy that is unlikely to yield any gain to the researcher. Greene also points out (footnote 16, page 287) that replacing missing X-values by their means does not yield unbiased results, as suggested by Kmenta (1986). The zero-order regression strategy is therefore not pursued any further.

Recommended references: Greene (1990, 285-289); Kmenta (1986, 379-387).

First-Order Regressions (Incidental Equations)

In contrast to zero-order regressions, the incidental equations method may enable the researcher to exploit information contained in correlations among Xs to impute some missing values of a regressor. In the sample programs, every twentieth observation on Xk (= X10) (beginning with number 20) is coded as missing.

Step 1 Using only those observations with complete data, regress Xk on the variables in X for which no observations are missing (all variables except Xk); call this matrix Z. Retain the estimated coefficients from regressing Xk on Z.

Step 2 Compute fitted values for the missing values of Xk, using the estimated regression coefficients and the relevant observations on Z. So, if the seventh observation in Xk is missing, use the seventh observation on Z together with the coefficients from Step 1 to fit Xk,7.

Step 3 Substitute these fitted values for the missing observations in Xk, and proceed with the intended regression.

Figures 57 through 59 are sample programs for calculating first-order regressions when data are missing.

NOTES:

1. Maddala (1977) suggests that if the correlations among the regressors in an equation are moderately high, this first-order method is preferable to the zero-order method.

2. Kmenta (1986) argues that the first-order method implicitly defines a system of simultaneous equations (because Xk is both a dependent and an independent variable) and that the method may therefore be theoretically unsound. In addition, Kmenta warns against the introduction of measurement error to Xk through this type of interpolation.

3. All three programs produce estimates that are similar to those with no missing values. For instance, the estimated coefficient on X10 with missing values (5 percent of observations on X10 coded as missing) is 217.19, as opposed to 216.97 with no missing values on X10.

Recommended references: Afifi and Elashoff (1966, 1967, 1969); Greene (1990, 285-289); Haitovsky (1968, 67-82); Kmenta (1986, 379-388); Maddala (1977, 201-207).

Figure 57 - Sample program for calculating first-order regressions when data are missing, in GAUSS-386