Methods for the Evaluation of the Impact of Food and Nutrition Programmes (UNU, 1984, 287 p.)

12. Data recording and processing

Michael Guzman, Ricardo Sibrian, and Rafael Flores


Introduction

In contrast with laboratory investigations, which commonly give rise to relatively few observations, large-scale nutrition intervention programmes require the collection, orderly handling, and management of large quantities of data.

Since the data ultimately constitute the link between the design of the intervention and the evaluation of results, their management and handling clearly merit careful consideration. In this context, the procedures required for the collection of data and their subsequent treatment, which include the definition of the plan for analysis and the expected outputs, should be an integral part of the study design. Thus, such procedures should be explicitly defined in the standard operating protocol (SOP) of the study.

Because of the highly specialized nature of the skills required for both proper data management and analysis, it is advisable that specialists in these fields be included as part of the evaluation staff. Under such an arrangement, these specialists fully participate in the planning and execution of the evaluation.

Some basic procedures relating to various aspects of data recording and processing are described in this chapter. Although the coverage is neither complete nor exhaustive, it is hoped that the topics considered provide general guidelines that may serve as a frame of reference for identifying appropriate data management procedures under the specific circumstances of a particular study.

The processes to be described can best be summarized in a gross flow chart, as illustrated in figure 12.1 (see FIG. 12.1. Stages of Data Recording and Processing). Obviously, the different stages depicted here on a macro basis can, and must, be expanded in detail in accord with the conditions pertaining to any specific investigation. Two examples of such expansion are presented later in the text, in connection with the preparation of forms and questionnaires and the description of the sequence of events that relate to the process of data analysis.

Data recording

The general purpose of data recording is to set in writing and assure the preservation of the data collected in the course of field or laboratory studies.

The experimental design of each study determines the types of data to be collected in terms of the objectives and the resources available for the study. The types of data commonly used in field studies often relate to morbidity, anthropometry, diet, immunology, and anthropology, among others. Whatever the types of data, however, suitable forms or questionnaires are needed to record the information to be gathered. It is often convenient to prepare these forms or questionnaires by discipline or type of data. The use of precoded forms or questionnaires that permit the direct registry of data is to be preferred since, with proper training, their use often results in fewer errors. Additionally, only one protocol or set of forms should be used to collect and code the information to be recorded in the field or in the laboratory for each unit of study (e.g., family or individual).


Form or Questionnaire Preparation

The objective of this stage is to produce all the needed forms and/or questionnaires in their final versions, as they will be used in the field or laboratory. These forms and questionnaires must be accompanied by a set of detailed instructions explicitly set out in a coding manual. In general, the preparation of forms or questionnaires involves three steps comprising a series of coordinated actions, as shown schematically in figure 12.2 (see FIG. 12.2. Coordinated Actions in the Preparation of Forms and Questionnaires for Data Recording).

The forms and questionnaires contain the information needed by both the investigator and the data processing personnel, and generally consist of two parts: a heading and a body.

The heading of the forms or questionnaires includes information needed mainly to prepare appropriate data files in accord with the objectives of the study as defined by the responsible investigator. The heading, however, may also include information to allow subject recall by the investigator, either for further interviewing or for checking the original recordings. Clearly, the kind of items in this part of the form or questionnaire varies with the nature of the study, but generally it must include information of the type specified in the first 14 items of the scheme suggested in table 12.1.

As shown in table 12.1, the body of the form begins at item 15 and includes the actual data and information required to satisfy the objectives of the study. As many fields and digits as are necessary to complete the recording may be used in the body of the form. However, it is always advisable to consult with the personnel who will be responsible for the data processing and analysis, to avoid problems related to data management.

TABLE 12.1. Sample Questionnaire Form

Item  Identification                                       Field Position
 1    General information (i.e., protocol page number)     Open (not for coding)
 2    General information (i.e., name of subject)          Open (not for coding)
 3    General information (i.e., address of subject)       Open (not for coding)
 4    Study identification                                 1-3
 5    Area (type of data)                                  4-5
 6    Form identification                                  6-7
 7    Date                                                 8-13
 8    Examiner/interviewer identification                  14-15
 9    First level of enquiry (country)                     16-17
10    Second level of enquiry (community)                  18-19
11    Third level of enquiry (family)                      20-21
12    Individual identification                            22-25
13    Sex                                                  26
14    Birth date                                           27-32
15    Data                                                 33-
 .    .                                                    .

Some general comments about the heading or identification portion of the data record are in order. Each study and type of data or area should be assigned a code. For each type of data there may be as many forms as needed for complete recording, and therefore, each form also requires the assignment of a proper code identification. Since the study sample generally relates to country, region, community or similar geographic location classes, these items also must be identified a priori with specific codes. The data processing personnel, in the computer center or elsewhere, who will be responsible for handling the data for a given study should collaborate with the investigator in the assignment of these codes since, as stated earlier, these will be used mainly to organize and control the files and expected outputs within the system of data processing.

With the above information, the investigator will prepare a first version of the forms or questionnaires and a first version of the corresponding coding manual. In particular, the coding manual must provide specific answers to the following questions:

  1. How is the form or questionnaire to be filled in?
  2. How is each item included in the form or questionnaire to be coded?

Once the researcher has developed the first version of the forms and questionnaires, the next steps involve the application of procedures for testing and revising the original drafts. For this purpose the investigator will use a small sample (10-20 experimental units) to actually carry out the complete process of data collection; in the process the investigator will check all forms and questionnaires for ease of handling and use under field conditions. The adequacy of instructions and codes in the actual process of recording data also will be tested at this time.

The field tests will permit proper adjustments and improvement of the recording forms and accessory materials, prior to preparing them for production in sufficient volume to satisfy the needs of the study. The investigator must also consult with the personnel responsible for processing data prior to producing the definitive versions of the forms and questionnaires to be used in the evaluation. In the particular case of questionnaires, their reliability should be scrutinized using appropriate test-retest procedures (1). The testing required for developing the forms and questionnaires offers the opportunity to include activities related to the training and coordinating of examiners and interviewers. Otherwise, the training and standardization procedures must take place later, but always prior to the initiation of actual data collection (2, 3).


Data Collection

Data collection can be initiated when the personnel responsible for it have been properly trained and have reached a satisfactory level of standardization. In addition, the forms, questionnaires, and coding manuals must be considered operational. The description of the recording forms and of the techniques and procedures to be employed should be integrated into the standard operating protocol (SOP) for the evaluation (2). In the course of long-term studies, changes in procedure may be mandatory. Accordingly, it is advisable to produce the SOP in loose-leaf form for ease of insertion as required. In this connection, however, it is essential that all changes introduced in the course of the evaluation be fully documented in terms of justification, nature of the change, and date of implementation.

Several types of errors may arise during the data collection stages and may produce biases affecting the interpretation of results. These errors are generally associated with failures to complete interviews, missing data, interviewer mistakes, conceptual misunderstandings, lack of knowledge, and intentional misrepresentations of the truth by respondents. To minimize the effects of these factors, special attention must be given to proper supervision throughout the data collection stages. Emphasis shall be placed on correct household selection, formulation of questions, recording of answers, and the application of proper follow-up procedures to reduce non-responses. Supervision can take place through direct observation by field supervisors and/or live recording of the interviews (4). In any case, full documentation of the execution of all aspects and levels of activity is essential. This includes field procedures, and data collection, editing, input, and analysis. In particular, the causes of missing data must be fully documented, since such information is essential for identifying possible biases arising from sample attrition.



Coding

This stage can be initiated even before the actual collection of data. For example, some items in the heading of the form can be precoded using computer facilities. Computers may also be used to produce self-printed forms containing information on the types of data to be collected, the geographic classification (country, community), and the observation unit (family, individual). More generally, however, forms and questionnaires are coded after data collection. In that case, it is advisable that the coding be completed as soon as possible, preferably on the same day that the data were collected.

Data processing

In general, data processing can be understood as the treatment given to the data after collection. In small evaluations, this treatment is usually manual. In the case of large-scale efforts, the bulk of the data handling requires access to computer facilities, although some parts of the data processing may be performed manually. In this context, both manual and computerized techniques are reviewed below, with special reference to large-scale studies.


Data Input

Recent advances in computer sciences provide a choice of alternatives for data input. These range from the use of the traditional punched card to direct access with automated systems using mark sense devices or direct on-line input from a measuring apparatus.

When the survey or research comprises a small number of cases and each case is evaluated in terms of many characteristics, an interactive data input procedure might be recommended, especially when the original data are generated in a place where facilities for data input (terminals) are available or can easily be installed. Interactive data input procedures provide the opportunity to test for completeness, inconsistencies, and errors at the data source. This often permits the implementation of proper procedures of data recovery. Unfortunately, this type of data input will undoubtedly have limited application in field evaluations.

When interactive data input procedures are not applicable, some type of key-to-tape data input system must be implemented. In such systems the speed of data recording can be high; however, immediate checks for completeness or inconsistency are not possible, since the processing of the data unavoidably takes place with a delay. Under these conditions, delayed checks for errors, completeness, and inconsistency are possible, although the recovery of data in most instances is practically impossible.


Data Quality Control

The control of data quality is a most important aspect of any research process. Once the data have been collected and coded, the control of their quality generally proceeds in two stages: the first relates to completeness, and the second to the internal consistency among the various items that comprise the data set.

The preliminary controls for completeness of the data are usually, but not necessarily, performed after the coding of the data is complete. The purpose of this exercise is to control for the inclusion of every required item in each observation vector, both in terms of identification and actual data items (variables).

As indicated earlier, the identification portion of the observation vector generally includes several items or bits of information that describe different individual characteristics. These descriptive items, considered in parallel with the evaluation or survey design, provide the reference criteria for the preliminary control of the completeness of the set of observation units. Thus, if there are three items in the identification portion of the observation vector, identified as a primary unit code (PU), a secondary unit code (SU), and the individual number within the SU (IN), then the identification for each observation vector would be a composite set of characters of the form PU-SU-IN. For example, if there are 25 PUs and the number of INs differs among the SUs within each PU, a complete inventory of the possible codes for each PU can be constructed. As an illustration of this hypothetical case, a list of the acceptable codes for a preliminary identification control of completeness for the set of observations is presented in table 12.2.

TABLE 12.2. List of Codes (inventory of codes) for PU (Primary units), SU (secondary units), and Number of IN (individuals)
In this example, the detailed identification codes for observations included in the first PU would be as follows:

SU1 01101, 01102, 01103, ···, 01110
SU2 01201, 01202, 01203, ···, 01237
···
SU6 01601, 01602, 01603, ···, 01606

In the same manner, the admissible identification codes can be listed in detail for each of the 25 PU.

The actual control for admissible identification codes can be carried out in different ways. One procedure is the simple visual checking of the existence of each recorded entry for each identification item required, without controlling explicitly for the validity of such entries. Another possibility is a complete and detailed visual checking of the list of valid identification codes, such as that presented in table 12.2., for every individual information vector. In this latter alternative, practical feasibility diminishes as the size of the observation set increases. However, the control of validity and completeness of the identification portion of each information vector can be performed accurately and efficiently on large observation sets through complex but automated systems using computer facilities.
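As an illustration of the automated alternative, the completeness and validity control against an inventory of admissible PU-SU-IN codes can be sketched as follows. The code layout (two-digit PU, one-digit SU, two-digit IN) and the SU sizes are assumptions modelled on the hypothetical listing above, not a prescribed scheme.

```python
# Sketch of an automated completeness/validity control for identification
# codes of the form PU-SU-IN. The code layout (2-digit PU, 1-digit SU,
# 2-digit IN) and the SU sizes below are illustrative assumptions.

def build_inventory(pu, su_sizes):
    """Admissible codes for one primary unit (PU).

    su_sizes maps each SU number to the number of individuals (IN)
    enrolled in it; e.g. '01101' is PU 01, SU 1, individual 01.
    """
    return {f"{pu:02d}{su:d}{ind:02d}"
            for su, n in su_sizes.items()
            for ind in range(1, n + 1)}

# Hypothetical sizes for PU 01 (cf. the listing above: SU1 holds 10
# individuals, SU2 holds 37, ..., SU6 holds 6).
inventory = build_inventory(1, {1: 10, 2: 37, 6: 6})

# Identification codes as actually keyed in; the last two are invalid.
recorded = ["01101", "01237", "01607", "01199"]

invalid = [c for c in recorded if c not in inventory]   # no place in the design
missing = sorted(inventory - set(recorded))             # expected but absent

print("invalid codes:", invalid)
print("number missing:", len(missing))
```

The same membership test scales to the full set of 25 PUs by building one inventory per PU and taking their union.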

The preliminary quality control procedures also relate to the checking of completeness for the remaining portion of the observation vector, which constitutes the actual data portion (variables) of each vector. In this case, special care is required to identify logical omissions of data which may be the valid result of logical associations among variables. For example, when one observation data item identifies a male subject, the observation vector for this subject cannot, and must not, include data items that refer to the number of pregnancies of the subject.

In the preliminary procedures for the quality control of data items, it is often possible to include obvious control items that require little effort in the checking process. For example, if the questionnaire is applicable only to adults, say those 18 years of age or older, it is possible, while checking the age information for completeness, to identify subjects under 18 years of age.

Incidentally, when the preliminary quality control procedures are applied soon after collection and coding, it often may be possible to recover missing bits of data by going back to the field. This possibility should always be considered, and the rules governing such procedures explicitly addressed in the SOP.

After satisfactorily completing this preliminary stage of data quality control checks, the first stage edited information vectors are ready for entry into appropriate devices for further processing. This is done prior to implementing computing procedures as required under the plan of analyses defined in the SOP.

Obviously, the number and magnitude of errors can be most efficiently reduced by improving, in the planning and testing stages, the procedures for collecting and enumerating data, rather than by increasing the number of a posteriori revisions and internal consistency checks. Independently of the adequacy of the collection and enumeration procedures, however, consistency controls are always essential. They contribute substantially to the "cleanliness" of the data. This type of quality control ranges from simple to fairly complicated checks, designed to detect contradictions in the data at different levels.

The processing required in the control of consistency generally relates to two types of variables: continuous variables (interval scaled) such as age, weight, height, temperature and blood values; and discrete variables (nominally or ordinally scaled) such as sex, race, marital status and birth order.

The actions to be taken when an error is detected through any checking procedure are as follows:

  1. Rechecking of the original data records to decide on recovery, acceptance or rejection;
  2. Automatic deletion of a specific datum;
  3. Automatic deletion of a specific datum with additional checking for decisions concerning data related to the questionable datum;
  4. Deletion of the complete observation vector (all variables in the observation).

Although there are many possibilities for consistency controls, the procedures to be applied generally relate to the check of admissible ranges, and the examination of arithmetic, logical or special relations among variables.

The check and control of admissible ranges applies to both continuous and discrete variables, since in the latter case the numeric codes assigned to the various classes can be examined in terms of the admissible numerical values of the codes that have been defined for a given variable. Range controls are usually applied to the basic data collected. They may also be applied to indices, ratios or any other forms of data derived from the original observations. The inclusion of derived data in the control of ranges often provides opportunities to detect inconsistencies that may not be apparent in the original data.
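A minimal sketch of such range controls, applied both to basic variables and to one derived ratio, might look as follows. The variable names and limits are illustrative only, not reference values.

```python
# Sketch of admissible-range controls on basic variables and on a derived
# datum; all names and limits are illustrative assumptions.

RANGES = {
    "age_months": (0, 240),      # basic, continuous
    "weight_kg": (1.0, 120.0),   # basic, continuous
    "height_cm": (40.0, 200.0),  # basic, continuous
    "sex": (1, 2),               # basic, discrete code: 1 = male, 2 = female
}

def range_errors(record):
    """Return the names of items falling outside their admissible range."""
    errors = [name for name, (lo, hi) in RANGES.items()
              if not (lo <= record[name] <= hi)]
    # Derived datum: the weight/height ratio can expose an inconsistency
    # that each basic variable, taken alone, would not reveal.
    ratio = record["weight_kg"] / record["height_cm"]
    if not (0.02 <= ratio <= 1.0):
        errors.append("weight_kg/height_cm")
    return errors

# Each basic value lies in range, yet the derived ratio is inadmissible.
record = {"age_months": 18, "weight_kg": 85.0, "height_cm": 78.0, "sex": 1}
print(range_errors(record))
```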

Since different variables within a case are often related, arithmetic relations among pairs of variables can also be used in internal consistency controls. Consider, for example, a pair of variables, A and B. In the consistency control procedures it is possible to check explicitly conditions such as A greater than B; that is, that the numeric value of A is always greater than the value of B, except in the situation of a "not applicable" answer for either A or B. An example is the number of persons in the family (variable A) and the number of children under five years of age in the same family (variable B). Similarly, the condition A greater than or equal to B can be defined and checked, as in the case where variable A is the number of children in the family and variable B is the number of children under five years of age in the same family. A simple numeric equality relation, A = B, generally would indicate duplication in the data, but it sometimes may be an appropriate criterion for consistency checks, as when data records are reshuffled and variable names are changed.

The types of consistency controls described above also can be applied to derived data. For example, if a new derived variable is the sum of components, A, B, and C, then, the derived variable D may be independently checked against the actual sum of the components (A+B+C). The arithmetic relationship controls are most often used when checking continuous variables (interval scaled), although it is possible to make limited use of such controls in the case of some discrete variables (ordinally scaled).
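The arithmetic-relation checks described above (A greater than B, A greater than or equal to B, and a derived sum checked against its components) can be sketched as follows, using hypothetical field names.

```python
# Sketch of arithmetic-relation consistency checks between variables,
# following the examples in the text; the field names are hypothetical.

def arithmetic_checks(rec):
    problems = []
    # A > B: persons in the family must exceed children under five
    # (the family must contain at least one older member).
    if not rec["persons"] > rec["children_under5"]:
        problems.append("persons <= children_under5")
    # A >= B: children in the family is at least children under five.
    if not rec["children"] >= rec["children_under5"]:
        problems.append("children < children_under5")
    # Derived variable D checked against the actual sum of its components.
    if rec["d_total"] != rec["a"] + rec["b"] + rec["c"]:
        problems.append("d_total != a + b + c")
    return problems

# This record violates the A >= B relation only.
rec = {"persons": 6, "children": 2, "children_under5": 3,
       "a": 10, "b": 4, "c": 1, "d_total": 15}
print(arithmetic_checks(rec))
```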

Controls of logical relations are based on the dependency of one variable on another. For example, variable C can take on different specific values, or a range of values, depending on the values of variables A and/or B. More specifically, if C is the weight of a child, A the sex of that child, and B his age, and if a criterion of range can be defined for weight depending upon sex (A) and age (B), then the weight (C) of an 18-month-old male child should be a value within the admissible range corresponding to his sex (male) and age (18 months).
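A sketch of such a dependent-range control follows. The weight limits are invented for illustration and are not actual growth-reference values.

```python
# Sketch of a logical-relation control: the admissible range for weight (C)
# depends on sex (A) and age (B). The limits below are invented for
# illustration and are NOT actual growth-reference values.

WEIGHT_LIMITS = {
    # (sex, age band in months): (min kg, max kg)
    ("male", (12, 23)): (7.0, 15.0),
    ("female", (12, 23)): (6.5, 14.5),
}

def weight_is_admissible(sex, age_months, weight_kg):
    """True/False against the criterion; None if no criterion is defined."""
    for (s, (lo_age, hi_age)), (lo_w, hi_w) in WEIGHT_LIMITS.items():
        if s == sex and lo_age <= age_months <= hi_age:
            return lo_w <= weight_kg <= hi_w
    return None  # no criterion defined for this sex/age combination

print(weight_is_admissible("male", 18, 10.8))   # within the admissible range
print(weight_is_admissible("male", 18, 28.0))   # outside the admissible range
```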

The control of logical relations is applicable to many conditions, but always must be based on criteria specifically defined by the researcher. The logical criteria required are generally presented in the form of "if-then" or "if and only if" statements. A general outline of some common logical control criteria is presented in table 12.3. (see TABLE 12.3. General Outline and Examples of Different Kinds of Logical Checks).

TABLE 12.3. General Outline and Examples of Different Kinds of Logical Checks

One-way checks — prototype: if A = x then B = y
Two-way checks — prototype: A = x if and only if B = y

1. Between variable pairs

   One-way examples:
   a. If A = 1 then B = 1, 3, 5, 7
   b. If A = 1, 3 then B = 4, 6
   c. If A = 1 - 10 then B = 3 - 18
   d. If A = 1 - 4, 8 - 12 then B = 2 - 12

   Two-way examples:
   e. A = 1 if and only if B = 5
   f. A = 1, 3, 6 if and only if B = 2, 5, 6
   g. A = 1 - 10 if and only if B = 3 - 18
   h. A = 1 - 4, 8 - 12 if and only if B = 5 - 7, 15 - 30

2. Among several variables

   One-way examples:
   i. If A = 1 and B = 2 then C = 5, 10, 11
   j. If A = 1 - 5 and B = 10 - 20 then C = 2 - 8
   k. If A = 1 - 5, B = 1 - 5 and C = 1 then D = 5 - 10

   Two-way examples:
   l. A = 1 and B = 2 if and only if C = 5, 10, 11
   m. A = 1 - 5 and B = 10 - 20 if and only if C = 2 - 8

The basic difference between one-way and two-way controls relates to the uniqueness of the correspondence. For example, when checking the criterion "if A = 1 then B = 1, 3, 5, 7" (example 1.a in table 12.3), a finding A = 1 implies that 1, 3, 5, or 7 are acceptable answers for B, but it does not imply that if B = 1, 3, 5, or 7, then A is necessarily equal to 1. In a two-way control the "if-then" statement becomes an "if and only if" statement, as in example 1.e in table 12.3. In this case, when checking the criterion A = 1 if and only if B = 5, just as the finding A = 1 implies B = 5, conversely the finding B = 5 implies A = 1.
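The distinction can be made concrete in a short sketch, using examples 1.a and 1.e of table 12.3.

```python
# Sketch of one-way ("if-then") and two-way ("if and only if") checks
# between a pair of coded variables, as in examples 1.a and 1.e above.

def one_way_ok(a, b, a_value, b_allowed):
    """If A = a_value then B must be in b_allowed; other values of A pass."""
    return a != a_value or b in b_allowed

def two_way_ok(a, b, a_value, b_value):
    """A = a_value if and only if B = b_value."""
    return (a == a_value) == (b == b_value)

# Example 1.a: if A = 1 then B = 1, 3, 5, 7.
print(one_way_ok(1, 5, 1, {1, 3, 5, 7}))  # consistent
print(one_way_ok(1, 4, 1, {1, 3, 5, 7}))  # inconsistent: A = 1 but B = 4
print(one_way_ok(2, 4, 1, {1, 3, 5, 7}))  # consistent: rule silent when A != 1

# Example 1.e: A = 1 if and only if B = 5.
print(two_way_ok(1, 5, 1, 5))  # consistent
print(two_way_ok(2, 5, 1, 5))  # inconsistent: B = 5 requires A = 1
```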

The description of the preliminary control of data laid emphasis on careful procedures for verifying the completeness of the identification items. In addition to completeness, it is also essential to check for inconsistencies in the identification. In this connection, special procedures, such as look-up systems using binary search techniques for possible identifiers and self-checking identification number systems (for example, modulus 10 and modulus 11 techniques), can be used effectively for checking inconsistencies in the identification portion of the information vector. As can be expected, however, error identification by the control checks described above is not exhaustive. Special situations may arise for any of the variables of interest. In the data processing required for detecting errors, these cases may be handled by including in the data editing programme one or more appropriate subroutines to check special relations or conditions pertaining to a specific set of data items or observation vectors.
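As an illustration of a self-checking identification number, one common modulus-11 variant can be sketched as follows. Weighting schemes differ between systems, so this is an assumption for illustration, not a prescribed standard.

```python
# Sketch of a modulus-11 self-checking identification number, one common
# variant of the technique mentioned in the text (weighting schemes differ
# between systems; this one is an illustrative assumption).

def mod11_check_digit(digits):
    """Compute the modulus-11 check digit for a string of digits.

    Positions are weighted len(digits)+1 ... 2 from left to right; a
    remainder of 10 is conventionally written as 'X'.
    """
    total = sum(int(d) * w
                for d, w in zip(digits, range(len(digits) + 1, 1, -1)))
    r = (11 - total % 11) % 11
    return "X" if r == 10 else str(r)

def mod11_valid(code):
    """True when the last character is the correct check digit."""
    return code[-1] == mod11_check_digit(code[:-1])

code = "01101" + mod11_check_digit("01101")
print(code, mod11_valid(code))   # a valid self-checking code
print(mod11_valid("011110"))     # single-digit keying error is detected
```

Modulus-11 schemes of this kind detect all single-digit errors and all adjacent transpositions, which is why they are favoured for identification numbers keyed by hand.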

Special cases require individual attention, and in this connection general techniques such as sorting and searching are useful. The choice of specific techniques for sorting or searching in a particular situation will depend on the way the main set of data items relate to each other in the observation vector, and this in turn will determine the type of controls to be applied to the individual data items in the observation vector. For example, when checking a combination of codes or a coded data item, binary search may be the procedure of choice for looking up the acceptability of the recorded set or the coded data items, since generally there is no continuous sequence in the structure of such codes. However, when a continuous data structure is used, a direct searching technique may be the method of choice. Another type of checking which may be useful is the "route of answer check." In this instance, an answer to a specific question is not applicable for a subset of the data vector. A tree describing the "route of answers" within the allowed answers to the questions under consideration permits the construction of an ordered path based on the relations among the items contained in the data vector.
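A binary-search look-up over a sorted inventory of admissible, non-sequential codes might be sketched as follows; the code list is hypothetical.

```python
# Sketch of a binary-search look-up of admissible (non-sequential) codes,
# as suggested for checking coded items; the code list is hypothetical.

import bisect

# Sorted inventory of admissible codes (note: no continuous sequence).
ADMISSIBLE = sorted(["0110", "0124", "0237", "0455", "0901", "1203"])

def code_exists(code):
    """Binary search: O(log n) membership test in the sorted inventory."""
    i = bisect.bisect_left(ADMISSIBLE, code)
    return i < len(ADMISSIBLE) and ADMISSIBLE[i] == code

print(code_exists("0237"))  # an admissible code
print(code_exists("0238"))  # not in the inventory
```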


Data Bank

The quality control of data will produce clean files for each type of data collected. A properly identified and cross-related set of such files is called a Data Bank.

The master data file will be created from the data bank by merging the individual files using proper identification keys: study, data type, form identification, family, individual, date, and examiner, for example. It is important to stress the need for complete and full documentation of the structure of the master data file, since this provides the keys and criteria needed for manipulating the information it contains. When a properly and exhaustively documented master file is ready, the stage of data analysis can begin.
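Creating a master record by merging per-discipline files on a common identification key can be sketched as follows; the file contents and field names are hypothetical.

```python
# Sketch of merging per-discipline data files into a master file on a
# common identification key (study, family, individual); all contents
# and field names are hypothetical.

anthropometry = {
    ("S01", "F020", "I0105"): {"weight_kg": 10.8, "height_cm": 78.0},
}
morbidity = {
    ("S01", "F020", "I0105"): {"diarrhoea_episodes": 2},
}

def merge_files(*files):
    """Merge data files keyed on (study, family, individual)."""
    master = {}
    for f in files:
        for key, fields in f.items():
            master.setdefault(key, {}).update(fields)
    return master

master = merge_files(anthropometry, morbidity)
print(master[("S01", "F020", "I0105")])
```

In practice the merge keys and the layout of each file would be taken from the master-file documentation stressed above.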

With computer system facilities having capability for Data Base Management, the Data Bank constitutes the original source of data for structuring a useful Data Base (5) for subsequent processing. This feature is particularly useful for executing the statistical analyses required in the testing of specific hypotheses.

It is also important to point out that the data bank stage is not fixed. It is a very dynamic situation requiring continuous action and attention for as long as the interactive processes of data analyses and interpretation continue.


Data Analysis

The analysis of data relates both to the type of data and the hypotheses posed by the investigator. Most of the time, the first stage in the analysis of continuous variables consists of a scan of the data set. By scanning, one can define a set of basic descriptive statistics that will permit a first approximation to the pattern of behaviour of each variable included in the evaluation. This type of analysis, however, also provides information that can be used in assessing the relative effectiveness and success of the data cleaning and consistency controls already executed. Different levels of scans can be used to secure adequate preliminary descriptions of the study variables. In particular, in the case of discrete variables, frequency tables with single or multiple cross-classification criteria may provide a good description of these variables.

Once the quality of the data collected has been documented and the general descriptions of the study variables have been obtained, the investigator may proceed with the statistical testing of the specific hypotheses. Simple comparisons between two classes may be performed using Student's t-tests. Analysis of variance techniques (6, 7) may be used when testing hypotheses that involve more than two classes, provided proper attention is given to satisfying the basic assumptions underlying the use of these procedures (8). Trends and associations among variables may be examined by multiple regression and correlation analyses (9, 10, 11). The classification and identification of groups of observations may be performed using clustering techniques and discriminant analysis (12, 13, 14), while confounded inter-relationships among large sets of variables may be examined using factor analysis (15, 16). Overall relations in sets of variables, regardless of the nature of the variables within the set (continuous, discrete, or mixtures), may be tested using canonical correlation analysis (17). Additionally, when interest in a set of several dependent variables relates to more than two classes, the analysis may be performed using multivariate analysis of variance techniques (15).

Frequently, it is not possible to satisfy the requirements and conditions inherent in the use of the parametric techniques listed above. Under such conditions, there is the option of using distribution-free (non-parametric) techniques (18, 20). The probability of rejecting a null hypothesis when in fact the alternative hypothesis is true (the power of the test) is generally smaller for non-parametric than for parametric procedures. However, under a given set of circumstances, they may be the only choice. On the other hand, the power of non-parametric tests, when properly used, is satisfactory under the general conditions prevailing in most practical situations. The level of generalization possible through non-parametric testing often compensates for the apparent, usually small, reduction in the power of the test.
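As an illustration, one of the non-parametric statistics listed in table 12.4, the Spearman rank correlation, can be computed from first principles as follows; the data are illustrative, and ties are handled by assigning mid-ranks.

```python
# Sketch of the Spearman rank correlation (one of the non-parametric
# procedures in table 12.4), computed from first principles; the data
# below are illustrative only.

def ranks(xs):
    """Mid-ranks of xs (tied values share the average of their ranks)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = mid
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# A perfectly monotone (though non-linear) association gives rho = 1.
print(spearman([1, 4, 9, 16, 25], [2, 3, 5, 7, 11]))
```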

A partial listing of some useful analytical procedures is presented in table 12.4. Appropriate description of the method of procedure and examples of applications of these methods can be found in the statistical texts cited in presenting this subject matter (6-20).

TABLE 12.4. Common Methods Used in Statistical Analysis

I. Parametric

Univariate Multivariate
Student-t Test Multivariate Analysis of Variance (MANOVA)
Analysis of Variance (ANOVA) Multivariate Analysis of Covariance (MANCOVA)
Analysis of Covariance (ANCOVA) Discriminate Function Analysis
Regression: Simple Factor Analysis
Regression: Multiple Path Analysis
Time Series Analysis Cluster Analysis (Numerical Taxonomy)
Correlation Analysis Canonical Correlation Analysis
  Multidimensional Scaling Analysis

II. Non-parametric

Binomial Test

Lilliefors Test of Normality

Kolmogorov-Smirnov Test

Randomization Test

Kruskal-Wallis Analysis of Variance

Friedman Analysis of Variance

Cochran Q Test

Concordance Tests

Lambda Test

Multicategorical Chi-square

Wilcoxon Tests

Fisher's Exact Probability Test

McNemar Test

Eta, the Correlation Ratio Test

Theil's Slope Coefficient Test

Spearman Rank Correlation
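Two further entries from the table, the multicategorical chi-square test and the Spearman rank correlation, can be sketched as follows. SciPy is again used as a modern analog, and the contingency counts and rank data are invented for illustration.

```python
# Illustrative sketch: two non-parametric procedures from table 12.4.
# Counts and scores below are hypothetical.
import numpy as np
from scipy import stats

# Multicategorical chi-square: nutritional status (3 classes) by programme group
counts = np.array([[30, 45, 25],
                   [20, 40, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(counts)

# Spearman rank correlation: association between two ordinal measures
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
rho, p_rho = stats.spearmanr(x, y)

print(f"chi-square = {chi2:.2f} on {dof} d.f., p = {p_chi2:.4f}")
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.4f}")
```

The chi-square test needs only categorical counts, and the Spearman coefficient only a consistent ranking of the observations, which is what makes both suitable when the parametric assumptions fail.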

On the basis of the general outline of alternatives for data analysis described above, several steps are required for implementing the appropriate analytical procedures. First, the questions to be answered must be explicitly defined, to permit design of the specific analyses required to satisfy the objectives of the evaluation. The original statement of objectives and the preliminary definition of the analytical plan contained in the SOP provide a basis for the final choice of appropriate analytical procedures for answering the questions posed. This, in turn, establishes a sequence of events that relate to programming, data processing and statistical computation. This sequence, therefore, translates into an operation schedule (pathway) that is defined taking into account the most efficient utilization of available analytical facilities (hardware, software, systems analysts, programmers and operators). In the implementation of the operational schedule, the writing, debugging, testing and documenting of computer programmes may be required in the case of very specialized data.

At present, many well-tested statistical packages (software) such as SAS, SPSS, BMDP, and RUMMAGE, among others, are available for performing most of the statistical analyses mentioned in table 12.4. When these packages are used, the programming chores are minimal, relating primarily to variable specification, procedure definition, and output selection. In addition, the use of these extensively tested programmes constitutes good insurance against common programming errors. In some cases, interfacing of standard statistical packages is possible, and this increases both the capability and the efficiency of available software for widespread application of statistical analysis techniques.
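The same division of labour (variable specification, procedure definition, output selection) carries over to present-day packages. The sketch below illustrates it with SciPy rather than any of the packages named above, whose syntax differs; the growth data are invented.

```python
# Illustrative sketch: the three programming chores a package reduces analysis to.
# Data are hypothetical child-growth measurements.
from scipy import stats

# 1. Variable specification: dependent and independent variables
height_cm = [92.1, 93.4, 95.0, 96.2, 97.8, 99.1]
months_in_programme = [0, 3, 6, 9, 12, 15]

# 2. Procedure definition: a simple linear regression
result = stats.linregress(months_in_programme, height_cm)

# 3. Output selection: report only the quantities of interest
print(f"slope = {result.slope:.3f} cm/month, r = {result.rvalue:.3f}")
```

Everything else, from the numerical algorithm to the handling of edge cases, is supplied by the tested package code, which is the "insurance against common programming errors" referred to above.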

The general guidelines defined by the investigator and data-processing personnel must be translated into a sequence of events that, as indicated by Helms (21), can be summarized in flow-chart form, as illustrated in figure 12.3, and outlined as follows:

1. State the questions to be answered and the general analyses to be performed in the SOP for the evaluation. Write down the scientific objectives of the analysis.

2. Plan the sequence of the steps required in programming, data processing and statistical computations. Draw up an "operational plan."

3. Schedule the performance of each step, including personnel assignment and definition of deadlines. Draw up an "operational schedule."

4. Begin work on the problem.

5. Write, debug, test, and document an "inclusion" computer subprogramme to assess results on the basis of specific criteria for including or excluding a case from the analysis.

6. Develop specifications (control cards) which define the variables to be used in the analysis. These specifications constitute the input required to operate the Master Update Programmes, which will copy the desired variables onto a "raw analysis file" while performing an update run.

7. Incorporate the "inclusion subprogramme" (from step 5) into the Master Update Programme. This subprogramme "tells" the Master Update Programme which cases should be copied into the "raw analysis file" ("inclusions") and which should not ("exclusions ").

8. Execute an update run of the Master Update Programme to produce the "raw analysis file" (steps 6 and 7 are preparatory: this step actually produces the file).

9. Check the raw analysis file for correct format, correct variables, and correct cases (inclusions/exclusions). If not correct, determine the cause of errors, correct the problem, and return to step 6 or 7, as indicated.

10. Duplicate the raw analysis file and store the copy in a secure place as backup.

11. Design, write, test, debug, and document all "transformation programmes" required to perform data transformations and produce a "transformed analysis file." This step may include programmes for linking data from two or more raw analysis files.

12. Set up and execute the transformation programmes (step 11 ) and produce the transformed analysis file. Check the file; if errors are found, determine their origin, make the required corrections (this could involve any of steps 5-11). If no errors are found, proceed to step 13.

13. Make a backup copy of the transformed analysis file and store in a secure place.

14. Perform computations for preliminary statistical analyses, using the "latest" analysis file. Typical calculations include statistics usually called "descriptive statistics": histograms, percentiles, means, medians, standard deviations, skewness, and other moments, cross-tabulations, scatter diagrams, correlations, regressions, etc.

15. Examine the output from step 14 for "outliers" and other indications of erroneous values. Trace such "outliers" to the original data and determine which can be identified as errors and which are correct.

  • Data errors must be corrected on the data master file and the process must return to step 8 for creation of a new, corrected, analysis file.
  • Errors caused by incorrect specifications of inclusion criteria require that the specifications be corrected and that the process returns to step 5.
  • Programming errors require that the programme involved be corrected. The process then returns to one of the earlier steps, depending upon which programme was in error.
  • After errors have been corrected and the preceding steps re-executed (a cycle that may uncover further errors), and no more errors are detected in this step, proceed to step 16.

16. Write a summary of the subject-matter results of the preliminary analysis.

17. Re-examine the scientific objectives document and the operational plan (steps 1, 2). If changes are made, return to step 1. Some steps may not need to be repeated; this will be indicated in the new operational plan.

18. Design, write, debug, test, and document statistical computation programmes required for the statistical analyses.

NOTE: This step may be a long, involved process, not just another step in the procedure. Whenever this step is required, other personnel are usually assigned to it and the work proceeds concurrently with steps 4-16.

19. Perform the statistical computations required for the desired analyses.

20. Analyse the output created in step 19 and write the preliminary conclusions.

NOTE: Typically, a number of different analyses will be required in addition to the preliminary analyses performed in steps 14-16. One analysis or set of computations frequently generates ideas for performing other analyses, which is all a part of the art of statistical analysis.

21. Determine whether additional calculations are needed. If so:

  • Return to step 19 if the necessary data are on the analysis file and no additional programming is required.
  • Return to step 18 if the necessary data are on the analysis file but additional programming is needed.
  • Return to step 2 if the necessary data are not on the analysis file.

This decision usually involves personnel outside the coordinating center: project officers, participating physicians, etc.

22. When no further calculations are needed, write up the results for distribution or publication.
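The outlier screen of step 15 can be sketched briefly. The example below flags values far from the bulk of the data using the median absolute deviation, a robust alternative to the mean-and-standard-deviation screen (a single gross error can inflate the standard deviation enough to mask itself); the weights, the scale factor, and the 3.5 cutoff are illustrative assumptions, and flagged values must still be traced back to the original records before any correction is made.

```python
# Illustrative sketch of an outlier screen (step 15).
# Hypothetical infant weights (kg); 31.0 looks like a misplaced decimal point.
import numpy as np

weights = np.array([2.9, 3.1, 3.3, 3.0, 3.2, 3.4, 3.1, 31.0])

median = np.median(weights)
mad = np.median(np.abs(weights - median))       # median absolute deviation
robust_z = 0.6745 * (weights - median) / mad    # approx. standard-normal scale

outliers = np.where(np.abs(robust_z) > 3.5)[0]  # conventional robust cutoff
print("indices flagged for checking against source records:", outliers)
```

Note that the screen only identifies candidates; as the step itself states, each flagged value must be judged against the original data before being classed as an error or accepted as correct.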

Concluding remarks

As has been suggested previously by Guzman (22), and as stated repeatedly in this chapter, the processing and analysis of data must be a continuous undertaking. Procedures should be carefully defined in the study standard operating protocol (SOP), which defines activities from the day the field operations start and concludes only when all reports have been completed. It is not easy to describe in detail the handling of data without reference to a specific study, the corresponding research design and the particular set of circumstances. Accordingly, in this chapter we have presented the sequence of events and described in general terms some of the basic procedures that, through experience, have been found to be essential components of a successful data management system. With proper adjustment, the illustrative examples might serve as a guideline for effective data recording and processing procedures in a specific study. In a recent book, Casley and Lury (23) describe additional examples of procedures and present a more extensive treatment of this subject.


References

  1. C. Selltiz, S. Wrightsman, and S.W. Cook, Research Methods in Social Relations, third edition (Holt, Rinehart and Winston, New York, 1976).
  2. M.A. Guzman, C.A. McMahan, H.C. McGill, J.P. Strong, C. Tejada, C. Restrepo, D.A. Eggen, W.B. Robertson, and L.A. Solberg, "Selected methodologic aspects of the International Atherosclerosis Project," Lab. Invest., 18: 479-497 (1968).
  3. J.P. Habicht, "Estandarización de métodos epidemiológicos cuantitativos sobre el terreno" [Standardization of quantitative epidemiological methods in the field], Bol. Of. Sanit. Panam., 76: 375-384 (1974).
  4. R. Ferber, D.P. Sheatsley, A. Turner, and J. Waksberg, What Is a Survey? (American Statistical Association, Washington, D.C., 1980).
  5. J.G. Burch and F.R. Strater, Information Systems: Theory and Practice (John Wiley & Sons, Inc., New York, 1974).
  6. H. Scheffé, The Analysis of Variance (John Wiley & Sons, Inc., New York, 1969).
  7. G.W. Snedecor and W.G. Cochran, Statistical Methods, sixth edition (The Iowa State University Press, Ames, 1967).
  8. C. Eisenhart, "The assumptions underlying the analysis of variance," Biometrics, 3: 1-21 (1947).
  9. S.R. Searle, Linear Models (John Wiley & Sons, Inc., New York, 1971).
  10. F.A. Graybill, An Introduction to Linear Statistical Models, vol. 1 (McGraw-Hill Book Company, New York, 1961).
  11. J. Neter and W. Wasserman, Applied Linear Statistical Models (Richard D. Irwin, Inc., Homewood, Ill., 1974).
  12. J.A. Hartigan, Clustering Algorithms (John Wiley & Sons, Inc., New York, 1975).
  13. M.M. Tatsuoka, Discriminant Analysis (Institute for Personality and Ability Testing, Champaign, Ill., 1970).
  14. N.M. Timm, Multivariate Analysis with Applications in Education and Psychology (Wadsworth, Belmont, Cal., 1975).
  15. P.E. Green, Analyzing Multivariate Data (The Dryden Press, Hinsdale, Ill., 1978).
  16. D.N. Lawley and A.E. Maxwell, Factor Analysis as a Statistical Method (Butterworths, London, 1963).
  17. D.F. Morrison, Multivariate Statistical Methods, second edition (McGraw-Hill Book Company, New York, 1976).
  18. M. Hollander and D.A. Wolfe, Nonparametric Statistical Methods (John Wiley & Sons, New York, 1977).
  19. L.A. Marascuilo and M. McSweeney, Nonparametric and Distribution-Free Methods for the Social Sciences (Brooks/Cole Publishing Co., Monterey, Cal., 1977).
  20. T.R. Harshbarger, Introductory Statistics, second edition (Macmillan Publishing Co., Inc., New York, 1977).
  21. W. Helms, "Data Analysis Procedures for a Coordinating Center of a Large Collaborative Study," Mimeo Series No. 1003 (The Institute of Statistics, Chapel Hill, N.C., 1975).
  22. M.A. Guzman, "Some Considerations in the Design and Execution of Nutritional Field Studies," in N.W. Scrimshaw and A.M. Altschul, eds., Amino Acid Fortification of Protein Foods (The MIT Press, Cambridge, Mass., 1971), pp. 301-315.
  23. D.J. Casley and D.A. Lury, Data Collection in Developing Countries (Oxford University Press, New York, 1981).