Medicine - Epidemiology (ECHO - NOHA - Network on Humanitarian Assistance) (European Commission Humanitarian Office, 1994, 120 p.)
Chapter 1: Epidemiology and biostatistics

Section 3 - Planning and conducting an investigation

A - Objectives of study

The starting point of any investigation must be to define clearly its objectives, since these will determine the appropriate study design and the type of data needed. Objectives may be categorized into one of three main types as listed below. An investigation usually has several objectives, which can of course be of different types.

- Estimation of certain features of a population. For example, what is the average number of diarrhoeal episodes per year experienced by under-5-year-olds in Bangladesh?

- Investigation of the association between a factor of interest and a particular outcome, such as disease or death.

- Evaluation of a drug or therapy or of an intervention aimed at reducing the incidence (or severity) of disease. For example, does the use of sleeping nets reduce the risk of malaria, and if so does spraying the nets with insecticide afford additional protection?

B - Observational studies

In general, it will be necessary to carry out a special study to collect the relevant data to answer the specific objectives. Estimation and association objectives lead to studies which are observational in nature; the natural history of disease is observed with no attempt made in the study to alter it. Evaluation objectives may be answered by either observational or experimental studies, depending on the type of measure being evaluated and whether it is already in use.

Observational study designs may be divided into three major groups: cross-sectional, longitudinal (including cohort) and case-control studies. People may be selected individually or in clusters. The available resources and logistic difficulties often mean, however, that it is not possible to examine a sufficient number of clusters to gain a representative picture, and results are not uncommonly based on a survey in just one community. Judgement of representativeness is then subjective rather than statistical.

C - Cross-sectional study

A cross-sectional study is carried out at just one point in time or over a short period of time. Cross-sectional studies are relatively quick, cheap and easy to carry out, and straightforward to analyse. Since they provide estimates of the features of a community at just one point in time, however, they are suitable for measuring prevalence but not incidence of disease, and associations found may be difficult to interpret. For example, a survey on onchocerciasis showed that blind persons were of lower nutritional status than non-blind. There are two possible explanations for this association. The first is that those of poor nutritional status have lower resistance and are therefore more likely to become blind from onchocerciasis. The second is that poor nutritional status is a consequence rather than a cause of the blindness, since blind persons are not as able to provide for themselves. Longitudinal data are necessary to decide which is the better explanation.

D - Longitudinal study

In a longitudinal study individuals are followed over time, which makes it possible to measure incidence of disease and changes over time, and easier to study the natural history of disease. Occasionally the acquisition of data may be retrospective, being carried out from past records. More commonly it is prospective and, for this reason, longitudinal studies have often been alternatively termed prospective studies.

In the majority of cases, the simplest way to carry out a longitudinal study is to conduct repeated cross-sectional surveys at fixed intervals and to enquire about, or measure, changes that have taken place between surveys, such as births, deaths or the occurrence of new episodes of disease. The interval chosen will depend on the factors being studied. For example, to measure the incidence of diarrhoea, which is characterized by repeated short episodes, data may need to be collected weekly to ensure reliable recall. To monitor child growth, on the other hand, would require only monthly or 3-monthly measurements.

The population under study may be either dynamic or fixed. In a dynamic situation, individuals leave the study when they no longer conform to the population definition, while new individuals satisfying the conditions may join. An example would be the study of incidence of diarrhoea in under-5-year-olds, in which monitoring of children would cease when they attained their fifth birthday, while newborns would be recruited into the population as they were born. In a fixed situation, on the other hand, the population is defined at the onset and, apart from deaths, migrations, and other losses to follow-up, its composition remains unchanged throughout the study. A fixed population is often called a cohort, such as the birth cohort of 1990, that is, all people born in 1990.
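The entry and exit rule for the dynamic under-5 population can be sketched in a few lines of Python. This is only an illustration: the function names, the study dates used in the usage note, and the choice of 1 March as the fifth birthday of a child born on 29 February are assumptions, not part of the original text. Deaths, migrations and other losses to follow-up are ignored.

```python
from datetime import date


def fifth_birthday(date_of_birth):
    """Return the date on which the child attains age 5."""
    try:
        return date_of_birth.replace(year=date_of_birth.year + 5)
    except ValueError:
        # Born on 29 February: the fifth year is never a leap year.
        # Using 1 March is an assumption made for this sketch.
        return date(date_of_birth.year + 5, 3, 1)


def follow_up_window(date_of_birth, study_start, study_end):
    """Return (entry, exit) dates for a child in a dynamic under-5
    population, or None if the child is never eligible during the study."""
    # A child enters at birth, or at the start of the study if already born.
    entry = max(date_of_birth, study_start)
    # Monitoring ceases at the fifth birthday or at the end of the study.
    exit_date = min(fifth_birthday(date_of_birth), study_end)
    if entry >= exit_date:
        return None
    return entry, exit_date
```

For a hypothetical study running from 1 January 1993 to 31 December 1994, a child born on 1 March 1989 would be followed until 1 March 1994, a child born on 10 June 1993 would join on the day of birth, and a child born on 1 January 1987 would never be included.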

Longitudinal studies tend to be more costly and to pose many logistic problems in their execution.

E - Case-control study

A case-control study is used to investigate the association between a certain factor and a particular disease. The design is very different to other types of studies because the sampling is carried out according to disease status rather than exposure status. A group of individuals identified as having the disease, the cases, is compared with a group not having the disease, the controls.

F - Questionnaire design

In most studies it will be necessary to prepare a specially designed record form or questionnaire for collecting the data. This should be kept as brief as possible, and the temptation to ask every conceivable question should be resisted. Overlong questionnaires are tiring for interviewer and interviewee alike, and may lead to unreliable responses. Questions should be clear and unambiguous and written exactly as they are to be read out. Technical jargon or long words should be avoided, as should negative questions, leading questions, and hypothetical questions.

Careful thought should be given to the order in which information is collected and to whether the questionnaire should be self-administered or completed by an interviewer. There should be a logical progression through the form which is easily followed. Questions are best arranged in sections. It should be clearly indicated whenever the completion of a section is dependent on the response to a previous question, and the starting point of the next section should be easily identifiable. It is best to minimize the number of skips or jumps which can occur, since too many can be confusing, and sections can be accidentally missed. The questionnaire should be clearly labelled with the title of the study and with the respondent's name and study identification number. These should be repeated on the top of each page. The next section usually consists of general identifying information such as age, sex, and address. In general it is a good idea to arrange subsequent sections in order of importance of the information to the study, so that the most important information is collected when the interviewer and interviewee are freshest and least bored. Any sensitive questions are, however, probably best left to the end.

G - Open and closed questions

Questions may be in one of two forms, open or closed. Open questions are used to search for information, and the interviewer records the replies in a freely written form. There are no preconceived ideas about what the possible responses might be. In a closed question, on the other hand, the response is restricted to one of a specified list of possible answers. This list should include a category for ‘Other’ with space to write in the details and a category for ‘Don't know’. The interviewer may either lead the respondent through the list category by category, or ask the question in an open form and then tick the category which most closely corresponds to the answer given. Open and closed forms each have advantages and disadvantages, and the choice very much depends on the particular context. Responses to closed questions are considerably more straightforward to process, but open questions can yield more detailed and in-depth information. One possibility is to use an open form during the pilot phase of the study and to use the results of this to draw up the list of answers for a closed question form for the main study.

H - Coding

Numerical data should be recorded in as much detail as possible and as individual values rather than precoded on the questionnaire into groups. Consider the example of age. The preferred option is to record date of birth and later to calculate age from this and the date of interview. The next best is to record the respondent's age in, for example, years for adults, months for young children, weeks for infants, and days for newborns. The least satisfactory approach is simply to record the age-group, such as 0-4, 5-9, 10-14, 15-24, 25-44, 45-64 or 65+ years, to which the respondent belongs. The units of any measurements should also be clearly specified, for example whether weight is to be recorded in kilograms or pounds, and the number of digits of accuracy required should also be indicated. With closed questions, the corresponding numerical codes should be printed alongside the listed choice of responses. With open questions, it will be necessary to code the replies after the questionnaire has been completed and space should be allowed for this.
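The preferred option, recording date of birth and deriving age later, might look as follows in Python. This is a minimal sketch: the function names are hypothetical, and the age-groups are simply those listed in the paragraph above. It also shows why individual values are better than precoded groups: a group can always be derived from an exact age, but not the reverse.

```python
from datetime import date

# Age-groups quoted in the text; 65+ is handled as the open-ended last group.
AGE_GROUPS = [(0, 4), (5, 9), (10, 14), (15, 24), (25, 44), (45, 64)]


def age_in_completed_years(date_of_birth, date_of_interview):
    """Age in completed years on the date of interview."""
    if date_of_interview < date_of_birth:
        raise ValueError("date of interview precedes date of birth")
    years = date_of_interview.year - date_of_birth.year
    # One year less if this year's birthday has not yet been reached.
    if (date_of_interview.month, date_of_interview.day) < (
        date_of_birth.month,
        date_of_birth.day,
    ):
        years -= 1
    return years


def age_in_completed_months(date_of_birth, date_of_interview):
    """Age in completed months, more suitable for young children."""
    if date_of_interview < date_of_birth:
        raise ValueError("date of interview precedes date of birth")
    months = (date_of_interview.year - date_of_birth.year) * 12
    months += date_of_interview.month - date_of_birth.month
    # One month less if the day of the month has not yet been reached.
    if date_of_interview.day < date_of_birth.day:
        months -= 1
    return months


def age_group(years):
    """Derive the age-group label from an exact age in years."""
    if years < 0:
        raise ValueError("age cannot be negative")
    for low, high in AGE_GROUPS:
        if low <= years <= high:
            return f"{low}-{high}"
    return "65+"
```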

If data are to be entered onto a computer then this should be borne in mind when designing the questionnaire. For example, it may speed up the data entry procedure if all the information to be entered is coded into boxes arranged down the right-hand side of the form. In most cases it will be necessary to assign numerical codes to the responses to non-numerical variables, such as code 1 for male and 2 for female. One box is needed for each digit (or letter), the total number of boxes required for a variable being determined by the number of digits in the maximum response likely to be recorded for the variable. Where possible the use of code zero should be avoided, since on many computer systems and in many statistical software packages it is impossible to distinguish zero from a blank response, meaning no data.
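The box-counting rule and the zero-versus-blank problem can be illustrated with a short Python sketch. The function names are hypothetical and the example assumes each field arrives as a text string. The point is that a blank field is kept as `None` (no data) instead of being silently converted to zero.

```python
# Numerical codes for a non-numerical variable, as in the text.
SEX_CODES = {"male": 1, "female": 2}


def boxes_needed(max_value):
    """Number of coding boxes: one per digit of the largest value
    likely to be recorded for the variable."""
    if max_value < 0:
        raise ValueError("maximum value must be non-negative")
    return len(str(max_value))


def parse_box_field(raw):
    """Convert the text of a coded field to an integer, keeping a
    blank field distinct from zero by returning None for 'no data'."""
    stripped = raw.strip()
    if stripped == "":
        return None
    return int(stripped)
```

For example, a weight recorded in whole kilograms with a likely maximum of 120 would need three boxes. A field read as `"  "` is returned as `None`, whereas `"0"` is returned as the number 0.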

I - Multiple response questions

Multiple response questions require special consideration. They can be dealt with in two different ways. For example, in rural West Africa a family may use one or more of eight possible sources (rainwater, borehole, well, spring, river, lake, pond, stream) for their drinking water. The first method is to assign a separate coding box to each possible response, in this case eight. Each box would contain either code ‘1’ for source used or ‘2’ for source not used. Thus if a family used rainwater and also collected water from a river, the rainwater box (number 1) and the river box (number 5) would contain code 1, while the other six boxes would contain code 2. The second method is to assign codes 1 to 8 to the eight different sources instead and to decide on the maximum number of responses any family is likely to give. Suppose we decided that three responses was the limit. We would then allocate three separate coding boxes to the question and enter in these the code numbers of the sources named. The family using rainwater and river water would be coded as 1 (rainwater) in box 1 and as 5 (river) in box 2. Box 3 would be left blank. The codes should be entered either in numerical order, as done here, or in some other logical order, such as amount of usage of source.
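The two coding methods for the water-source example can be compared side by side in a small Python sketch. The function names are hypothetical, and a blank box is represented by `None`; the source list, the codes and the limit of three responses are taken from the paragraph above.

```python
# The eight sources in the order given in the text, so that
# rainwater is number 1 and river is number 5.
SOURCES = ["rainwater", "borehole", "well", "spring",
           "river", "lake", "pond", "stream"]
USED, NOT_USED = 1, 2


def _check_sources(used):
    unknown = set(used) - set(SOURCES)
    if unknown:
        raise ValueError(f"unknown source(s): {sorted(unknown)}")


def code_one_box_per_source(used):
    """Method 1: eight boxes, each holding 1 (used) or 2 (not used)."""
    _check_sources(used)
    return [USED if source in used else NOT_USED for source in SOURCES]


def code_fixed_number_of_boxes(used, max_responses=3):
    """Method 2: a fixed number of boxes holding the codes (1 to 8) of
    the sources named, in numerical order; unused boxes stay blank."""
    _check_sources(used)
    codes = sorted(SOURCES.index(source) + 1 for source in set(used))
    if len(codes) > max_responses:
        raise ValueError("more responses than coding boxes")
    return codes + [None] * (max_responses - len(codes))
```

For the family using rainwater and river water, the first method gives `[1, 2, 2, 2, 1, 2, 2, 2]` and the second gives `[1, 5, None]`.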

J - Data checking

Each questionnaire should be carefully checked after completion, and again once the data have been entered onto the computer. The importance of this should not be overlooked. Checking should take place as soon after data collection as possible in order to allow the maximum chance of any resulting queries being resolved. Checks are basically of two sorts, range checks and consistency checks. Range checks exclude, for example, the erroneous occurrence of code 3 for sex, which should only be code 1 (male) or 2 (female). Consistency checks detect impossible combinations of data such as a pregnant man, or a 3-year-old weighing 70 kg. Scatter diagrams showing the relationship between two variables are particularly helpful in doing this, as they allow odd combinations of, for example, height and weight to be easily spotted.
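A range check and two consistency checks of the kind described above could be sketched in Python as follows. The field names, the coding of pregnancy (1 = yes, 2 = no) and the weight threshold for under-5s are all assumptions chosen for illustration; a real study would set its own plausible limits.

```python
VALID_SEX_CODES = (1, 2)          # 1 = male, 2 = female, as in the text
PREGNANT_YES = 1                  # assumed coding: 1 = yes, 2 = no
MAX_WEIGHT_UNDER_5_KG = 30        # illustrative threshold only


def check_record(record):
    """Return a list of problems found in one questionnaire record.
    `record` is a dict with the hypothetical keys 'sex', 'pregnant',
    'age_years' and 'weight_kg'; missing keys are simply not checked."""
    problems = []

    # Range check: sex must be one of the permitted codes.
    sex = record.get("sex")
    if sex not in VALID_SEX_CODES:
        problems.append("range: sex must be code 1 or 2")

    # Consistency check: a male cannot be recorded as pregnant.
    if sex == 1 and record.get("pregnant") == PREGNANT_YES:
        problems.append("consistency: male recorded as pregnant")

    # Consistency check: implausible weight for a young child.
    age = record.get("age_years")
    weight = record.get("weight_kg")
    if age is not None and weight is not None:
        if age < 5 and weight > MAX_WEIGHT_UNDER_5_KG:
            problems.append("consistency: weight implausible for age")

    return problems
```

Running such a function on every record soon after data collection gives a list of queries that can still be resolved with the interviewer or the respondent.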

Three basic precautions are recommended to minimize errors occurring during the handling of data. The first is to avoid any unnecessary copying of data from one form to another. The second is to use a verification procedure during data entry. Data are entered twice, preferably by two different persons, as this gives an independent assessment of any poorly written figures. The two data sets are then compared and any discrepancies resolved. The third is to check all calculations carefully, either by repeating them or, for example, by checking that subtotals add to the correct overall total. When using a computer, all procedures should be tried out initially on a small subset of the data and the results validated by hand.
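The verification procedure, in which the data are entered twice and the two data sets compared, can be sketched in Python as follows. The data layout (a dict of records keyed by study identification number) and the function name are assumptions made for this example.

```python
def compare_entries(first, second):
    """Compare two independent entries of the same questionnaires.

    Each argument maps a study identification number to a dict of
    field name -> value. Returns a list of discrepancies as tuples
    (study_id, field, value_in_first, value_in_second). A record that
    is present in only one data set is reported with field None."""
    discrepancies = []
    for study_id in sorted(set(first) | set(second)):
        record_1 = first.get(study_id)
        record_2 = second.get(study_id)
        if record_1 is None or record_2 is None:
            discrepancies.append((study_id, None, record_1, record_2))
            continue
        for field in sorted(set(record_1) | set(record_2)):
            value_1 = record_1.get(field)
            value_2 = record_2.get(field)
            if value_1 != value_2:
                discrepancies.append((study_id, field, value_1, value_2))
    return discrepancies
```

Each discrepancy reported would then be resolved by going back to the original questionnaire. In keeping with the advice above, a routine like this is best tried first on a small subset of the data and the results validated by hand.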