  Medicine - Epidemiology (ECHO - NOHA - Network on Humanitarian Assistance) (European Commission Humanitarian Office, 1994, 120 p.)  Chapter 1: Epidemiology and biostatistics  Section 1 - Presentation and summarising of data Section 2 - Measures of disease frequency and association Section 3 - Planning and conducting an investigation

### Section 1 - Presentation and summarising of data

A - Types of data

Raw data of an investigation consist of observations made on individuals. In many situations the individuals are people, but they need not be. For instance, they might be red blood cells or hospitals. The number of individuals is called the sample size. Any aspect of an individual which is measured, like age, weight or sex, is called a variable.

It is often useful to distinguish between three types of variables: qualitative, discrete, and continuous. Discrete and continuous variables are often called quantitative. Qualitative (or categorical) data arise when individuals fall into separate classes. These classes may have no numerical relationship with one another, e.g. sex: male, female; eye colour: brown, grey, blue, green.

Discrete data are numerical, arising from counts. Their values are integers (whole numbers), like the number of people in a household or the number of cases of pertussis in a week. If the values of the measurements can take any number in a range, such as height or weight, the data are said to be continuous.

In practice there is an overlap between these categories. Most continuous data are limited by the precision with which measurements can be made. Human height, for example, is difficult to measure more precisely than to the nearest millimeter and is usually measured to the nearest centimeter. So, only a limited set of possible measurements is actually available, although the quantity “height” can take an infinite number of possible values. The measured height is really discrete. However, the methods described below for continuous data are as appropriate for discrete variables.

B - Frequency distributions

If data are purely qualitative, the simplest way to deal with them is to count the number of cases in each category.

The count of individuals in each category is called the frequency of that category. For example, the frequency of death through earthquakes is 389,700. The proportion of individuals in this category, related to all deaths, is called the relative frequency. The relative frequency of deaths through earthquakes is 389,700/1,011,200 = 0.385 or 38.5 per cent.

If the categories are ordered, we can use another set of summary statistics, the cumulative frequencies.

The 2,115 victims of accidents were classified according to rough impressions based on their injury. Such a classification could be useful for planning the resources for medical help. The cumulative frequency of a value of a variable is the number of individuals with values less than or equal to that value. Thus, if we order the grade of injury from slight to critical, the cumulative frequencies are 81; 521 (= 81 + 440); 1,567 (= 81 + 440 + 1,046) etc. The cumulative relative frequency of a value is the proportion of individuals in the sample with values less than or equal to that value. For example, they are 0.038 (= 81/2,115), 0.246 (= 521/2,115) etc. Thus, we can see that the proportion of victims with at most serious injuries is 0.741 or 74.1 per cent.
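The running totals quoted above can be reproduced in a few lines of Python. This is a minimal sketch: it uses only the three counts given in the text and the total of 2,115, and the variable names are illustrative.

```python
from itertools import accumulate

# Counts for the three lowest injury grades quoted in the text,
# ordered from slight upwards. Higher grades are not needed here.
counts = [81, 440, 1046]
total = 2115  # all accident victims

# Cumulative frequency: number of victims at or below each grade.
cumulative = list(accumulate(counts))

# Cumulative relative frequency: the same totals as a share of all victims.
cumulative_relative = [round(c / total, 3) for c in cumulative]

print(cumulative)
print(cumulative_relative)
```

The last value printed is the proportion of victims with at most serious injuries.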

This frequency distribution is not a very informative summary of the data, most of the values occurring only once. The cumulative frequencies are quite satisfactory.

To get a useful frequency distribution we need to divide the height scale into class intervals, e.g. from 155 to 160, from 160 to 165 and so on, and count the number of individuals in each class interval. The class intervals should not overlap, so we must decide which interval contains the boundary point to avoid it being counted twice. It is usual practice to put the lower boundary of an interval into that interval and the higher boundary into the next interval. Thus, the interval starting at 155 and ending at 160 contains 155 but not 160.
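The boundary rule can be sketched as follows. The function name and the list of heights are hypothetical and serve only to show that a value on a boundary, such as 160, is counted in the interval that starts there.

```python
from collections import Counter

def class_interval(value, start=155, width=5):
    """Return (lower, upper) of the class interval containing value.

    The lower boundary belongs to the interval, the upper boundary
    belongs to the next one.
    """
    k = int((value - start) // width)
    lower = start + k * width
    return (lower, lower + width)

# Hypothetical heights in cm, for illustration only (not the data of the text).
heights = [155, 158, 160, 164.5, 165, 171]

# Frequency of each class interval.
frequency = Counter(class_interval(h) for h in heights)

for interval in sorted(frequency):
    print(interval, frequency[interval])
```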

C - Histograms and other frequency graphs

The frequency distribution can be calculated easily and accurately by using a computer. Without a computer, the data should be ordered from the lowest to the highest value before the interval boundaries are marked and the observations counted. This is rather like starting from Table 3.

Graphical methods are very useful for examining frequency distributions. Figure 1 shows a graph of the cumulative frequency distribution for the height data. This plot is very useful for calculating some of the summary statistics presented later. The most common way of depicting a frequency distribution is by a histogram. This is a diagram in which the class intervals are marked on an axis and rectangles with heights or areas proportional to the frequencies are erected on them. The vertical scale shows the relative frequency of observations in each interval.

We often want to summarize a frequency distribution in a few numbers, to facilitate reporting or comparison. The most direct method is to use quantiles. The quantiles are sets of values which divide the distribution into a number of parts so that there are equal numbers of observations in each part. For example, the median is a quantile. The median is the central value of the distribution, so that half the points are less than or equal to it, and half are greater than or equal to it. If we have an even number of points, we take the mean of the two central values. For the height example, we have 22 values, so we take the point midway between the two central values (the 11th and 12th of the ordered values): (176 + 178)/2 = 177 cm, which we easily get from the cumulative frequencies. We can get any quantile easily from the cumulative frequency distribution.
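The rule for the median can be written as a short function. This is a sketch that assumes a non-empty list of numbers. Only the two central heights quoted in the text (176 and 178 cm) are used in the example, because the full list of 22 heights is not reproduced here.

```python
def median(values):
    """Central value of the ordered data.

    With an even number of points, the mean of the two central values is used.
    """
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                    # odd n: the single central value
    return (xs[mid - 1] + xs[mid]) / 2    # even n: mean of the two central values

# The two central heights of the example (11th and 12th ordered values).
print(median([176, 178]))
```

Python's standard library function `statistics.median` applies the same rule.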

In general, we estimate the q quantile, the value such that a proportion q of the distribution lies below it, as follows. We have n ordered observations, which divide the scale into n + 1 parts: below the lowest observation, above the highest, and between each adjacent pair. The proportion of the distribution which lies below the i-th observation is estimated by i / (n + 1). We set this equal to q and get i = q (n + 1). If i is an integer, the i-th observation is the required quantile. If not, let j be the integer part of i, the part before the decimal point. Then we take the (j + 1)th observation as the q quantile.

Other quantiles which are particularly useful are the quartiles of the distribution. The quartiles divide the distribution into four equal parts. For the height data, with n = 22, the first quartile is 174 cm: i = 0.25 x 23 = 5.75, so the first quartile is the 6th observation, which we again get from the frequency distribution. We often divide the distribution into centiles. For the 10th centile of height i = 0.1 x 23 = 2.3, so the 10th centile is the 3rd observation, 168 cm. We can also estimate quantiles from Figure 1 by finding the position of the quantile on the vertical axis, e.g. 0.1 for the 10th centile or 0.9 for the 90th centile, drawing a horizontal line to intersect the cumulative frequency graph, and reading the quantile off the horizontal axis.
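The rule i = q (n + 1) can be sketched in Python as below. The function name is illustrative, and clamping the rank to the range 1 to n for very small or very large q is an assumption added here, since the text does not cover that case. The stand-in data are simply the numbers 1 to 22, so that the value returned equals the rank of the observation chosen.

```python
import math

def quantile(values, q):
    """Estimate the q quantile using the rule i = q * (n + 1)."""
    xs = sorted(values)
    n = len(xs)
    i = q * (n + 1)
    nearest = round(i)
    if math.isclose(i, nearest):
        rank = nearest                 # i is an integer: take the i-th observation
    else:
        rank = math.floor(i) + 1       # otherwise take the (j + 1)th observation
    rank = min(max(rank, 1), n)        # assumption: keep the rank inside 1..n
    return xs[rank - 1]

# Stand-in for 22 ordered observations: the value equals its rank.
observations = list(range(1, 23))

print(quantile(observations, 0.25))   # rank used for the first quartile
print(quantile(observations, 0.10))   # rank used for the 10th centile
```

With 22 observations this picks the 6th observation for the first quartile and the 3rd for the 10th centile, in agreement with the height example.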

A convenient graphical summary of a distribution is the box and whisker plot, which uses the median, quartiles, maximum and minimum of the observations.

D - The mean

The median is not the only measure of central value for a distribution. Another is the arithmetic mean or average, usually referred to simply as the mean. It is found by taking the sum of the observations and dividing it by their number. For the height example the sum of all values is 3,906, so the mean is 3,906/22 = 177.5 cm.

At this point we need to introduce some algebraic notation, widely used in epidemiology. We denote the observations by:

$$x_1, x_2, \ldots, x_i, \ldots, x_n$$

There are n observations and the i-th of these is $x_i$. The sum of all the observations is written

$$\sum_{i=1}^{n} x_i$$

The summation sign is an upper-case Greek letter, sigma, the Greek S. When it is obvious that we are adding the values of $x_i$ for all values of i, which runs from 1 to n, we abbreviate this to $\sum x_i$.

The mean of the $x_i$ is denoted by $\bar{x}$, pronounced ‘x bar', and

$$\bar{x} = \frac{1}{n} \sum x_i$$

In this example the mean is very close to the median, 177. If the distribution is symmetrical the mean and median will be about the same, but in a skewed distribution they will not.

E - Variance and standard deviation

The mean and median are measures of the central tendency or position of the middle of the distribution. We shall also need a measure of the spread, dispersion or variability of the distribution.

One obvious measure is the range, the difference between the highest and lowest value. This is a useful descriptive measure, but it has two disadvantages. First, it depends only on the extreme values and so it can vary a lot from sample to sample. Secondly, it depends on the sample size. The larger the sample size, the further apart the extremes are likely to be.

The most commonly used measures of dispersion are the variance and standard deviation, which we shall describe now. We start by seeing how each observation differs from the mean. Table 6 shows the deviations from the mean of the 22 observations of height. If the data are widely scattered, many of the observations will be far from the mean, and so many deviations will be large. If the data are narrowly scattered, very few observations will be far from the mean and so few deviations will be large. We square the deviations and then add them, as shown in Table 6. This gives us the sum of squares about the mean:

$$\sum (x_i - \bar{x})^2$$

In the example this is equal to 1,269.5. For an average squared deviation, we divide the sum of squares by (n - 1), not n.

The estimate of variability is called the variance, defined as follows:

$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

The variance is calculated from the squares of the observations. This means that it is not in the same unit as the observations, which limits its use as a descriptive statistic. The obvious solution is to take the square root, which will then have the same unit as the observations and the mean. The square root of the variance is called the standard deviation, denoted by s:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
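As a rough check of these definitions, the sketch below computes the variance and standard deviation in two ways: from a list of observations, and from the summary figures quoted in the text (sum of squares 1,269.5 for n = 22 heights). The function names are illustrative, and the 22 individual heights are not reproduced here.

```python
import math

def mean(xs):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(xs) / len(xs)

def variance(xs):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    m = mean(xs)
    sum_of_squares = sum((x - m) ** 2 for x in xs)
    return sum_of_squares / (len(xs) - 1)

def standard_deviation(xs):
    """Square root of the variance, in the same unit as the observations."""
    return math.sqrt(variance(xs))

# Height example, using only the summary figures given in the text.
n = 22
sum_of_squares = 1269.5
height_variance = sum_of_squares / (n - 1)   # in cm squared
height_sd = math.sqrt(height_variance)       # in cm

print(round(height_variance, 2), round(height_sd, 2))
```

For the height data this gives a variance of about 60.45 cm² and a standard deviation of about 7.78 cm.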

### Section 2 - Measures of disease frequency and association

A - The denominator

The concepts introduced in this chapter are used in epidemiological studies for the description of diseases. Usually some exposure (like smoking) is considered with respect to certain diseases (e.g. lung cancer). Measures of effect describe the association between exposure and disease. Although these concepts may sometimes sound strange, the measures make sense in the context of disasters: exposures may be living locations (see Table 7), and the “disease” is usually death or injury. This chapter is based on a paper by Cesar G. Victora (1993).

This tornado was placed among the severest 3 percent of all tornadoes in the United States. In the 2 weeks following the tornado, Glass and his coworkers interviewed the families of the deceased Wichita Falls residents and the persons who were seriously injured. Based on this study the authors were able to estimate the number of people at risk in the different locations. A statement like “equal numbers of fatal and serious injuries took place in ‘mobile homes' and ‘apartments' ”, being based on the frequency count of 4, may be misleading. Obviously, many more people were in apartments, and the risk of being injured there is low (1.1 per 1,000) compared to mobile homes (13.3 per 1,000). This pitfall of “floating numerators” can be avoided by using the appropriate denominator.

Choice of the appropriate denominator is one of the most important tasks of an epidemiologist. The most commonly used denominators are

- total number: the total number of persons under study at a given time
- non diseased: the number of persons who do not have the disease of interest at a given time
- person time: the number of persons at risk multiplied by the time for which each remains at risk.

These denominators are essential for measuring disease frequency. Before describing these measures, however, it is useful to recall some basic definitions.

A ratio is the quotient of any two numbers. For example, the female to male ratio is greater than one in most communities. Ratios used in epidemiology range from 0 to +∞.

A proportion is a special type of ratio in which the denominator contains the numerator. For instance, 0.53 % of the people living in the area of the Wichita Falls tornado were injured. A proportion must range from 0 to 1, or 0 % to 100 %.

Odds are the number of events divided by the number of non-events. Odds, although common in betting, are harder to interpret than proportions. They vary from 0 to +∞, often being expressed as 1:2 (that is, 1 case per 2 non-cases). In the example of the Wichita Falls tornado the odds of serious to fatal injuries are 52:35 or about 15:10, that is 15 seriously injured people for every 10 fatally injured people.

Different types of epidemiological studies allow calculation of different measures of disease frequency.
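The difference between a proportion and odds can be illustrated with the tornado figures quoted in this section: 52 serious and 35 fatal injuries among the 16,420 people in the area. The function names are illustrative, and `odds_from_proportion` uses the standard identity odds = p / (1 - p), which is not stated in the text.

```python
def proportion(events, total):
    """Events divided by a total that contains them (0 to 1)."""
    return events / total

def odds(events, non_events):
    """Events divided by non-events (0 to infinity)."""
    return events / non_events

def odds_from_proportion(p):
    """Convert a proportion p into odds p / (1 - p)."""
    return p / (1 - p)

serious, fatal = 52, 35
population = 16420

# Proportion of the population that was seriously or fatally injured.
injured_proportion = proportion(serious + fatal, population)

# Odds of serious to fatal injuries, 52:35.
serious_to_fatal_odds = odds(serious, fatal)

print(round(100 * injured_proportion, 2), "per cent injured")
print(round(serious_to_fatal_odds, 2), "serious injuries per fatal injury")
```

A proportion of one third corresponds to odds of 1:2, which `odds_from_proportion(1 / 3)` returns as 0.5.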

The figure represents the group under study. At time t0, a0 individuals already have the disease of interest and c0 do not. Of c0, b1 will acquire the disease by t1, while c1 remain healthy. At the end of the study (t2), c2 will still be unaffected.

This scheme assumes that the disease occurs only once in each individual and that there are no losses to follow-up.

B - Prevalence

In cross-sectional studies, subjects are examined once. The number of cases may then be divided by the total number of persons studied (denominator ‘total number'). This is usually called prevalence. In the figure at t0, the prevalence is equal to a0 / (a0 + c0), while at t2 it equals (a0 + b2)/(a0 + b2 + c2). The prevalence is a proportion. In the example the prevalence of people staying in single family houses during the tornado was 59.1 % (9,705/16,420).
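Written out with the symbols of the figure, and with the single family house figures quoted above, the prevalence calculations read as follows. This only restates the text, and the symbol $P$ is introduced here for convenience.

```latex
\begin{align*}
P(t_0) &= \frac{a_0}{a_0 + c_0}, &
P(t_2) &= \frac{a_0 + b_2}{a_0 + b_2 + c_2}, \\
P_{\text{single family houses}} &= \frac{9{,}705}{16{,}420} \approx 0.591 = 59.1\,\%
\end{align*}
```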

C - Incidence

The situation is more complex in so-called cohort or incidence studies, in which subjects are followed over time. Incidence studies usually exclude individuals who are already affected at the beginning (a0). A first choice of denominator is therefore the number of non-diseased persons, the initial population at risk (c0). If they are followed up until t2, the number of new or incident cases (b2) divided by c0 gives the so-called incidence risk, also known as cumulative incidence. Cumulative incidence is a proportion. An example of a (short) follow-up study is given in Table 8. The data were collected after the 1980 earthquake in Campania, Italy. The earthquake trapped c0 = 548 people. By the first time point, t1 = 12 h, b1 = 134 people had been extracted, giving a (cumulative) incidence of extraction of 134/548 = 0.24. At t3 = 2 days the cumulative number of extracted people was 436, giving a cumulative incidence of 436/548 = 0.80.
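The two cumulative incidences of the earthquake example can be recomputed as follows. This is a sketch with illustrative names, using only the figures quoted in the text.

```python
def cumulative_incidence(new_cases, initially_at_risk):
    """New cases up to a time point divided by the initial population at risk."""
    return new_cases / initially_at_risk

trapped = 548  # c0: people trapped by the earthquake

# Cumulative number of people extracted by each time point (from the text).
extracted_by = {"12 h": 134, "2 days": 436}

incidence = {
    time_point: round(cumulative_incidence(extracted, trapped), 2)
    for time_point, extracted in extracted_by.items()
}

print(incidence)
```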

The cumulative incidence has two disadvantages. Firstly, subjects who develop the disease, who die from other causes or who are lost to follow-up can no longer be detected as incident cases. For our example of the earthquake this is not a serious problem because the follow-up time is very short. Secondly, the cumulative incidence may conceal very different rates of extraction on different days. This leads to the concept of an incidence rate, whose denominator is expressed in person-time units. The incidence rate may be calculated for various time intervals. If for the earthquake example we take one day as the time unit, with the exception of the first day the incidence rate is equivalent to the proportion extracted in the table: 0.26, 0.34, 0.22, 0.11 for day 1 to day 4. For the first day the time interval is only half a day. So, the incidence rate (per day) is 0.48 for the first 12 hours and 0.90 for the second 12 hours.

An incidence rate is a ratio, ranging from 0 to +∞, because the numerator (events) is not contained in the denominator (person-time). It reflects the velocity of change in some characteristic of the population. For recurrent diseases, incidence rates may be greater than 1 per person-time unit. For example, in many developing countries there are around 3 diarrhoea episodes per child-year.
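A minimal sketch of an incidence rate with a person-time denominator is given below. The follow-up records are hypothetical and were chosen so that the result matches the order of magnitude mentioned above; they are not survey data.

```python
def incidence_rate(events, person_time):
    """Number of events divided by the total person-time at risk."""
    return events / person_time

# Hypothetical follow-up: (years observed, diarrhoea episodes) for each child.
follow_up = [(1.0, 3), (0.5, 2), (2.0, 5), (0.5, 2)]

child_years = sum(years for years, _ in follow_up)
episodes = sum(count for _, count in follow_up)

rate = incidence_rate(episodes, child_years)
print(rate, "episodes per child-year")
```

Because the same child can contribute several episodes, the rate can exceed 1 per child-year, which a proportion never can.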

D - Proportionate and case-fatality rates

Even without population data a denominator can still be obtained. For instance, in the given example of Wichita Falls Tornado the number of fatal injuries in vehicles may be related to the overall number of fatal injuries. This proportion is often called proportionate mortality rate.

Proportionate rates are not as useful as those based on the population at risk. For instance, the number of deaths due to a particular cause may be related to the overall number of deaths in the same period. This denominator allows calculations of a (proportionate) mortality rate, which is actually a proportion.

The proportion of deaths among new cases of disease in a given period is the case-fatality rate. This is often erroneously referred to as mortality rate: “rabies is a disease with high mortality” is not true in most places, although its case-fatality is high everywhere. A case-fatality rate is a proportion, ranging from 0 to 1.

E - Measures of effect

An effect relates to the association between an exposure and a disease. Effects may be expressed in relative or absolute terms.

Relative effects are expressed as ratios, that is, quotients of two frequency measures. They are often referred to as relative risks. Their general form is:

$$\text{ratio} = \frac{\text{frequency among exposed}}{\text{frequency among unexposed}}$$

Because both frequencies must be expressed in the same units, such a ratio is dimensionless, ranging from 0 to +∞. For example, people in vehicles had a risk of being injured about seven times higher than those staying in single-family houses.
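As an illustration of such a ratio, the sketch below compares the two injury risks quoted earlier in this section for the Wichita Falls tornado: 13.3 per 1,000 in mobile homes and 1.1 per 1,000 in apartments. Treating apartments as the "unexposed" reference group is a choice made here for the example.

```python
def relative_risk(risk_exposed, risk_unexposed):
    """Ratio of two frequency measures expressed in the same units."""
    return risk_exposed / risk_unexposed

risk_mobile_homes = 13.3 / 1000   # exposed group
risk_apartments = 1.1 / 1000      # reference group (assumed for this example)

rr = relative_risk(risk_mobile_homes, risk_apartments)
print(round(rr, 1))
```

The result is about 12, so the risk in mobile homes was roughly twelve times that in apartments, although both locations had the same count of injuries.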

The relative risk does not take the absolute number of injuries into account. This is done by the so-called (population) attributable risk, which measures the percentage of injuries that could have been avoided if all people had stayed in a location with minimal risk. So 68 % of the injuries could have been avoided if the persons who stayed in their vehicles had been in apartments. As shown in Table 3, a high percentage (57 %) of the injuries could have been avoided if single family houses were more resistant to tornadoes. Although the relative risk is rather small, the high number of people staying in that location yields a high attributable risk.
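The text does not give a formula for the attributable risk. One common definition, used here as an assumption, is the share of the overall risk that exceeds the risk in the reference (minimal-risk) group. The counts in the sketch are hypothetical and are not the tornado data.

```python
def population_attributable_fraction(cases_total, persons_total,
                                     cases_reference, persons_reference):
    """(overall risk - reference risk) / overall risk.

    Assumed definition: the fraction of all cases that would be avoided
    if everyone had the risk of the reference group.
    """
    overall_risk = cases_total / persons_total
    reference_risk = cases_reference / persons_reference
    return (overall_risk - reference_risk) / overall_risk

# Hypothetical counts: 100 injuries among 10,000 persons overall,
# 20 injuries among the 5,000 persons in the lowest-risk location.
fraction = population_attributable_fraction(100, 10000, 20, 5000)
print(round(100 * fraction), "per cent of injuries attributable")
```

A location with a modest relative risk can still account for a large fraction if most of the population stays there, which is the point made above about single family houses.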

### Section 3 - Planning and conducting an investigation

A - Objectives of study

The starting point of any investigation must be to define clearly its objectives, since these will determine the appropriate study design and the type of data needed. Objectives may be categorized into one of three main types as listed below. An investigation usually has several objectives, which can of course be of different types.

- Estimation of certain features of a population. For example, what is the average number of diarrhoeal episodes per year experienced by under-5-year-olds in Bangladesh?

- Investigation of the association between a factor of interest and a particular outcome, such as disease or death.

- Evaluation of a drug or therapy or of an intervention aimed at reducing the incidence (or severity) of disease. For example, does the use of sleeping nets reduce the risk of malaria, and if so does spraying the nets with insecticide afford additional protection?

B - Observational studies

In general, it will be necessary to carry out a special study to collect the relevant data to answer the specific objectives. Estimation and association objectives lead to studies which are observational in nature; the natural history of disease is observed with no attempt made in the study to alter it. Evaluation objectives may be answered by either observational or experimental studies, depending on the type of measure being evaluated and whether it is already in use.

Observational study designs may be divided into three major groups: cross-sectional, longitudinal (including cohort) and case-control studies. People may be selected individually or in clusters. The available resources and logistic difficulties often mean, however, that it is not possible to examine a sufficient number of clusters to gain a representative picture, and results are not uncommonly based on a survey in just one community. Judgement of representativeness is then subjective rather than statistical.

C - Cross-sectional study

A cross-sectional study is carried out at just one point in time or over a short period of time. Cross-sectional studies are relatively quick, cheap and easy to carry out, and straightforward to analyse. Since they provide estimates of the features of a community at just one point in time, however, they are suitable for measuring prevalence but not incidence of disease, and associations found may be difficult to interpret. For example, a survey on onchocerciasis showed that blind persons were of lower nutritional status than non-blind persons. There are two possible explanations for this association. The first is that those of poor nutritional status have lower resistance and are therefore more likely to become blind from onchocerciasis. The second is that poor nutritional status is a consequence rather than a cause of the blindness, since blind persons are not as able to provide for themselves. Longitudinal data are necessary to decide which is the better explanation.

D - Longitudinal study

In a longitudinal study individuals are followed over time, which makes it possible to measure incidence of disease and changes over time, and easier to study the natural history of disease. Occasionally the acquisition of data may be retrospective, being carried out from past records. More commonly it is prospective and, for this reason, longitudinal studies have often been alternatively termed prospective studies.

In the majority of cases, the simplest way to carry out a longitudinal study is to conduct repeated cross-sectional surveys at fixed intervals and to enquire about, or measure, changes that have taken place between surveys, such as births, deaths or the occurrence of new episodes of disease. The interval chosen will depend on the factors being studied. For example, to measure the incidence of diarrhoea, which is characterized by repeated short episodes, data may need to be collected weekly to ensure reliable recall. To monitor child growth, on the other hand, would require only monthly or 3-monthly measurements.

The population under study may be either dynamic or fixed. In a dynamic situation, individuals leave the study when they no longer conform to the population definition, while new individuals satisfying the conditions may join. An example would be the study of incidence of diarrhoea in under-5-year-olds, in which monitoring of children would cease when they attained their fifth birthday, while newborns would be recruited into the population as they were born. In a fixed situation, on the other hand, the population is defined at the onset and, apart from deaths, migrations, and other losses to follow-up, its composition remains unchanged throughout the study. A fixed population is often called a cohort, such as the birth cohort of 1990, that is, all people born in 1990.

Longitudinal studies tend to be more costly and to pose many logistic problems in their execution.

E - Case-control study

A case-control study is used to investigate the association between a certain factor and a particular disease. The design is very different to other types of studies because the sampling is carried out according to disease status rather than exposure status. A group of individuals identified as having the disease, the cases, is compared with a group not having the disease, the controls.

F - Questionnaire design

In most studies it will be necessary to prepare a specially designed record form or questionnaire for collecting the data. This should be kept as brief as possible, and the temptation to ask every conceivable question should be resisted. Overlong questionnaires are tiring for interviewer and interviewee alike, and may lead to unreliable responses. Questions should be clear and unambiguous and written exactly as they are to be read out. Technical jargon or long words should be avoided, as should negative questions, leading questions, and hypothetical questions.

Careful thought should be given to the order in which information is collected and to whether the questionnaire should be self-administered or completed by an interviewer. There should be a logical progression through the form which is easily followed. Questions are best arranged in sections. It should be clearly indicated whenever the completion of a section is dependent on the response to a previous question, and the starting point of the next section should be easily identifiable. It is best to minimize the number of skips or jumps which can occur, since too many can be confusing, and sections can be accidentally missed. The questionnaire should be clearly labelled with the title of the study and with the respondent's name and study identification number. These should be repeated at the top of each page. The next section usually consists of general identifying information such as age, sex, and address. In general it is a good idea to arrange subsequent sections in order of importance of the information to the study, so that the most important information is collected when the interviewer and interviewee are freshest and least bored. Any sensitive questions are, however, probably best left to the end.

G - Open and closed questions

Questions may be in one of two forms, open or closed. Open questions are used to search for information, and the interviewer records the replies in a freely written form. There are no preconceived ideas about what the possible responses might be. In a closed question, on the other hand, the response is restricted to one of a specified list of possible answers. This list should include a category for ‘Other' with space to write in the details and a category for ‘Don't know'. The interviewer may either lead the respondent through the list category by category, or ask the question in an open form and then tick the category which most closely corresponds to the answer given. Open and closed forms each have advantages and disadvantages, and the choice very much depends on the particular context. Responses to closed questions are considerably more straightforward to process, but open questions can yield more detailed and in-depth information. One possibility is to use an open form during the pilot phase of the study and to use the results of this to draw up the list of answers for a closed question form for the main study.

H - Coding

Numerical data should be recorded in as much detail as possible and as individual values rather than precoded on the questionnaire into groups. Consider the example of age. The preferred option is to record date of birth and later to calculate age from this and the date of interview. The next best is to record the respondent's age in, for example, years for adults, months for young children, weeks for infants, and days for newborns. The least satisfactory approach is simply recording in which age-group, such as 0-4, 5-9, 10-14, 15-24, 25-44, 45-64 or 65 + years, the respondent belongs. The units of any measurements should also be clearly specified, for example whether weight is to be recorded in kilograms or pounds, and the number of digits of accuracy required should also be indicated. With closed questions, the corresponding numerical codes should be printed alongside the listed choice of responses. With open questions, it will be necessary to code the replies after the questionnaire has been completed and space should be allowed for this.
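The preferred option, recording the date of birth and deriving age later, might be sketched as follows. Grouping can then be done at the analysis stage without losing the individual values. The function names are illustrative, and the age groups are the ones listed above.

```python
from bisect import bisect_right
from datetime import date

def age_in_years(date_of_birth, date_of_interview):
    """Age in completed years at the date of interview."""
    years = date_of_interview.year - date_of_birth.year
    # Subtract one year if the birthday has not yet been reached.
    if (date_of_interview.month, date_of_interview.day) < (
        date_of_birth.month,
        date_of_birth.day,
    ):
        years -= 1
    return years

# Lower bounds and labels of the age groups mentioned in the text.
LOWER_BOUNDS = [0, 5, 10, 15, 25, 45, 65]
LABELS = ["0-4", "5-9", "10-14", "15-24", "25-44", "45-64", "65+"]

def age_group(age):
    """Assign an age in completed years to one of the groups above."""
    if age < 0:
        raise ValueError("age cannot be negative")
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]

age = age_in_years(date(1990, 6, 15), date(1994, 6, 14))
print(age, age_group(age))
```

In this example the child is interviewed one day before the fourth birthday, so the age in completed years is 3.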

If data are to be entered onto a computer then this should be borne in mind when designing the questionnaire. For example, it may speed up the data entry procedure if all the information to be entered is coded into boxes arranged down the right-hand side of the form. In most cases it will be necessary to assign numerical codes to the responses to non-numerical variables, such as code 1 for male and 2 for female. One box is needed for each digit (or letter), the total number of boxes required for a variable being determined by the number of digits in the maximum response likely to be recorded for the variable. Where possible the use of code zero should be avoided, since on many computer systems and in many statistical software packages it is impossible to distinguish zero from a blank response, meaning no data.

I - Multiple response questions

Multiple response questions require special consideration. They can be dealt with in two different ways. For example, in rural West Africa a family may use one or more of eight possible sources (rainwater, borehole, well, spring, river, lake, pond, stream) for their drinking water. The first method is to assign a separate coding box to each possible response, in this case eight. Each box would contain either code ‘1' for source used or ‘2' for source not used. Thus if a family used rainwater and also collected water from a river, the rainwater box (number 1) and the river box (number 5) would contain code 1, while the other six boxes would contain code 2. The second method is to instead assign codes 1 to 8 to the eight different sources and to decide on the maximum number of responses any family is likely to give. Suppose we decided that three responses was the limit. We would then allocate three separate coding boxes to the question and enter in these the code numbers of the sources named. The family using rainwater and river water would be coded as 1 (rainwater) in box 1 and as 5 (river) in box 2. Box 3 would be left blank. The codes should be entered either in numerical order, as done here, or in some other logical order, such as amount of usage of source.
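The two coding methods can be compared side by side in a short sketch. The function names are illustrative, and a blank box is represented here by `None`, which is an assumption about how the data would be stored.

```python
# The eight sources in the order given in the text (codes 1 to 8).
SOURCES = ["rainwater", "borehole", "well", "spring",
           "river", "lake", "pond", "stream"]

def code_as_indicators(used):
    """Method 1: one box per source, 1 = source used, 2 = source not used."""
    return [1 if source in used else 2 for source in SOURCES]

def code_as_list(used, max_responses=3):
    """Method 2: code numbers of the sources named, in numerical order.

    Unused boxes are left blank (None).
    """
    codes = sorted(SOURCES.index(source) + 1 for source in used)
    if len(codes) > max_responses:
        raise ValueError("more responses than coding boxes")
    return codes + [None] * (max_responses - len(codes))

family = {"rainwater", "river"}
print(code_as_indicators(family))
print(code_as_list(family))
```

The first method needs eight boxes but never overflows. The second needs only three but fails for a family naming more than three sources, which is why the limit must be chosen with care.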

J - Data checking

Each questionnaire should be carefully checked after completion, and again once the data have been entered onto the computer. The importance of this should not be overlooked. Checking should take place as soon after data collection as possible in order to allow the maximum chance of any resulting queries being resolved. Checks are basically of two sorts, range checks and consistency checks. Range checks exclude, for example, the erroneous occurrence of code 3 for sex, which should only be code 1 (male) or 2 (female). Consistency checks detect impossible combinations of data such as a pregnant man, or a 3-year-old weighing 70 kg. Scatter diagrams showing the relationship between two variables are particularly helpful in doing this, as they allow odd combinations of, for example, height, and weight to be easily spotted.
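A range check and two consistency checks of the kind described above might look like this. The field names, the coding of pregnancy (1 = yes, 2 = no) and the weight threshold for young children are assumptions made for the illustration, not part of the source.

```python
def check_record(record):
    """Return a list of problems found in one questionnaire record."""
    problems = []

    # Range check: sex may only be coded 1 (male) or 2 (female).
    if record["sex"] not in (1, 2):
        problems.append("sex code out of range")

    # Consistency check: a male respondent cannot be pregnant.
    if record["sex"] == 1 and record.get("pregnant") == 1:
        problems.append("pregnant male")

    # Consistency check: implausible weight for a young child
    # (the limits of 5 years and 40 kg are illustrative only).
    if record["age_years"] < 5 and record["weight_kg"] > 40:
        problems.append("weight implausible for age")

    return problems

record = {"sex": 1, "pregnant": 1, "age_years": 3, "weight_kg": 70}
print(check_record(record))
```

Records with a non-empty list of problems would be referred back for a query while the respondent can still be traced.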

Three basic precautions are recommended to minimize errors occurring during the handling of data. The first is to avoid any unnecessary copying of data from one form to another. The second is to use a verification procedure during data entry. Data are entered twice, preferably by two different persons, as this gives an independent assessment of any poorly written figures. The two data sets are then compared and any discrepancies resolved. The third is to check all calculations carefully, either by repeating them or, for example, by checking that subtotals add to the correct overall total. When using a computer, all procedures should be tried out initially on a small subset of the data and the results validated by hand.
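The verification step of double data entry amounts to comparing the two data sets field by field. A minimal sketch is shown below, assuming each data set is stored as a dictionary keyed by study identification number. The names and the small example records are hypothetical.

```python
def compare_entries(first, second):
    """List the discrepancies between two independently entered data sets.

    Each discrepancy is (record_id, field, value_in_first, value_in_second).
    A record missing from one data set is reported with field None.
    """
    discrepancies = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id)
        b = second.get(record_id)
        if a is None or b is None:
            discrepancies.append((record_id, None, a, b))
            continue
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                discrepancies.append((record_id, field, a.get(field), b.get(field)))
    return discrepancies

# Hypothetical entries made by two different persons.
entry_1 = {1: {"sex": 1, "age": 34}, 2: {"sex": 2, "age": 27}}
entry_2 = {1: {"sex": 1, "age": 34}, 2: {"sex": 2, "age": 21}}

print(compare_entries(entry_1, entry_2))
```

Each discrepancy reported is then resolved by going back to the original questionnaire.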