Public Health Action in Emergencies Caused by Epidemics (WHO - OMS, 1986, 285 p.)
A2.1 Compilation of data
Unless the number of cases is very small, about 12 or fewer, it is difficult to assemble case data directly from forms A, B, and C (see Tables 25, 26, and 29). Some method of summarizing the most important facts is needed if patterns of occurrence are to be demonstrated. If an electronic computer is available, individual case data can be coded on form A and entered on to a computer tape or disk, for later retrieval. Even where computer facilities and personnel are immediately available, the investigator will probably need and want to update, change and re-examine the data and compilations frequently, and the use of a computer must be supplemented by less formal methods. If a computer is not available, other compilation approaches are essential.
A2.1.1 Line-listing of cases and preparation of hand-sorted cards
The first step in summarizing data is to prepare a line-listing of all cases so as to provide a permanent record. For this purpose, the most important data are selected and presented simply and clearly so as to facilitate compilation. Exactly what is “most important” will differ according to the nature and circumstances of the outbreak and the objectives of the investigation. It will always be necessary to list age, sex, locality and date of onset of disease, but there may be other details of equal importance. An example (not a universal model) of a line-listing of cases is shown in Table A2.1.
Cases are listed as they are reported, in numerical, not chronological, order. If numbers have previously been assigned to cases, an additional column must be provided in which they can be recorded. Many items of valuable information on forms A, B, and C are not indicated in this summary. Depending on the purpose, the line-listing may have to be extended or other lists created.
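Table A2.1. Line-listing of cases in a dysentery epidemic
(Signs/symptoms: Diarrhoea to Death. Laboratory tests^{a}: None, Faecal exudate, Agent isolation.)

| Serial No. | Age (years) | Sex | Village | Occupation | Water supply | Date of onset | Diarrhoea | Fever | Bloody stool | Vomiting | Abdominal pain | Death | None | Faecal exudate | Agent isolation | Severity | Diagnostic level^{b} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 001 | 18 | M | G | Farming | Pond | 3/9 | + | + | + | | + | + | | + | | S | P |
| 002 | 2 | M | C | - | Pond | 5/9 | + | + | + | + | + | | | + | Sh. dys. | S | C |
| 003 | 30 | F | D | Weaving | Shallow well | 18/8 | + | | + | | | + | + | | | S | S |
| 004 | 25 | M | A | Fishing | Deep well | 2/9 | + | + | | | + | | + | | | M | S |
| 005 | 0.5 | F | F | - | Pond | 6/9 | + | + | + | | | | | - | - | S | S |
| etc. | | | | | | | | | | | | | | | | | |

^{a} If no tests are performed, insert + in the “None” column. Faecal exudate: if examined, record result as + or -. Agent isolation: if attempted, record result as - or give name of agent.
^{b} C = confirmed; P = presumptive; S = suspect.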
If there are a large number of cases, preparing tabular summaries directly from the line-listing may be both tedious and prone to error. It is advisable, therefore, to prepare hand-sorted cards. Standard index cards measuring 8 cm × 12 cm are readily available and will conveniently hold 9 items of information; larger cards may be used if there are more than 9 items. One card is made out for each case, and the data from the line-listing are transferred to cards, as shown in Fig. A2.1 for the first case in Table A2.1.
The hand-sorted cards may be used to make any tabulations desired. For example, the cards may be sorted into two piles by sex, and the male and female subsets then further subdivided by age-group. After the subset cards have been counted and recorded in a table, the cards can be reassembled and used for other tabulations. When all the tabulations needed at a particular time have been made, the cards can be put back into numerical order so that any individual case can be located when required.
Fig. A2.1. Hand-sorted cards

A. Model for hand-sorted card set

| Age | Locality | Sex |
| Date of onset | Serial No. | Severity |
| Occupation | Water supply | Diagnostic level |

B. Hand-sorted card for first case

| 18 | Village G | M |
| 3/9 | 001 | S |
| Farming | Pond | P |
As new cases are reported, they are added to the line-listing and additional hand-sorted cards are prepared. Similarly, if new or revised information is received on cases already recorded, changes can be made in both records. New and revised tabulations may then be made as the investigation progresses.
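Where a computer is available, the sorting and counting of cards can be imitated in a few lines. The following is only a minimal sketch: the records are the five cases of Table A2.1, while the field names, the `age_group` bands (borrowed from Table A2.2) and the `tabulate` helper are assumptions made for illustration.

```python
from collections import Counter

# The five cases listed in Table A2.1 (only the fields used in this sketch).
LINE_LIST = [
    {"serial": "001", "age": 18,  "sex": "M", "village": "G"},
    {"serial": "002", "age": 2,   "sex": "M", "village": "C"},
    {"serial": "003", "age": 30,  "sex": "F", "village": "D"},
    {"serial": "004", "age": 25,  "sex": "M", "village": "A"},
    {"serial": "005", "age": 0.5, "sex": "F", "village": "F"},
]

def age_group(age):
    """Assumed age bands, matching those used in Table A2.2."""
    if age < 10:
        return "<10"
    if age < 20:
        return "10-19"
    if age < 40:
        return "20-39"
    return ">=40"

def tabulate(cases, *keys):
    """Count cases for each combination of key values,
    like sorting the cards into piles and counting each pile."""
    return Counter(tuple(key(case) for key in keys) for case in cases)

# Sort by sex, then subdivide each pile by age-group.
by_sex_age = tabulate(LINE_LIST,
                      lambda c: c["sex"],
                      lambda c: age_group(c["age"]))
for (sex, band), count in sorted(by_sex_age.items()):
    print(sex, band, count)
```

Because the list itself is never reordered, it stays in serial-number order, just as the cards are returned to numerical order after each tabulation.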
A2.1.2 Incidence (attack) rates by personal characteristics
Tables of incidence rates should be prepared for all the subgroups considered to be relevant to the disease under investigation and the circumstances of the outbreak. Table A2.2 shows a simple example, based on a hypothetical epidemic of dysentery.
It will be seen from Table A2.2A that attack rates for males greatly exceeded those for females, and that the disease occurred most frequently among young adults. Table A2.2B shows the attack rates for the various occupational groups and demonstrates that the epidemic was concentrated among fishermen.
Table A2.2. Attack rates in a hypothetical epidemic of dysentery

A. By age and sex

| Age (years) | Males: population | Males: cases | Males: rate per 1000 | Females: population | Females: cases | Females: rate per 1000 |
|---|---|---|---|---|---|---|
| <10 | 1500 | 5 | 3.3 | 1400 | 4 | 2.9 |
| 10-19 | 1200 | 20 | 16.7 | 1200 | 5 | 4.2 |
| 20-39 | 1000 | 30 | 30.0 | 800 | 8 | 10.0 |
| ≥40 | 1000 | 10 | 10.0 | 800 | 2 | 2.5 |
| Total | 4700 | 65 | 13.8 | 4200 | 19 | 4.5 |

B. By occupation in males aged 10 years or over

| Occupation | Estimated population | Cases | Rate per 1000 |
|---|---|---|---|
| Farmers | 1800 | 15 | 8.3 |
| Fishermen | 900 | 40 | 44.4 |
| Artisans | 200 | 1 | 5.0 |
| Schoolchildren | 100 | 1 | 10.0 |
| Others | 200 | 3 | 15.0 |
| Total | 3200 | 60 | 18.8 |
The age groups used in Table A2.2 are suitable because preliminary examination of the case data indicated that there were only a few cases among children and very old people. If cases had been concentrated among the very young or the elderly, different age-groups would obviously have been selected. In this example it was necessary to estimate the size of the different occupational groups, and in practice sufficiently detailed census data will rarely be available; the best information available locally may therefore have to be used. Finally, in epidemics of other diseases, very different personal characteristics may have to be used in the analysis.
Table A2.2 includes all cases reported in the investigation, regardless of the degree of certainty with which they were diagnosed. By definition, a “confirmed” case implies a reliable diagnosis whereas “suspect” cases imply some degree of doubt as to its correctness. If there are enough cases, each of the three diagnostic levels (confirmed, presumptive, and suspect) can be tabulated separately. If the distribution of the cases by age, sex, place, time and other characteristics is dissimilar (because the “suspect” group includes cases of some other diseases), analysis should be limited to the confirmed or presumptive cases; if all groups appear to be similar, data can be combined, as in Table A2.2.
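The rates in Table A2.2 are simply cases divided by population, multiplied by 1000. As a small illustration, the sketch below recomputes the male column of Table A2.2A; the function name `attack_rate` and the dictionary layout are assumptions of this example, not part of the original method.

```python
def attack_rate(cases, population, per=1000):
    """Cases per `per` persons at risk, rounded to one decimal place."""
    return round(cases / population * per, 1)

# (population, cases) for males, taken from Table A2.2A.
MALES = {
    "<10":   (1500, 5),
    "10-19": (1200, 20),
    "20-39": (1000, 30),
    ">=40":  (1000, 10),
}

for band, (population, cases) in MALES.items():
    print(band, attack_rate(cases, population))

# The total rate is computed from the summed counts,
# not by averaging the rates of the age bands.
total_population = sum(p for p, _ in MALES.values())
total_cases = sum(c for _, c in MALES.values())
print("Total", attack_rate(total_cases, total_population))
```

The same function applies unchanged to the occupational groups of Table A2.2B, where the populations are estimates rather than census counts.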
A2.1.3 Incidence (attack) rates by place
To examine whether the cases among fishermen were concentrated in certain fishing villages, the hand-sorted cards are sorted by locality, and tables similar to Table A2.2 prepared to show distribution by place. However, the analysis should be carried one step further. If all the cases of dysentery among fishermen were found in villages A and B, this may mean either that there are no fishermen in villages C and D, or that fishermen in those villages were unaffected. In order to decide which of these alternatives is correct, population counts must be made in places not affected by the disease in order to determine whether they had zero attack rates or merely contained no persons in the occupational group concerned.
The location of cases can readily be seen by preparing a “spot map”, as shown in Fig. A2.2, again based on the hypothetical dysentery outbreak used as an example in Table A2.1. A spot is used to indicate one or more cases and, in this instance, distinctive symbols are used to differentiate between fishermen and others. Other spot maps could be prepared, with other symbols, to show sex or age distribution, onset during particular periods of time, etc.
Fig. A2.2. Spot map showing occurrence of dysentery cases in villages of subdistrict
A2.1.4 Distribution of cases in time
The third and equally important epidemiological characteristic is time distribution of cases. The hand-sorted cards may again be used to put the cases into chronological order, and tables similar to Table A2.2 can be prepared to show attack rates during various time intervals. Graphs, however, are even more effective for showing the distribution of cases in time.
The simplest and most useful graph for this purpose is a histogram in which each case is represented by a box on graph paper with the horizontal axis indicating a convenient time unit (a single day, two days, one week, etc.). This may be used as a “working” graph, begun with the first case reports and kept up to date as new cases are notified. Fig. A2.3, based on the same dysentery cases as those recorded in Table A2.1, shows cases where the onset occurred during the 1-month period that included the epidemic.
Fig. A2.3 shows three types of case: those in fishermen, those in the families of fishermen, and those in other persons. Many other characteristics of cases can be shown on a graph, e.g., the degree of diagnostic certainty of the cases (confirmed, presumptive, suspect); occurrence in different areas; occurrence by age, sex, and ethnic or occupational group; severity (survival, sequelae, death); the introduction of control measures, etc.; however, if too many details are shown, clarity will be lost.
Fig. A2.3. Dysentery cases in subdistrict (18 August-17 September 1982)
Examination of Fig. A2.3 reveals a number of important facts: (1) endemic cases of dysentery had been reported in this district almost every day; (2) the epidemic began suddenly on 1 September and ended on about 11 September; (3) the earliest epidemic cases were among fishermen, but cases followed quickly in their families and other members of the communities. This information will be useful in developing hypotheses as to the origin and development of the epidemic.
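A “working” histogram of this kind can also be kept as a simple text chart. The sketch below is illustrative only: the onset dates are invented, not those plotted in Fig. A2.3, and the helper name `daily_counts` is an assumption.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(onsets, start, end):
    """Number of case onsets on each day from start to end, inclusive.
    Days without cases are kept so that gaps remain visible."""
    tally = Counter(onsets)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=k) for k in range(n_days)]
    return [(day, tally[day]) for day in days]

# Hypothetical onset dates, invented for illustration only.
onsets = [
    date(1982, 8, 30),
    date(1982, 9, 1), date(1982, 9, 1),
    date(1982, 9, 2), date(1982, 9, 2), date(1982, 9, 2),
    date(1982, 9, 4),
]

# One '#' per case, one row per day.
for day, count in daily_counts(onsets, date(1982, 8, 30), date(1982, 9, 4)):
    print(day.isoformat(), "#" * count)
```

Rerunning the loop after appending newly notified cases keeps the chart up to date, in the same way as adding boxes to the graph paper.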
A2.2. Formulating and testing hypotheses of causation
The control of an epidemic requires a plan of attack, and such a plan must be based on the most plausible explanation of the origin and continuation of the epidemic, that is, a hypothesis of causation. In the absence of such a hypothesis, control activities may be disorganized or misdirected; they may not be given the appropriate order of priority, and it may be difficult to evaluate their effectiveness if targets have not been established. It must be emphasized, however, that a hypothesis is merely a tentative explanation based on currently available information, to be rejected or modified, as necessary, as additional information is accumulated, or changed if the pattern of transmission changes. The officer in charge of control activities must therefore continuously re-examine the hypothesis, and must be prepared to adapt activities to changes in views as to the nature of the outbreak.
A hypothesis of causation is formulated on the basis of all the information available on the outbreak: clinical features, laboratory diagnostic studies, epidemiological patterns, the results of environmental and ecological surveys and assessments (including vector and reservoir studies), and whatever additional information an experienced and imaginative investigator may be able to gather about the movements of people, changes in activities, imports, environmental and climatic disturbances, etc. The main point of interest, however, is always the diagnosis of the disease involved. If this can be established with certainty, standard works of reference can be consulted and the possible sources and transmission mechanisms identified. The next step in hypothesis formulation is epidemiological analysis.
A2.2.1 Determining the mode of transmission
For certain diseases, the mechanism of transmission is known; for example, it can safely be assumed that an outbreak of yellow fever is being propagated by infected mosquitos, and that measles is being transmitted by the respiratory route, person-to-person. Even with such diseases, however, further information may be necessary, e.g., with yellow fever it may be essential to know the focal distribution and particularly to explain the origin of the first focus, while with measles it may be necessary to determine the origin and pattern of spread. For other diseases, such as dysentery, both common-source contamination and person-to-person transmission are of major importance, and their specific role in any particular epidemic must be elucidated. With a disease of unknown etiology, as was the case with Lassa fever at the time of its first appearance, no guidance for the investigation is initially available.
Descriptive data of the type described above are used to obtain tentative answers to questions concerning origin and propagation. The recommended procedure is to examine each table, graph or map separately at first, for two purposes: to interpret its possible meaning, and to identify any missing information or additional detail that should be sought. In yellow fever, for example, the presence of a sex differential in reported cases will suggest differing exposures of men and women; this in turn suggests that information on occupation should be obtained, and hence possibly on the localities where those concerned were employed. In measles, separate tabulation of cases by immunization status means that age-specific immunization records will have to be obtained (in order to calculate attack rates for immunized and unimmunized children), together with information on attendances at school or other gathering places (in order to search for foci of transmission), and on contacts (in order to trace chains of transmission and, for control purposes, as a guide to an emergency immunization programme).
Where a disease may be propagated either by common-source exposure of large numbers of people or by person-to-person spread, first priority in an investigation must be given to determining which pattern best explains the known cases. The distribution of case onsets over time may provide the first and best clue while the shape of the epidemic curve may also be helpful. Fig. A2.4 shows a number of typical (and stylized) curve patterns characteristic of various types of exposure. In all such curves the number of cases is shown on the vertical axis and the passage of time (measured in hours, days, or weeks) on the horizontal axis.
Fig. A2.4. Typical shapes of epidemic curves
Curve I represents a simple common-source epidemic. Since all the cases are the consequence of a single exposure over a particular period of time, e.g., to toxin-contaminated food (with the case onsets extending over a period of hours) or as a result of the temporary pollution of a community water supply (over a period of days for dysentery or of weeks for hepatitis A), they must all have had their onset during a period of time (AB) which lies within the usual range of incubation periods for the disease concerned. For example, in typhoid fever, where the great majority of cases have incubation periods of 7-21 days, the interval between A and B should not be more than 15 days if all exposures occurred on the same day. In actual practice, however, when a hypothesis of causation is being formulated, it is necessary to reason backwards in time from current observations. When a curve resembling Curve I is plotted for a disease diagnosed as typhoid fever, if the interval AB is 15 days or less it may be hypothesized that one brief common exposure could account for all cases.
For a disease such as typhoid fever, common-source cases often pass the infection on to contacts by person-to-person transmission, so that a limited number of secondary cases may develop. The epidemic will then continue for an additional period of time (BC in Curve II). The time represented by B is, of course, unknown but if the first 75-80% of the cases in a typhoid epidemic occurred within the above-mentioned period of 15 days, common-source exposure followed by secondary spread may be hypothesized.
Curve III shows a somewhat different situation, namely that of an epidemic continuing for a longer period of time (AD) and ending either at the pre-epidemic level of endemicity or at some higher level. If such an epidemic began with common-source exposure, either it was followed by uncontrolled person-to-person transmission completely obscuring the time point B or the common source of exposure was continuously present over a long period of time. A curve of this shape is not very helpful in hypothesis formulation and clues must be sought elsewhere.
Finally, Curve IV shows the slower, gradual build-up of cases in an epidemic that does not originate from a common source of exposure. A wave-like pattern may sometimes be seen early in the course of the outbreak, with the “waves” representing successive “generations” of transmission; the interval between two crests is the average incubation period. An epidemic curve such as this may be found with enteric and respiratory diseases and with vector-borne diseases in which man serves as the reservoir.
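The duration checks described for Curves I and II lend themselves to a short calculation. The sketch below assumes that intervals are counted inclusively (so that incubation periods of 7 to 21 days give the 15-day limit quoted for typhoid fever); the function names and the onset dates are hypothetical and serve only as an illustration.

```python
from datetime import date, timedelta

def max_onset_spread(min_incubation, max_incubation):
    """Longest span of onset dates, in days and counting both end days,
    that a single same-day exposure could produce."""
    return max_incubation - min_incubation + 1

def fits_single_exposure(onsets, min_incubation, max_incubation):
    """True if all onsets fall within the span allowed by one brief exposure (Curve I)."""
    duration = (max(onsets) - min(onsets)).days + 1
    return duration <= max_onset_spread(min_incubation, max_incubation)

def share_within_first_days(onsets, days):
    """Fraction of cases whose onset falls within the first `days` days of the outbreak."""
    first = min(onsets)
    return sum((d - first).days < days for d in onsets) / len(onsets)

# Hypothetical typhoid onsets, invented for illustration:
# eight cases within 15 days, followed by two late cases.
offsets = (0, 2, 3, 5, 6, 8, 11, 14, 20, 26)
onsets = [date(1983, 3, 1) + timedelta(days=k) for k in offsets]

spread = max_onset_spread(7, 21)                 # typhoid: 7-21 days
print(spread)
print(fits_single_exposure(onsets, 7, 21))       # whole outbreak
print(share_within_first_days(onsets, spread))   # share inside the 15-day span
```

With these invented dates the whole outbreak is too long for a single exposure, yet 80% of the cases fall within the first 15 days, which is the pattern the text associates with Curve II.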
A2.2.2 Estimating the date of common-source exposure
If it has been decided that an epidemic probably resulted from common-source exposure, the next stage is to estimate when that exposure could have taken place. The procedure is a simple one, and is illustrated in Fig. A2.5, which shows the distribution of cases in a typhoid fever epidemic with 81 cases; the epidemic began on 24 March and ended on 4 April, its total duration being 12 days. As this is within the usual range of typhoid incubation periods (15 days), a common-source outbreak can be hypothesized. The “minimum-maximum” method of estimation shows that, if all cases had been exposed on the same day, the first two cases must have become infected not less than 7 days earlier (the minimum incubation period), i.e., on or before 17 March. The last case must then have been the one with the longest incubation period (21 days) and therefore could not have been infected earlier than 14 March. A single common exposure some time between 14 and 17 March could thus account for all the cases. The hypothesis that exposure took place at some time during this limited period provides valuable guidance in searching for the event that caused the exposure.
An alternative method is to make use of the average incubation period. Taking the date of onset of the median case (41st in chronological order in this outbreak of 81 cases), and counting back 14 days (the average incubation period of this disease), 15 March is found to be the approximate date of the common exposure.
Fig. A2.5. Estimating date(s) of possible exposure in a common-source outbreak
Occasionally it is possible to find cases among people who have visited the area of the epidemic for a brief period of time and then gone away, so that the onset of the disease takes place in an area in which no other case has occurred. Epidemics originating at fairs or festivals may produce many such cases. The interval between the brief exposure and the onset of the disease reflects the incubation period, and the date(s) on which the person concerned resided in the epidemic (or suspect) area define(s) the period during which infection took place.
If an epidemic curve resembles Curve II in Fig. A2.4, the problem of estimating the date(s) of common exposure is more complicated because the time point B is unknown. It may be possible, however, to identify common-source cases by careful examination of contact histories. If the second and subsequent cases in individual households are separated from the first by the length of an incubation period, they may be assumed to be contact cases. The first case in every household may then be plotted on a separate graph, and the date of common exposure estimated from this. In the dysentery epidemic described in Section A2.1.2 and plotted in Fig. A2.3 this procedure was followed and cases were divided into three categories: fishermen, the families of fishermen, and others in the villages. The entire outbreak, shown in Fig. A2.3, lasted for 11 days, too long for it to be entirely common-source in character, since the incubation period for dysentery has a range of only 1-7 days and is most commonly 1-3 days. The onset of cases among fishermen, however, occurred only over a period of 4 days (from 1 to 4 September) and they could therefore have been exposed to a common source. The estimated date of that exposure would be some time between 28 and 31 August, and most probably on the latter date.
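Both estimation methods in this section reduce to simple date arithmetic, sketched below. The function names are hypothetical. The year 1982 comes from the caption of Fig. A2.3 for the dysentery outbreak but is an arbitrary choice for the typhoid example, and the median onset date of 29 March is inferred from the stated result (15 March plus 14 days) rather than given in the text.

```python
from datetime import date, timedelta

def exposure_window(first_onset, last_onset, min_incubation, max_incubation):
    """Minimum-maximum method: (earliest, latest) possible date of a single
    common exposure, or None if no single date can explain all the cases."""
    earliest = last_onset - timedelta(days=max_incubation)
    latest = first_onset - timedelta(days=min_incubation)
    if earliest > latest:
        return None
    return earliest, latest

def exposure_from_median(median_onset, average_incubation):
    """Alternative method: count back the average incubation period
    from the onset date of the median case."""
    return median_onset - timedelta(days=average_incubation)

# Typhoid epidemic: onsets from 24 March to 4 April, incubation 7-21 days.
print(exposure_window(date(1982, 3, 24), date(1982, 4, 4), 7, 21))

# Median case (41st of 81), assumed onset 29 March; average incubation 14 days.
print(exposure_from_median(date(1982, 3, 29), 14))

# Dysentery cases among fishermen: onsets 1-4 September, incubation 1-7 days.
print(exposure_window(date(1982, 9, 1), date(1982, 9, 4), 1, 7))
```

Applied to the whole 11-day dysentery outbreak (1 to 11 September), `exposure_window` returns `None`, which matches the conclusion that the outbreak could not have been entirely common-source.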
A2.2.3 Case-control studies to identify specific causes
A hypothesis of causation may be adequate as a basis for control activities even when derived only from analysis of descriptive data and a comparison of attack rates among identifiable population subgroups, as shown previously. For example, if yellow fever is occurring only among adult male woodcutters in a South American forest, and not among other groups, it is reasonable to conclude that mosquito transmission is occurring only in the forest. Sometimes, however, differences in attack rates are not sufficient to determine the source of exposure precisely because it is not clear how to subdivide a particular group into relevant subgroups. A simple case-control study may then be helpful.
Thus the dysentery outbreak referred to in section A2.1.2 affected 40 of the 900 people who claimed to be fishermen. A brief inquiry may reveal that only 100 were fishing during the suspect period of 28-31 August, including all those who became ill. Further subdivision of this group by place or activity may not be feasible, and another epidemiological approach, a case-control study, is needed to identify a possible specific exposure. If a questionnaire is drawn up covering food and water consumption each day during the suspect period, it will be possible to obtain and compare information on these (or other) exposures from cases and non-cases (the “controls”) among those who were fishing. A possible (somewhat oversimplified) result of such an investigation is shown in Table A2.3.
If feasible, as many as possible of the cases should be interviewed; if there are very many cases, a representative sample should be selected for interview, together with at least as many controls. The numbers need not be equal, however. For Table A2.3 it was assumed that all the men concerned could answer all the questions; if not, the percentages of respondents who answered “yes” would have to be calculated. The table shows only one activity in which a substantially greater proportion of cases than controls had engaged, namely drinking pond water at place B. If it is confirmed statistically that the difference between these proportions is unlikely to have occurred merely by chance (see Section A2.2.4) it can be hypothesized that this pond was the source of infection.
Table A2.3. Activities indulged in by cases and controls among fishermen between 28 and 31 August^{a}

| Activity | Cases: No. | Cases: % | Controls: No. | Controls: % |
|---|---|---|---|---|
| Brought food from home | 32 | 80.0 | 39 | 70.9 |
| Purchased food at place A | 36 | 90.0 | 50 | 90.9 |
| Purchased food at place B | 30 | 75.0 | 45 | 81.8 |
| Ate fish caught in river | 31 | 77.5 | 41 | 74.5 |
| Drank river water | 39 | 97.5 | 55 | 100.0 |
| Drank pond water at place B | 35 | 87.5 | 17 | 30.9 |
| Drank well-water at place A | 33 | 82.5 | 48 | 87.3 |

^{a} 40 cases and 55 controls.
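The comparison made in Table A2.3 can be reproduced by computing the percentage of cases and of controls reporting each activity and ranking the activities by the difference. This is a rough sketch only: the counts are those of the table, but the names `percentage` and `compare` are assumptions, and the ranking merely highlights candidates for the statistical assessment described later.

```python
N_CASES, N_CONTROLS = 40, 55

# activity: (cases answering "yes", controls answering "yes"), from Table A2.3
EXPOSURES = {
    "Brought food from home":      (32, 39),
    "Purchased food at place A":   (36, 50),
    "Purchased food at place B":   (30, 45),
    "Ate fish caught in river":    (31, 41),
    "Drank river water":           (39, 55),
    "Drank pond water at place B": (35, 17),
    "Drank well-water at place A": (33, 48),
}

def percentage(yes, total):
    """Percentage answering "yes", rounded to one decimal place."""
    return round(100 * yes / total, 1)

def compare(exposures, n_cases, n_controls):
    """Rows of (activity, % of cases, % of controls, difference),
    with the largest excess among cases first."""
    rows = []
    for activity, (case_yes, control_yes) in exposures.items():
        pct_cases = percentage(case_yes, n_cases)
        pct_controls = percentage(control_yes, n_controls)
        rows.append((activity, pct_cases, pct_controls,
                     round(pct_cases - pct_controls, 1)))
    return sorted(rows, key=lambda row: row[3], reverse=True)

for activity, pct_cases, pct_controls, diff in compare(EXPOSURES, N_CASES, N_CONTROLS):
    print(f"{activity:30s} {pct_cases:6.1f} {pct_controls:6.1f} {diff:+6.1f}")
```

If some respondents could not answer a question, the denominators would need to be the numbers who did answer, as the text notes.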
Many other case-control inquiries can be made, limited only by the experience, skill, and imagination of the investigator. The basic principle is that the study is started by selecting a case group and a control group that are comparable and had an equal chance of being exposed. Questions are then asked about exposures relevant to the disease and the circumstances. The proportions of “cases” and “non-cases” that experienced the exposure are compared and statistically significant differences noted.
A2.2.4 Statistical assessment
When attack rates or proportions (in case-control studies) are being compared, the objective is to determine whether there are real differences between various population groups of interest. Since differences may occur by chance alone, it is necessary to have some method of estimating the probability that the differences could have occurred by chance or, conversely, that they are unlikely to have occurred by chance and are therefore likely to be real. The method used is the calculation of statistical probability, commonly referred to as “statistical significance”.
Two approaches are available, the most frequently used being the direct calculation of the statistical probability that differences as great as those found could have occurred by chance. The result is expressed as a probability proportion, that is, the number of times out of 100 or 1000 that a difference of the magnitude observed could have occurred by chance. This is written, e.g., as “P = 0.13”, “P = 0.03” or “P = 0.003”, meaning respectively 13 times out of 100, 3 times out of 100, and 3 times out of 1000. It has become the convention to consider P = 0.05 as the point that separates differences that are unlikely to be chance variations (“statistically significant”, and therefore possibly meaningful) from those that might easily have occurred by chance (and are therefore “not statistically significant”). That is, a “P-value” of 0.05 or less (a chance event expected to occur 5 times in every 100 or less frequently) suggests a real difference, while one greater than 0.05 provides no strong evidence that the difference found is other than a chance happening. This arbitrary dividing line should not be treated as an absolute, however, and values above it should not cause a hypothesis to be rejected out of hand nor should one below it cause a hypothesis to be accepted blindly. Instead, the statistical evidence should be weighed together with all other information, and a conclusion reached accordingly.
For an epidemic control officer who is not well trained in statistical methodology a simpler but very useful approach is available. If the conventional 0.05 level of probability is accepted as a reasonable dividing line between differences that are likely to be chance variations (values above 0.05) and those that are likely to be meaningful (values of 0.05 or lower), simple formulae can be used to show whether two rates or two proportions differ “significantly”. These are given in section A2.3, together with guidelines for their use and interpretation.
Although the question is not discussed at length in this manual, the disease control officer is often interested in determining the prevalence of some characteristic in a population, e.g., he may wish to determine the prevalence of amoebic cysts, of BCG immunization scars, or of households with vessels in which mosquitos are breeding. After selecting a representative sample of people or households, and making the necessary investigations, he calculates the rate. The rate obtained in the sample is obviously only an estimate of the true rate in the entire population, and will vary from sample to sample. Furthermore, the precision of the estimate will depend on the size of the sample, just as 10 tosses of a coin may easily produce 4, 5, or 6 “heads”, but 100 tosses are likely to produce close to 50% “heads”. Some statistical method of assessing the precision of a rate found by such an investigation is therefore necessary. The statistical calculation of “confidence intervals” provides a range of values that define the upper and lower limits within which the true population prevalence may confidently be expected to lie. A simple procedure for calculating a confidence interval is described in section A2.3.
A2.3 Statistical analysis
Most investigations conducted by an epidemiologist are concerned with the rate of occurrence of a particular phenomenon in a group of individuals under observation. It is helpful to distinguish between surveys to determine the rate of phenomenon X in population Y and descriptive epidemiological studies designed to discover whether there is a difference between the rates of occurrence of phenomenon X in populations Y and Z. In the first instance the statistical approach is to define a “confidence interval” for the rate, in the second, to determine whether the difference in the observed rates is likely to be a chance occurrence or not. Each of these approaches will be examined in turn.
Much of the work that a statistician does when analysing rates can be understood by any epidemiologist willing to study and make use of two concepts: the standard normal score and a statistical value often called Pearson’s chi-squared statistic.
One-rate studies will be considered first here. The statistical formulae to be used in two-sample studies are given on the next pages.
A2.3.1 Confidence intervals
Suppose the specific question to be answered is: “What is the rate of infection among all adult males exposed to virus X?”. Since it is not feasible to study all such males, a representative sample of those exposed to the virus is selected, and the rate of infection in that sample is determined. It would be naive to assume that the rate found in this particular group is exactly equal to that for all exposed adult males. Instead, the statistician usually defines a 95% confidence interval to indicate the range within which the true overall rate is likely to lie.
If the true (but unknown) overall rate is called R, and the observed rate r, the confidence interval is given by the following formula:
$$R = r \pm 1.96\sqrt{\frac{r(1-r)}{n}}$$
where n is the number of men in the sample.
For example, if 100 exposed men constitute the study group, and 30 become infected, then r = 0.3 and n = 100. The 95% confidence interval is then:
$$R = 0.3 \pm 1.96\sqrt{\frac{0.3 \times 0.7}{100}}$$
$$= 0.3 \pm 1.96 \times 0.0458$$
$$= 0.3 \pm 0.090$$
The lower boundary, therefore, is 0.3 - 0.090 = 0.210 and the upper boundary is 0.3 + 0.090 = 0.390. The statistical interpretation is that 95 out of 100 confidence intervals established in this fashion will include the true value for R, the rate in the entire population of adult males. In non-statistical terminology, the epidemiologist interprets this result to mean that, while 30% is the best estimate of the infection rate among all men, he is confident that the true proportion of infected men in the community lies between 21% and 39%.
It should be noted that the confidence interval depends markedly on the size of the sample studied. If n were only 50 instead of 100 as above, the result would be 0.3 ± 0.127 (i.e., 0.173-0.427), and if it were 500 the result would be 0.3 ± 0.040 (i.e., 0.26-0.34). A large sample provides a far better estimate than a small one.
It is also important to note that the figure of 1.96 used in the formula is a standard normal score. The use of this value (for a 95% confidence interval) is a common convention, but is not always appropriate. If n is small and the rate is either very high or very low, a much more complicated formula is required and the assistance of a statistician may be needed. A quick rule of thumb is that 1.96 can safely be used if the sample contains at least 5 people who have experienced the phenomenon under study (i.e., were infected in this example) and 5 who have not experienced it (i.e., were not infected).
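The calculation and the rule of thumb can be combined in a few lines. The following is a minimal sketch, assuming the usual normal-approximation interval r ± 1.96 √(r(1 − r)/n), which reproduces the figures quoted above; the function name and the decision to raise an error when the rule of thumb fails are choices made for this example.

```python
import math

def confidence_interval_95(affected, n):
    """95% confidence interval for a rate, using the standard normal score 1.96.

    Follows the rule of thumb in the text: the sample should contain at least
    5 people with the phenomenon and at least 5 without it.
    """
    if affected < 5 or n - affected < 5:
        raise ValueError("sample too small or rate too extreme for the 1.96 "
                         "approximation; consult a statistician")
    r = affected / n
    half_width = 1.96 * math.sqrt(r * (1 - r) / n)
    return r - half_width, r + half_width

# The worked example (30 infected of 100) and the two other sample sizes,
# keeping the observed rate at 0.3.
for affected, n in [(30, 100), (15, 50), (150, 500)]:
    lower, upper = confidence_interval_95(affected, n)
    print(n, round(lower, 3), round(upper, 3))
```

The widening of the interval at n = 50 and its narrowing at n = 500 show directly how the precision of the estimate depends on sample size.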
A2.3.2 Significance test
It is useful to distinguish between “single-group” studies and “multiple-group” studies. In a single-group study, the rate in the study group is compared with some rate established from information available before the study begins. In a multiple-group study, the rates for the two or more groups included in the study are compared with each other.
In both single- and multiple-group studies, the individuals included in the study are thought of as a sample taken from some larger population. The rate observed in a study group is considered to be an estimate of the actual rate for the entire population having similar characteristics and exposures. The epidemiologist therefore asks: “Is the difference between the observed rate and the ‘established’ rate for the entire population (or between the two observed rates) a real difference, or could it be merely a chance variation?” In other words, a “significance test for the difference between rates” must be made.
Single-group studies. Suppose that it is known that the rate of phenomenon X is 60% (i.e., the proportion is 0.6) in a general population. This rate has been found and confirmed in a number of settings. A need then arises to determine the rate of phenomenon X (which might, for example, be the malaria parasite rate) in some subpopulation of interest (the study population). A sample of 2000 persons from this study population is examined and a rate for malaria parasitaemia is obtained. This observed rate can be expected to differ somewhat from 60%. Does the difference between the observed rate and 60% indicate that the rate for the entire study population is different from 60%, or does it merely reflect the kind of variability that can be expected as the result of chance?
The appropriate test of significance in this problem is the calculation of Pearson’s chi-squared statistic. The symbol used is $\chi^2$ and the calculation is as follows.
Step 1. Calculate how many persons in the sample could be expected to have malaria parasites, on the assumption that the rate is the same in the study population as it is for the general population. In the example given, this expected number, E, will be equal to 1200 (0.6 times 2000).
Step 2. Determine the difference between the number of persons “expected” to have malaria parasites and the number actually observed. Then, ignoring the direction of the difference (i.e., whether it is negative or positive), reduce this figure by 0.5, and call the result D. In the example, suppose that 1300 of the 2000 people examined had malaria parasites. The difference between the expected and observed figures is 100; if this is reduced by 0.5, a value for D of 99.5 is obtained.
Step 3. Now calculate Pearson’s chi-squared statistic from the formula:
$$\chi^2 = \frac{n D^2}{E\,(n - E)}$$
where n is the total number of people in the sample.
In the example given, n = 2000, E = 1200 and D = 99.5, so that
$$\chi^2 = \frac{2000 \times (99.5)^2}{1200 \times 800} = 20.63$$
Step 4. Compare the resulting value of $\chi^2$ with the number 3.84. A sample can be expected to yield a value of $\chi^2$ as large as 3.84 by chance 5 times out of 100, if the true rate in the study population is equal to that in the general population. Since the value for the sample exceeds 3.84, it can therefore be concluded that there is a real difference between the rate for the study population and that for the general population. In statistical terminology, since the value exceeds 3.84, the result is statistically significant at the 0.05 level.
Again, it is important to note the effect of sample size on the result. If the sample had included only 200 people instead of 2000, and the same percentages had been obtained (i.e., n = 200, E = 120, observed value = 130, and D = 9.5), $\chi^2$ would have been about 1.88. This would have been “not statistically significant at the 0.05 level”. It is possible to estimate in advance the size of the sample needed to demonstrate an expected possible result, but the assistance of a statistician may be required.
It must also be emphasized that the entire procedure described is appropriate only when the sample size is large enough for both E and n - E to be greater than or equal to 5. If this is not the case, a more complicated analysis will be required and a statistician should be consulted.
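Steps 1 to 4 of the single-group test can be sketched in Python as follows. This is an illustration only: it assumes the statistic has the form n × D² / (E × (n - E)), with D already reduced by 0.5, and the function name and error message are hypothetical.

```python
def single_group_chi_squared(observed, n, established_rate):
    """Chi-squared for one sample compared with an established rate.

    observed: number of people in the sample showing the phenomenon
    n: sample size
    established_rate: proportion in the general population (e.g. 0.6)
    """
    expected = established_rate * n  # Step 1: expected number E
    if expected < 5 or (n - expected) < 5:
        raise ValueError("E and n - E must both be at least 5; consult a statistician")
    # Step 2: absolute difference reduced by 0.5 (never below zero)
    d = max(abs(observed - expected) - 0.5, 0)
    # Step 3: the statistic (assumed form, see the lead-in)
    return n * d ** 2 / (expected * (n - expected))


CRITICAL_VALUE = 3.84  # Step 4: threshold for significance at the 0.05 level

for observed, n in ((1300, 2000), (130, 200)):
    chi2 = single_group_chi_squared(observed, n, 0.6)
    verdict = "significant" if chi2 > CRITICAL_VALUE else "not significant"
    print(f"n = {n}: chi-squared = {chi2:.2f} ({verdict} at the 0.05 level)")
```

The same observed percentage (65%) is significant with 2000 people but not with 200, which illustrates the effect of sample size described above.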
Two-sample studies (unmatched). In studying the rate of occurrence of phenomenon X in population Y, it may be essential to compare rates for X between subgroups within that population (as in the dysentery epidemic discussed in section A2.1), e.g., for males as compared with females, exposed with unexposed persons, people over 40 years of age with those 40 years old or younger, etc. Once again, even if the rates are identical for all groups within population Y, it would not be reasonable to expect the rates actually found in the study groups to be identical. It is again necessary to be able to determine when differences in the observed rates reflect real differences in subpopulation rates and when they are likely to be due simply to chance. If members of the two subgroups are chosen independently of each other (i.e., without matching) the analysis involves calculation of $\chi^2$ for a 2 × 2 table.
The case-control study described in section A2.2.3 may be taken as an example. The 40 dysentery cases and 55 controls were questioned about food and water consumption while they were fishing, and it was found that the two groups had differed considerably in the extent to which they had drunk pond water at place B. Cases and controls had not been matched. The results were displayed as percentages in Table A2.3, but for chi-squared analysis a different arrangement is required, as shown below:
| Did you drink pond water? | Cases | Controls | Total |
|---|---|---|---|
| Yes | 35 = a | 17 = b | 52 |
| No | 5 = c | 38 = d | 43 |
| Total | 40 | 55 | 95 |
The observed rates for drinking pond water, 87.5% for cases and 30.9% for controls, seem to be very different, but do they indicate a real difference between these groups, or could they have been due to chance? To find out, Pearson’s chi-squared statistic is calculated, as follows:
Step 1. Find the product a × d (35 × 38 = 1330) and the product b × c (17 × 5 = 85).
Step 2. Subtract the smaller product from the larger (1330 - 85 = 1245) and call the difference P_{1}.
Step 3. Find the product of all subtotals (40 × 55 × 52 × 43 = 4 919 200) and call it P_{2}.
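Step 4. Calculate Pearson’s chi-squared statistic from the following formula:
$$\chi^2 = \frac{n\,(P_1 - n/2)^2}{P_2}$$
where n equals the total number in the study.
In the example, P_{1} = 1245, P_{2} = 4 919 200 and n = 95, so that:
$$\chi^2 = \frac{95 \times (1245 - 47.5)^2}{4\,919\,200} = 27.69$$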
Step 5. Compare the value of $\chi^2$ with the number 3.84. When the subpopulation or group rates are equal, samples of the size used can be expected to result in values of $\chi^2$ as large as 3.84 by chance about 5 times out of 100. Since the value calculated is well above 3.84, it is concluded that the difference observed is unlikely to have occurred by chance alone and that the cases and controls really did differ in the extent to which they drank pond water at place B. In statistical terminology, a significant difference has been found between cases and controls at the 0.05 level of significance.
The conclusion would have been quite different if the case-control differences for the characteristic “brought food from home”, as shown in Table A2.3, had been tested. There the record showed that 32 out of 40 cases (80.0%) had brought food from home and 39 out of 55 controls (70.9%) did so. Calculating Pearson’s chi-squared statistic as above gives a value of $\chi^2$ equal to 0.589. Since this is below 3.84, it can be concluded that the difference between cases and controls could well be a chance variation.
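The five steps for the unmatched 2 × 2 table can be sketched in Python as follows. The sketch assumes the continuity-corrected form n × (P1 - n/2)² / P2, which reproduces the value of 0.589 quoted for “brought food from home”; the function name and the built-in rule-of-thumb check are illustrative, not part of the original procedure.

```python
def two_by_two_chi_squared(a, b, c, d):
    """Chi-squared for an unmatched 2 x 2 table.

    a: cases with the characteristic      b: controls with the characteristic
    c: cases without the characteristic   d: controls without the characteristic
    """
    n = a + b + c + d
    # Rule of thumb: at least 10 people with and 10 without the characteristic
    if (a + b) < 10 or (c + d) < 10:
        raise ValueError("fewer than 10 with or without the characteristic; consult a statistician")
    p1 = abs(a * d - b * c)                      # Steps 1 and 2
    p2 = (a + c) * (b + d) * (a + b) * (c + d)   # Step 3: product of all subtotals
    corrected = max(p1 - n / 2, 0)               # continuity correction (assumed form)
    return n * corrected ** 2 / p2               # Step 4


pond_water = two_by_two_chi_squared(35, 17, 5, 38)
food_from_home = two_by_two_chi_squared(32, 39, 8, 16)
print(f"Pond water:     chi-squared = {pond_water:.2f}")
print(f"Food from home: chi-squared = {food_from_home:.3f}")
```

Only the pond-water value exceeds 3.84, in line with the conclusions drawn in the text.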
It must be emphasized that the analysis just described is not appropriate if the number of people observed is small and rates are extremely high or extremely low. A rule of thumb is that, in the combined group of cases and controls (or males and females, etc.), at least 10 persons must be observed to exhibit the phenomenon and at least 10 persons must be observed not to do so. If these figures are not reached, other and more complicated statistical procedures will be needed, and the help of a statistician must be sought.^{1}
^{1} The analyses described in this Annex and the formulae used have been developed by Professor Dale E. Mattson, for use in the restricted circumstances described. For a detailed discussion of the general use of Pearson’s chi-squared test, see: Mattson, D. E., Statistics: difficult concepts, understandable explanations, Chicago, Bolchazy-Carducci Publishers, 1984 (chapter 9, lesson 3).
Two-sample studies (paired data). In the preceding section, Pearson’s chi-squared statistic was used on the assumption that the groups being compared were independent of one another. If males were compared with females or cases with controls, for example, it was assumed that each group represented all the components of its subpopulation. Sometimes, and particularly with relatively small samples in case-control studies, this assumption cannot be made. For example, in the analysis of the dysentery outbreak, it was assumed that the fishermen were a fairly homogeneous group and that the 40 cases and 55 controls were reasonably representative. If, however, the cases in a different outbreak occurred among members of three socially isolated clans, which had distinctive customs and habits and, furthermore, if different age groups traditionally lived and worked separately from each other, it could not be assumed that the 40 cases and the 55 controls selected at random would adequately represent the subgroups.
To overcome the danger that cases and controls might come from different subgroups, and might therefore have different exposure histories simply because of that fact, the procedure known as matching can be employed, whereby for each case identified as belonging to a particular clan and age group, a control belonging to the same clan and age group would be selected. There would then be 80 subjects in the study, representing not two independent groups (cases and controls) but arranged as 40 pairs of subjects, the members of each pair differing from each other only in that one is a case and the other a control. In this situation, Pearson’s chi-squared statistic is inappropriate. Instead, McNemar’s chi-squared test for correlated proportions may be used. The symbol $\chi^2_M$ will be used for the corresponding statistic.
Suppose that there are 40 case-control pairs in a study constructed as just described. When each pair is questioned about drinking pond water, four results are possible:
(a) both drank pond water;
(b) the case drank pond water, but the control did not;
(c) the control drank pond water, but the case did not;
(d) neither drank pond water.
The answers to the questions may be tabulated as follows:

| | Control: yes, drank pond water | Control: no, did not drink pond water |
|---|---|---|
| Case: yes | 24 = a | 11 = b |
| Case: no | 3 = c | 2 = d |
The calculation of $\chi^2_M$ depends only on the number of pairs with differing outcomes for cases and controls, i.e., cells b and c. The steps are as follows:
Step 1. Count the number of pairs in which the case exhibits the phenomenon in question (drinking pond water) and the control does not. Call the result b. In the example, b = 11.
Step 2. Count the number of pairs in which the control exhibits the phenomenon and the case does not. Call the result c. In the example c = 3.
Step 3. Subtract b from c or c from b, depending on which is the smaller. In the example, since c is the smaller, it is subtracted from b (11 - 3 = 8). Call the result P_{1}.
Step 4. Add b and c to obtain P_{2}. In the example, P_{2} = 11 + 3 = 14.
Step 5. Calculate $\chi^2_M$ from the following formula:
$$\chi^2_M = \frac{(P_1 - 1)^2}{P_2}$$
In the example:
$$\chi^2_M = \frac{(8 - 1)^2}{14} = 3.5$$
Step 6. Compare $\chi^2_M$ with the number 3.84. Once again, if the calculated value of $\chi^2_M$ is equal to or greater than 3.84, it can be concluded that the rates for the subpopulation represented by cases and for that represented by controls are different. In the example, even though 35 cases had drunk pond water and only 27 controls had done so, the value of $\chi^2_M$ is less than 3.84 and it is therefore concluded that the difference could have been a chance variation (although it is a borderline result). In statistical terminology, the difference in rates (of drinking pond water) is not significant at the 0.05 level.
Once again, the analysis just described does not apply to all situations. It is appropriate only when sample sizes and rates are large enough to result in a value for P_{2} of at least 10. If P_{2} is less than 10, the solution is much more complicated and an epidemiologist without statistical training will need to seek the advice of a statistician.
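The paired calculation can be sketched in Python as follows. The sketch assumes the continuity-corrected form (P1 - 1)² / P2, which is consistent with the borderline, non-significant result reported for the example; the function name and error message are hypothetical.

```python
def mcnemar_chi_squared(b, c):
    """McNemar's chi-squared for matched case-control pairs.

    b: pairs in which only the case shows the phenomenon
    c: pairs in which only the control shows the phenomenon
    Pairs with the same outcome for both members play no part in the calculation.
    """
    p1 = abs(b - c)  # Step 3: difference between the discordant counts
    p2 = b + c       # Step 4: total number of discordant pairs
    if p2 < 10:
        raise ValueError("P2 is less than 10; consult a statistician")
    # Step 5: statistic with continuity correction (assumed form, see the lead-in)
    return max(p1 - 1, 0) ** 2 / p2


chi2_m = mcnemar_chi_squared(b=11, c=3)
verdict = "significant" if chi2_m >= 3.84 else "not significant"
print(f"McNemar chi-squared = {chi2_m:.2f} ({verdict} at the 0.05 level)")
```

For the 40 pairs in the example only the 14 discordant pairs matter, and the result of 3.5 falls just short of 3.84.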
BIBLIOGRAPHY
Guidelines on studies in environmental epidemiology. Geneva, World Health Organization, 1983 (Environmental Health Criteria No. 27).
MATTSON, D. E. Statistics: difficult concepts, understandable explanations. Chicago, Bolchazy-Carducci Publishers, 1984.