Medicine - Epidemiology (ECHO - NOHA - Network on Humanitarian Assistance) (European Commission Humanitarian Office, 1994, 120 p.)

Chapter 1: Epidemiology and biostatistics
A - Types of data
Raw data of an investigation consist of observations made on individuals. In many situations the individuals are people, but they need not be. For instance, they might be red blood cells or hospitals. The number of individuals is called the sample size. Any aspect of an individual that is measured, like age, weight or sex, is called a variable.
It is often useful to distinguish between three types of variables: qualitative, discrete, and continuous. Discrete and continuous variables are often called quantitative. Qualitative (or categorical) data arise when individuals fall into separate classes. These classes may have no numerical relationship with one another, e.g. sex: male, female; eye colour: brown, grey, blue, green.
Discrete data are numerical, arising from counts. Their values are integers (whole numbers), like the number of people in a household or the number of cases of pertussis in a week. If the values of the measurements can take any number in a range, such as height or weight, the data are said to be continuous.
In practice there is an overlap between these categories. Most continuous data are limited by the precision with which measurements can be made. Human height, for example, is difficult to measure more precisely than to the nearest millimetre and is usually measured to the nearest centimetre. So, only a limited set of possible measurements is actually available, although the quantity "height" can take an infinite number of possible values. The measured height is really discrete. However, the methods described below for continuous data are equally appropriate for such discrete variables.
B - Frequency distributions
If data are purely qualitative, the simplest way to deal with them is to count the number of cases in each category.
The count of individuals in each category is called the frequency of that category; for example, the frequency of deaths from earthquakes is 389,700. The proportion of individuals in this category, relative to all deaths, is called the relative frequency. The relative frequency of deaths from earthquakes is 389,700/1,011,200 = 0.385, or 38.5 per cent.
If the categories are ordered, we can use another set of summary statistics, the cumulative frequencies.
The 2,115 victims of accidents were classified according to rough impressions of their injuries. Such a classification could be useful for planning the resources for medical help. The cumulative frequency of a value of a variable is the number of individuals with values less than or equal to that value. Thus, if we order grade of injury from slight to critical, the cumulative frequencies are 81; 521 (= 81 + 440); 1,567 (= 81 + 440 + 1,046), etc. The cumulative relative frequency of a value is the proportion of individuals in the sample with values less than or equal to that value. For example, they are 0.038 (= 81/2,115), 0.246 (= 521/2,115), etc. Thus, we can see that the proportion of victims whose injuries fall into the first three grades is 0.741, or 74.1 per cent.
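The cumulative counts above can be reproduced with a short sketch. The first three grade counts (81, 440, 1,046) are given in the text; the fourth count of 548 is an assumption chosen so that the grades sum to the 2,115 victims.

```python
from itertools import accumulate

# Injury grades ordered from slight to critical. The last count (548) is
# inferred so the counts sum to 2,115 victims, and is therefore an assumption.
counts = [81, 440, 1046, 548]
total = sum(counts)                      # 2,115 victims

cum = list(accumulate(counts))           # cumulative frequencies
cum_rel = [c / total for c in cum]       # cumulative relative frequencies

print(cum)                               # [81, 521, 1567, 2115]
print([round(c, 3) for c in cum_rel])    # [0.038, 0.246, 0.741, 1.0]
```

Because the grades are ordered, each cumulative entry answers "how many victims were injured this badly or less".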
This frequency distribution is not a very informative summary of the data, most of the values occurring only once. The cumulative frequencies are quite satisfactory.
To get a useful frequency distribution we need to divide the height scale into class intervals, e.g. from 155 to 160, from 160 to 165 and so on, and count the number of individuals in each class interval. The class intervals should not overlap, so we must decide which interval contains the boundary point to avoid it being counted twice. It is usual practice to put the lower boundary of an interval into that interval and the higher boundary into the next interval. Thus, the interval starting at 155 and ending at 160 contains 155 but not 160.
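The boundary rule described above (each interval contains its lower boundary but not its upper one) can be sketched as follows; the start of 155 and width of 5 match the example, and the function name is hypothetical.

```python
# A minimal sketch of the class-interval rule described above: each
# half-open interval [lower, upper) contains its lower boundary but
# not its upper one, so no value is counted twice.

def interval_of(value, start=155, width=5):
    """Return the half-open class interval [lower, upper) containing value."""
    lower = start + width * ((value - start) // width)
    return (lower, lower + width)

print(interval_of(155))    # (155, 160) -- 155 belongs to the first interval
print(interval_of(160))    # (160, 165) -- 160 goes into the next interval
print(interval_of(159.9))  # (155.0, 160.0)
```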
C - Histograms and other frequency graphs
The frequency distribution can be calculated easily and accurately by using a computer. Without a computer, the data should be ordered from lowest to highest value before setting the interval boundaries and counting; this is rather like starting from Table 3.
Graphical methods are very useful for examining frequency distributions. Figure 1 shows a graph of the cumulative frequency distribution for the height data. This plot is very useful for calculating some of the summary statistics presented later. The most common way of depicting a frequency distribution is by a histogram. This is a diagram in which the class intervals lie on one axis and rectangles with heights or areas proportional to the frequencies are erected on them. The vertical scale shows the relative frequency of observations in each interval.
We often want to summarize a frequency distribution in a few numbers, to facilitate reporting or comparison. The most direct method is to use quantiles. The quantiles are sets of values which divide the distribution into a number of parts so that there are equal numbers of observations in each part. For example, the median is a quantile. The median is the central value of the distribution, such that half the points are less than or equal to it and half are greater than or equal to it. If we have an even number of points, we take the mean of the two central values. For the height example, we have 22 values, so we take the mean of the two central values (the 11th and 12th of the ordered values): (176 + 178)/2 = 177 cm, which we easily get from the cumulative frequencies. We can get any quantile easily from the cumulative frequency distribution.
In general, we estimate the q quantile, the value such that a proportion q of the distribution lies below it, as follows. We have n ordered observations which divide the scale into n + 1 parts: below the lowest observation, above the highest, and between each adjacent pair. The proportion of the distribution which lies below the i-th observation is estimated by i/(n + 1). We set this equal to q and get i = q(n + 1). If i is an integer, the i-th observation is the required quantile. If not, let j be the integer part of i, the part before the decimal point; then we take the (j + 1)-th observation as the q quantile. Other quantiles which are particularly useful are the quartiles of the distribution. The quartiles divide the distribution into four equal parts. For the height data the first quartile is 174 cm: i = 0.25 × 23 = 5.75, so the first quartile is the 6th observation, which we again get from the frequency distribution. We often divide the distribution into centiles. For the 10th centile of height, i = 0.1 × 23 = 2.3, so the 10th centile is the 3rd observation, 168 cm. We can estimate quantiles from Figure 1 by finding the position of the quantile on the vertical axis, e.g. 0.1 for the 10th centile or 0.9 for the 90th centile, drawing a horizontal line to intersect the cumulative frequency graph, and reading the quantile off the horizontal axis.
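The i = q(n + 1) rule above can be sketched directly. The original height data are not reproduced here, so the sample below is invented (the ordered integers 1 to 22) purely to illustrate which observation the rule selects.

```python
import math

def quantile(ordered, q):
    """Estimate the q quantile by the i = q(n + 1) rule: if i is an
    integer, take the i-th observation; otherwise take the (j + 1)-th,
    where j is the integer part of i."""
    n = len(ordered)
    i = q * (n + 1)
    if i == int(i):
        return ordered[int(i) - 1]   # i-th observation (1-based)
    j = math.floor(i)
    return ordered[j]                # (j + 1)-th observation (1-based)

# Invented ordered sample of n = 22 values for illustration:
data = list(range(1, 23))            # 1, 2, ..., 22
print(quantile(data, 0.25))          # i = 5.75, so the 6th observation: 6
print(quantile(data, 0.10))          # i = 2.3,  so the 3rd observation: 3
```

Note that for the median of an even-sized sample this general rule picks a single central observation, whereas the convention described earlier averages the two central values; the two approaches can differ slightly.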
A convenient graphical summary of a distribution is the box and whisker plot, which uses the median, quartiles, maximum and minimum of the observations.
D - The mean
The median is not the only measure of central value for a distribution. Another is the arithmetic mean or average, usually referred to simply as the mean. It is found by taking the sum of the observations and dividing it by their number. For the height example the sum of all values is 3,906, so the mean is 3,906/22 = 177.5 cm.
At this point we need to introduce some algebraic notation, widely used in epidemiology. We denote the observations by :
x1, x2, ..., xi, ..., xn
There are n observations and the i-th of these is xi.
The summation sign is an upper-case Greek letter, Σ (sigma), the Greek S. When it is obvious that we are adding the values of xi for all values of i, which runs from 1 to n, the sum is written simply Σxi.
The mean of the xi is denoted by x̄, pronounced 'x bar', and x̄ = (1/n) Σxi.
In this example the mean is very close to the median, 177. If the distribution is symmetrical the mean and median will be about the same, but in a skewed distribution they will not.
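The relationship between mean and median described above can be illustrated with a short sketch. Both samples below are invented; the skewed one contains a single extreme value that pulls the mean away from the median.

```python
# Mean as defined above: x-bar = (1/n) * sum of the x_i.
# Invented samples: one roughly symmetric, one skewed by an outlier.
symmetric = [171, 174, 177, 180, 183]
skewed = [171, 172, 173, 174, 200]

def mean(x):
    return sum(x) / len(x)

def median(x):
    s = sorted(x)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

print(mean(symmetric), median(symmetric))  # 177.0 177 -- about the same
print(mean(skewed), median(skewed))        # 178.0 173 -- pulled apart
```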
E - Variance and standard deviation
The mean and median are measures of the central tendency or position of the middle of the distribution. We shall also need a measure of the spread, dispersion or variability of the distribution.
One obvious measure is the range, the difference between the highest and lowest value. This is a useful descriptive measure, but it has two disadvantages. First, it depends only on the extreme values and so it can vary a lot from sample to sample. Secondly, it depends on the sample size: the larger the sample size, the further apart the extremes are likely to be.
The most commonly used measures of dispersion are the variance and standard deviation, which we shall describe now. We start by seeing how each observation differs from the mean. Table 6 shows the deviations from the mean of the 22 observations of height. If the data are widely scattered, many of the observations will be far from the mean, and so many deviations will be large. If the data are narrowly scattered, few observations will be far from the mean and so few deviations will be large. We square the deviations and then add them, as shown in Table 6. This gives us the sum of squares about the mean, Σ(xi - x̄)², which in the example is equal to 1,269.5. For an average squared deviation, we divide the sum of squares by (n - 1), not n. The resulting estimate of variability is called the variance, defined as follows:
s² = Σ(xi - x̄)² / (n - 1)
The variance is calculated from the squares of the observations. This means that it is not in the same units as the observations, which limits its use as a descriptive statistic. The obvious solution is to take the square root, which will then have the same units as the observations and the mean. The square root of the variance is called the standard deviation, denoted by s:
s = √(Σ(xi - x̄)² / (n - 1))
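The steps above (deviations, sum of squares, division by n - 1, square root) can be sketched on a small invented sample; the result is checked against Python's `statistics` module, which uses the same (n - 1) definition.

```python
import math
import statistics

# Invented sample for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
xbar = sum(data) / n                          # mean = 5.0

sum_sq = sum((x - xbar) ** 2 for x in data)   # sum of squared deviations = 32.0
variance = sum_sq / (n - 1)                   # divide by (n - 1), not n
sd = math.sqrt(variance)                      # back to the original units

# statistics.variance / stdev use the same (n - 1) definition:
assert abs(variance - statistics.variance(data)) < 1e-12
assert abs(sd - statistics.stdev(data)) < 1e-12
print(variance, sd)                           # approx. 4.571 and 2.138
```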