Counting and Identification of Beneficiary Populations in Emergency Operations (ODI, 1997, 110 p.)
The following is included as a complement to information provided in Chapters 5 and 6 on methods of population estimation.
An introduction to statistics - sampling and statistical inference
Statistics is the science of analysing data. It tells us how data can be collected, organised and analysed, and how to draw conclusions from the data correctly. Without statistics, it would be impossible to perform the calculations behind many familiar things such as political polls, the approval of new medicines, unemployment figures, etc.
In statistics, population does not only refer to people; it is used to mean any group about which you wish to make generalisations. Unless otherwise stated, in this Review, population refers to the beneficiary population of emergency operations. If you make a complete study of a population, i.e. collecting information on each individual within that population, you are taking a census. However, there are a number of situations in which it is necessary to take a sample rather than a census.
Sampling - the examination of only part of a population, and the making of assumptions about the total population based on the results of that sample - is the key to understanding statistics.
When collecting information on a given population or group, it is more likely that a sample will be used than a full census of all members of the population. Hypothesis testing is used to check that the right conclusion has been reached, based on the results of the sample. There are many statistical methods for testing a hypothesis, some of which are explained briefly in this chapter. For a deeper understanding of statistics a professional manual, or course is recommended.
Using the characteristics of the sample to generalise about a parent population is known as induction - if 3% of a sample possess a particular characteristic, it may be assumed by induction that approximately 3% of the total population possess the same characteristic. This is an important technique for estimating the total population in an emergency settlement, for example, based on information collected on only part of the settlement. A condition of such a method of representative sampling is that all members of the population have the same chance of being selected for examination.
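The induction step described above can be sketched in a few lines. This is a minimal illustration; the settlement size and sample figures are invented, not taken from the text.

```python
# Minimal sketch of estimation by induction: scale a sample proportion
# up to the whole population. All figures below are invented.
settlement_households = 12_000   # assumed total number of households
sample_size = 400                # households actually surveyed
with_characteristic = 12         # of those, how many show the characteristic

sample_proportion = with_characteristic / sample_size            # 0.03, i.e. 3%
estimated_total = round(sample_proportion * settlement_households)

print(f"Sample proportion: {sample_proportion:.0%}")                       # 3%
print(f"Estimated households with the characteristic: {estimated_total}")  # 360
```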
Ensuring that the sample is representative can be harder than it seems. It is important to define the population as well as the sample. For example, sampling inhabitants of a particular emergency-affected area will tell you about the population of that area; it will not necessarily give you specific information about the beneficiaries of the emergency operation, because the inhabitants of the area will not necessarily be beneficiaries of the emergency programme. There are two kinds of statistical inference, estimation and hypothesis testing.
Random sampling
Every person or household in the population has an equal probability of being included in a random sample (e.g. picking names from a hat, or using the blindfold-and-pin exercise to pick them from a list). To draw any valid conclusion, the sample must be representative of the whole population. For example, nutritional data obtained from health services are not representative of the whole population; nor are data collected in the most accessible villages or centres, or in camps reported to be in a particularly bad state.
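In practice, a random sample can be drawn from a list with a few lines of code. The household register below is hypothetical; the point is only that every entry has an equal chance of selection.

```python
import random

# Sketch of simple random sampling from a (hypothetical) list of
# registered households: every household has an equal chance of selection.
households = [f"household_{i}" for i in range(1, 10_001)]  # assumed register

random.seed(42)                      # fixed seed so the example is repeatable
sample = random.sample(households, 200)

print(len(sample))                   # 200
print(len(set(sample)))              # 200 - no household selected twice
```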
Systematic sampling
Here, cases are selected at fixed intervals along a list. For example, to select 200 cases from a listed population of 10,000, one can select every fiftieth case.
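The every-fiftieth-case rule can be sketched as follows; a random starting point within the first interval (a common refinement, not stated in the text) avoids always beginning at the top of the list.

```python
import random

# Sketch of systematic sampling: every fiftieth case from a listed
# population of 10,000 yields a sample of 200, as in the text.
population = list(range(1, 10_001))   # hypothetical numbered list
interval = len(population) // 200     # 50

random.seed(0)
start = random.randrange(interval)    # random starting point in first interval
sample = population[start::interval]  # every fiftieth case thereafter

print(interval, len(sample))          # 50 200
```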
Cluster or stratified sampling
Instead of selecting individual units as above, in this case the researcher divides the population into groups or categories, called strata or clusters (e.g. by location, ethnic origin, religion, gender, age etc.). By doing this, one can guarantee that certain priority groups are represented. The total population in that group may be surveyed, or random samples may be drawn from each group or stratum. While not as statistically sound as pure random sampling, it will guarantee that known priority groups or sites are represented. The danger is that because the spread of samples is artificially decided, significant groups or sites may be ignored, and the sample skewed as a result, thus devaluing the information collected.
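A stratified draw can be sketched like this. The three sites and the 10% sampling fraction are invented for illustration; the guarantee is that each stratum contributes to the sample.

```python
import random

# Sketch of stratified sampling: the population is divided into strata
# (here, three hypothetical camp sites) and a random sample is drawn from
# each, guaranteeing that every site is represented.
random.seed(7)
strata = {
    "site_A": [f"A{i}" for i in range(500)],
    "site_B": [f"B{i}" for i in range(300)],
    "site_C": [f"C{i}" for i in range(200)],
}

sampling_fraction = 0.10   # sample 10% of each stratum (assumed)
sample = {
    site: random.sample(members, int(len(members) * sampling_fraction))
    for site, members in strata.items()
}

for site, drawn in sample.items():
    print(site, len(drawn))   # 50, 30 and 20 households respectively
```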
Estimation
Estimation is when you take a random sample from a population and use it to estimate some parameter of that population. The best estimate of the mean (average) of a population is simply the mean of the sample that has been selected.
Knowing that our samples will contain some error, how can we quantify that error? If our sample were 100% of the total population, we could be 100% confident of the accuracy of our calculations; for example, the results of a population registration that covers 100% of the population would in theory be 100% accurate. In many cases we would be satisfied with being 99% confident, or perhaps only 90%. Quite small samples can give us a high confidence level.
The use of increasingly large samples does not necessarily increase the confidence level of the resulting assumptions. How the sample is chosen matters more, because that determines how representative it is of the whole population. Tests exist for working out how much error is associated with any assumption about the whole population based on data from only a sample; these can be found in any textbook on quantitative methods.
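One such test, not spelt out in the text, is the standard error of a proportion, which gives an approximate 95% confidence interval for an estimate. The sample figures below are invented.

```python
import math

# A common way to quantify sampling error for a proportion: the standard
# error and an approximate 95% confidence interval. Figures are invented.
n = 400            # sample size
p = 0.03           # observed sample proportion (3%)

standard_error = math.sqrt(p * (1 - p) / n)
margin = 1.96 * standard_error          # half-width of the ~95% interval
low, high = p - margin, p + margin

print(f"95% interval: {low:.1%} to {high:.1%}")   # roughly 1.3% to 4.7%
```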
Averages - arithmetic mean, median, mode
Averages express the middle point in a set of numbers (known as observations). There are three principal ways to calculate an average. Each method produces a number with a different meaning, so it is important to check how any average was calculated.
The arithmetic mean: this is the most commonly used average. To calculate it, add together all the observations (numbers in the set) and divide the sum by the number of observations. The arithmetic mean is usually the best way to get an average. The trouble with it is that a single non-typical observation can distort the result.
The median: calculate the median by ranking the numbers in order of value and taking the middle number. If there is an even number of observations, add the two middle numbers and divide by 2.
The mode: the mode tells you the most commonly appearing number in a series of observations. Modes are more useful with a large number of observations. Sometimes a series of observations has more than one mode.
Standard deviation: the standard deviation is based on the deviations of each observation from the arithmetic mean. Some of the deviations are negative numbers, and if you add all the deviations together the sum is zero; the deviations are therefore squared before being averaged, and the standard deviation is the square root of that average squared deviation. In normal distributions, roughly 68% of a distribution is contained within one standard deviation either side of the mean, and 95% within two standard deviations. Among other things, the standard deviation is a very useful statistical measure for estimating the variability we can expect in the future.
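The calculation can be followed step by step from first principles on a small invented data set:

```python
import math

# Standard deviation from first principles: square the deviations (so the
# negatives do not cancel), average them, take the square root. Invented data.
observations = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(observations) / len(observations)            # 5.0

deviations = [x - mean for x in observations]
print(sum(deviations))                                  # 0.0 - they cancel out

variance = sum(d * d for d in deviations) / len(observations)
standard_deviation = math.sqrt(variance)
print(standard_deviation)                               # 2.0
```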
Other measures of dispersion
The range: the range is the difference between the highest and lowest numbers in a set. For example, the range of (3; 5; 6; 7; 9; 23; 145) is 145 - 3 = 142. The trouble with this is that the highest number, 145, is so extreme that it distorts the range. To deal with this, we can carve up the set into quarters, tenths or hundredths. The values at the dividing points are known as quartiles, deciles and percentiles respectively.
The semi-interquartile range: to prevent the distortion mentioned above, we can ignore the first and last quarters of the set of observations and calculate the range between the first and third quartiles. This interquartile range covers the middle 50% of the set, and the semi-interquartile range is half of it.
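On the set from the text, the contrast between the full range and the interquartile range can be computed directly. Quartiles can be defined in several slightly different ways; the sketch below uses one common convention (the medians of the lower and upper halves).

```python
import statistics

# Range and interquartile range on the set from the text, showing how the
# extreme value 145 distorts the range far more than the interquartile range.
observations = sorted([3, 5, 6, 7, 9, 23, 145])

full_range = max(observations) - min(observations)          # 145 - 3 = 142

half = len(observations) // 2
q1 = statistics.median(observations[:half])                 # median of [3, 5, 6] = 5
q3 = statistics.median(observations[-half:])                # median of [9, 23, 145] = 23
interquartile_range = q3 - q1                               # 18
semi_interquartile = interquartile_range / 2                # 9.0

print(full_range, interquartile_range, semi_interquartile)  # 142 18 9.0
```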
Skew: skews are related to graphs of distribution. Example: a survey of the distribution of a country's wealth amongst its people. Expressing this on a graph with the number of people on the vertical axis and the amount of wealth on the horizontal might show a distribution with positive skew (a longer tail to the right). If you had the observations, you could detect this skew, and hence the distribution shape, using the quartile method described above; you would then know, without drawing the graph, that many people had little wealth and a few people had a lot.
Distributions - normal distribution, coefficients of variation
It is a mysterious fact that, in many circumstances, if you have produced a set of numbers that is influenced by many small independent forces, they form the bell-shaped curve known as the normal curve or Gaussian curve after its discoverer, Gauss. All kinds of measurements have a normal distribution, for example, the levels of IQ in people, small differences in the size of a manufactured item and the height of trees in a forest. You will very often find that a data set, if sufficiently large, has a normal distribution. If you know the mean and the standard deviation, you can draw the normal curve; this is the principal reason why the standard deviation is so widely used. The mathematics of the normal curve is simpler than that of other curves, and the results obtained often apply quite well to other distribution shapes.
Characteristics of the normal curve: the mean marks the mid-point of the curve. 50% of the distribution is on one side, and 50% on the other.
Almost 100% of the distribution is within three standard deviations either side of the mid-point.
95% of the distribution is within two standard deviations either side of the midpoint.
68% of the distribution is within one standard deviation either side of the mid point.
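The 68% and 95% figures above can be checked empirically: the sketch below draws a large number of values from a normal distribution and counts how many fall within one and two standard deviations of the mean.

```python
import random

# Empirical check of the characteristics of the normal curve listed above.
random.seed(0)
values = [random.gauss(0, 1) for _ in range(100_000)]   # mean 0, sd 1

within_1 = sum(abs(v) <= 1 for v in values) / len(values)
within_2 = sum(abs(v) <= 2 for v in values) / len(values)
within_3 = sum(abs(v) <= 3 for v in values) / len(values)

print(f"within 1 sd: {within_1:.1%}")   # close to 68%
print(f"within 2 sd: {within_2:.1%}")   # close to 95%
print(f"within 3 sd: {within_3:.1%}")   # close to 100%
```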
Correlation
Is there a connection between smoking and heart disease? Are people who buy novels likely to own a CD player? Statisticians can examine questions like these to see if two such variables are correlated. No correlation is scored as 0 and a perfect correlation as 1; negative scores mean that high values of one variable are associated with low values of the other. Correlations can give clues to a relationship - for example, tall people tend to be heavier - but they do not of themselves prove that one variable is causing changes in the other.
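The correlation score (Pearson's coefficient) can be computed from first principles. The heights and weights below are invented, chosen so that taller people in the set are also heavier, giving a score close to 1.

```python
import math

# Pearson correlation coefficient from first principles on a small
# invented data set of heights and weights.
heights = [150, 160, 165, 170, 180]   # cm, invented
weights = [52, 58, 63, 68, 77]        # kg, invented

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

cov = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
ss_h = math.sqrt(sum((h - mean_h) ** 2 for h in heights))
ss_w = math.sqrt(sum((w - mean_w) ** 2 for w in weights))

r = cov / (ss_h * ss_w)
print(round(r, 3))   # close to 1: taller people in this set are heavier
```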