Measuring drought and drought impacts in Red Sea Province
There are two kinds of statistics:
1. Descriptive: used to organise and summarise data.
2. Inferential: based on probability theory and used to make educated guesses about a population based on information obtained from a sample of the population.
A brief description of each statistical method used in the reports is given below.
1. Average, or measure of central tendency.
The average is a general term used to describe where the central or most typical value of a data set lies. There are three measures of centrality, or the average.
i. The mean. The mean is the sum of the data divided by the number of pieces of data. It is the preferred measure of central tendency for continuous (metric) data providing there are not large numbers of very big or very small values. This is the most commonly used measure of central tendency.
ii. The median. The value that divides the top 50% of the data from the bottom 50% of the data. Used mostly with ranked or ordered data on a scale (ordinal data).
iii. The mode. The value that appears most often in the data set, which might not be the middle in any sense. Used most often for qualitative data.
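As a minimal sketch in Python, the three measures of central tendency described above can be computed with the standard `statistics` module. The household sizes are invented for illustration and are not taken from the reports.

```python
import statistics

# Hypothetical household sizes, invented purely for illustration.
household_sizes = [4, 5, 5, 6, 7, 5, 21]

# Mean: the sum of the data divided by the number of pieces of data.
mean_size = statistics.mean(household_sizes)

# Median: the middle value once the data are sorted.
median_size = statistics.median(household_sizes)

# Mode: the value that appears most often.
mode_size = statistics.mode(household_sizes)

print("mean:", round(mean_size, 2))
print("median:", median_size)
print("mode:", mode_size)
```

The single very large household (21) pulls the mean above every other value in the list, while the median and the mode are unaffected. This is why the mean is preferred only when there are not large numbers of very big or very small values.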
2. Measures of dispersion.
A measure of dispersion is a number used to show how much variation exists in a data set around a central point.
i. Sample standard deviation (s or sd). Deviation refers to deviation from the mean (an individual value minus the mean value). Because these deviations always sum to zero, each one is squared, the squared deviations are averaged (dividing by the sample size minus one), and the square root of the result is taken. The more variation there is in a data set, the bigger the standard deviation. In most data sets, almost all the values fall within three standard deviations either side of the mean.
ii. Coefficient of variation (CV). This expresses the standard deviation as a percentage of the mean.
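The two measures of dispersion can be sketched in Python as follows. The function names `sample_sd` and `coefficient_of_variation` and the data values are assumptions made for illustration only.

```python
import math

def sample_sd(values):
    """Sample standard deviation: square root of the sum of squared
    deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    mean = sum(values) / len(values)
    return 100.0 * sample_sd(values) / mean

# Hypothetical measurements, invented purely for illustration.
values = [10.0, 12.0, 14.0, 16.0, 18.0]

sd = sample_sd(values)
cv = coefficient_of_variation(values)
print("sd:", round(sd, 3))
print("CV (%):", round(cv, 1))
```

Because the CV is a percentage of the mean, it allows the spread of data sets measured in different units, or with very different means, to be compared.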
Inferential statistics are based on probability theory. A brief description of each of the statistical terms used in the report is given here. Details of the theory and calculations involved can be found in several basic textbooks (see below).
1. Statistical significance. A level of probability which is set as a cut-off point for deciding whether differences between two populations are due to some determining factor, or whether they are due to sampling error or chance. The conventional level for this probability (p) is 0.05. The differences are accepted as being due to some determining factor only if differences at least as large would happen by chance less than 5% of the time when there is no true difference between the populations.
2. Z scores (zee scores!). Z scores are calculated by subtracting the mean value for the sample from an individual value, and dividing the result by the standard deviation for the sample. Z scores are standardised scores. Raw data are transformed into a form in which many different types of data can be directly compared on the same scale. This scale is expressed in terms of standard deviations from the mean, and can also be used for calculating the probability of certain scores or proportions of scores occurring by chance. In a normally distributed population, 68% of z scores will be between -1 and 1, 95% will be between -2 and 2, and 99.7% will be between -3 and 3. The closer a z score is to zero, the closer it is to the mean.
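A minimal Python sketch of the z score calculation, taking z as (individual value minus sample mean) divided by the sample standard deviation. The helper names `z_scores` and `share_within` and the data are hypothetical. The second helper assumes a normal distribution, which is what the 68%, 95% and 99.7% figures rest on.

```python
import statistics
from statistics import NormalDist

def z_scores(values):
    """Standardise each value: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

def share_within(k):
    """Proportion of a normal distribution lying within k standard
    deviations either side of the mean (assumes normality)."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Hypothetical measurements, invented purely for illustration.
values = [10.0, 12.0, 14.0, 16.0, 18.0]
zs = z_scores(values)
print([round(z, 2) for z in zs])

for k in (1, 2, 3):
    print(k, "sd:", round(100 * share_within(k), 1), "%")
```

A value equal to the sample mean has a z score of zero, values below the mean have negative z scores, and values above the mean have positive z scores.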
3. Confidence intervals. A confidence interval is a range of values around the sample mean within which we can be reasonably confident that the true population mean or proportion lies, based on information taken from a sample of the population and probability theory. "Reasonably confident" usually is taken as 95% confident. A confidence interval can be constructed for the differences between means or proportions also. If this confidence interval includes zero, it is probable that there is no true difference between the populations. If the confidence interval does not include zero it is probable that differences between the populations are due to some determining factor other than chance.
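One way to sketch a confidence interval for a mean in Python is shown below. It assumes a sample large enough for the normal approximation to be reasonable; for small samples a critical value from the t distribution would be needed instead. The function name `mean_confidence_interval` and the data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Assumption: the sample is large enough to use the normal
    approximation (critical value of about 1.96 for 95%)."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    critical = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean - critical * standard_error, mean + critical * standard_error

# Hypothetical sample of 40 measurements, invented for illustration.
sample = [float(x) for x in range(1, 41)]

low, high = mean_confidence_interval(sample)
print("95% interval:", round(low, 2), "to", round(high, 2))
```

Asking for more confidence (99% rather than 95%, say) widens the interval, and a larger sample narrows it.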
4. T-tests. Tests the (null) hypothesis that two samples come from two populations with the same mean and differ only because of sampling error. Requires the sample size, mean and variance to be known. T-tests assume that the variances in the two populations being compared are equal.
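The t statistic for two independent samples with assumed equal variances can be sketched in Python as below. The function name `pooled_t_statistic` and the two samples are hypothetical. The sketch stops at the t statistic and its degrees of freedom, which would then be compared with a tabulated critical value.

```python
import math
import statistics

def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic assuming equal population variances.

    Returns the t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their
    # degrees of freedom.
    pooled_variance = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    standard_error = math.sqrt(pooled_variance * (1.0 / n_a + 1.0 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t, df

# Hypothetical samples, invented purely for illustration.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [3.0, 4.0, 5.0, 6.0, 7.0]

t, df = pooled_t_statistic(group_a, group_b)
print("t =", round(t, 3), "with", df, "degrees of freedom")
```

With 8 degrees of freedom the tabulated two-tailed critical value at p = 0.05 is about 2.306, so a t statistic of this size would not lead to rejection of the null hypothesis.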
5. One way analysis of variance (ANOVA). Used to test for simultaneous equalities between groups of means, based on two variance measures: one to measure variance within the groups being compared, and one to measure the variance between the groups being compared. These two measures are compared using the F-test. It answers the question "is the variability between groups large enough in comparison with the variability within groups to justify the inference that the means of the populations from which the different groups were sampled were not the same?" If it is determined that the groups being tested are not simultaneously equal, further tests are required to pinpoint exactly where these differences occur (post hoc tests). ANOVA can tell us how much of the variation in one variable is explained by its interaction with another.
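The comparison of between-group and within-group variability can be sketched in Python as follows. The function name `one_way_anova_f` and the three groups are hypothetical, and the sketch computes only the F ratio and its degrees of freedom, not the post hoc tests.

```python
def one_way_anova_f(groups):
    """F ratio for a one way analysis of variance.

    Returns (F, between-groups df, within-groups df)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variability of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical groups, invented purely for illustration.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]

f_ratio, df_between, df_within = one_way_anova_f(groups)
print("F =", round(f_ratio, 2), "on", df_between, "and", df_within, "df")
```

For 2 and 6 degrees of freedom the tabulated critical value of F at p = 0.05 is about 5.14, so an F ratio well above that would justify the inference that the group means are not all equal.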
6. Simple correlation. Measures the degree of linear, or "straight-line", relationship between two variables. It may be positive (direct) or negative (inverse). It is expressed as a correlation coefficient (r) which is between -1 and 1. If the correlation coefficient is 1, all data points lie on a straight line with a positive slope. If the correlation coefficient is -1, all data points lie on a straight line with a negative slope. A correlation, however strong, does not imply causality. Inferences about correlations in the population can be made from the sample correlation. Significance testing can tell us whether the sample correlation coefficient differs from zero by more than would be expected by chance. Different methods of calculating correlation coefficients can be used depending on the characteristics of the data being investigated.
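As an illustration, the product-moment (Pearson) correlation coefficient, which is one of the methods available, can be sketched in Python as below. The function name `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient between two
    equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical paired observations, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
rising = [2.0, 4.0, 6.0, 8.0]   # lies exactly on a line with positive slope
falling = [8.0, 6.0, 4.0, 2.0]  # lies exactly on a line with negative slope

print("rising:", round(pearson_r(xs, rising), 3))
print("falling:", round(pearson_r(xs, falling), 3))
```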
a. Simple regression. Where correlation describes the strength of a linear relationship between two variables, simple regression describes the form of this relationship and allows us to make predictions about how values outside the sample will behave. Using simple regression it is possible to determine to what extent change in one variable influences change in another variable.
b. Multiple regression. An extension of simple regression which is widely used for determining the relationship between one outcome variable and a combination of two or more predictor variables. This technique is particularly useful for examining complex data sets where one outcome variable (such as percent weight for height) may be influenced by many other predictor variables (ranging from age to mother's education, for example). A model of how all these predictor variables best fit together to explain changes in the outcome variable is produced.
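The simple regression described in (a) can be sketched in Python as an ordinary least-squares fit of a straight line. The function name `fit_line` and the data are hypothetical. Multiple regression extends the same idea to several predictor variables and is normally left to a statistical package.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = intercept + slope * x.

    Returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical predictor and outcome values, invented for illustration.
predictor = [1.0, 2.0, 3.0, 4.0]
outcome = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(predictor, outcome)
print("slope:", slope, "intercept:", intercept)

# Predicted outcome for a predictor value outside the sample.
predicted = intercept + slope * 5.0
print("predicted outcome at 5.0:", predicted)
```

The slope gives the expected change in the outcome variable for a one-unit change in the predictor. Predictions far outside the range of the sample should be treated with caution, since the straight-line form may not hold there.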
8. Factor analysis. A complex technique for analyzing patterns of common variation, or intercorrelation, among many variables and isolating the dimensions to account for these patterns. Factor analysis describes the patterns of correlation between many variables in terms of a relatively small number of common factors. It is particularly useful for generating new hypotheses about the relationships between variables.
Weiss, N. and Hassett, M. (1982) Introductory Statistics. Addison-Wesley Publishing Co. Inc., Reading, Massachusetts.
Isaac, S. and Michael, W.B. (1983) Handbook in Research and Evaluation for Education and the Behavioral Sciences. EdITS Publishers, San Diego, California.
Sage University Paper Series on Quantitative Applications in the Social Sciences. Sage Publications, Beverly Hills, California.
Concepts and Techniques in Modern Geography (CATMOG) series, Geo Abstracts, University of East Anglia, Norwich.
World Health Organisation (WHO) (1983) Measuring Change in Nutritional Status. WHO, Geneva.