Educational Handbook for Health Personnel (WHO, 1998, 392 p.)
Chapter 2: Evaluation planning

Evaluation in education: qualities of a measuring instrument


1. Some definitions

1.1 Education is defined as a process developed for bringing about changes in the student's behaviour. At the end of a given learning period there should be a greater probability that types of behaviour regarded as desirable will appear; other types of behaviour regarded as undesirable should disappear.

1.2 The educational objectives define the desired types of behaviour taken as a whole; the teacher should provide a suitable environment for the student's acquisition of them.

1.3 Evaluation in education is a systematic process that makes it possible to measure the extent to which the student has attained the educational objectives. Evaluation always includes measurements (quantitative or qualitative) plus a value judgement.

1.4 To make measurements, measuring instruments must be available which satisfy certain requirements so that the results mean something to the teacher himself, the school, the student and society which, in the last analysis, has set up the educational structure.

1.5 In education, measuring instruments are generally referred to as “tests”.

2. Qualities of a measuring instrument

Among the qualities of a test, whatever its nature, four are essential, namely, validity, reliability, objectivity and practicability. Others are also important, but they contribute in some degree to the qualities of validity and reliability.

2.1 Validity: the extent to which the test used really measures what it is intended to measure. No outside factors should be allowed to interfere with the manner in which the evaluation is carried out. For instance, in measuring the ability to synthesize, other factors such as style should not compete with the element to be measured so that what is finally measured is style rather than the ability to synthesize.

The notion of validity is a very relative one. It implies a concept of degree, i.e., one may speak of very valid, moderately valid or not very valid results.

The concept of validity is always specific for a particular subject. For example, results of a test on public health administration may be of very high validity for identification of the needs of the country and of little validity for a cost/benefit or cost/efficiency analysis.

Content validity is determined by the following question: will this test measure, or has it measured, the matter and the behaviour that it is intended to measure?

Predictive validity is determined by questions such as the following when the results of a test are to be used for predicting the performance of a student in another domain or in another situation:

To what extent do the results obtained in physiology help to predict performance in pathology?

To what extent do the results obtained during the pre-clinical years help in predicting the success of students during the clinical years?
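One common way to put a number on such questions is to correlate the two sets of results for the same students. The sketch below is only an illustration under stated assumptions: the handbook does not prescribe this calculation, the function name pearson_r is hypothetical, and the scores are invented for the example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for the correlation to be defined")
    return cov / sqrt(var_x * var_y)

# Invented scores for the same six students (illustrative data only).
physiology = [52, 61, 70, 74, 80, 88]
pathology = [50, 58, 72, 69, 83, 85]
r = pearson_r(physiology, pathology)
print(f"correlation between physiology and pathology results: {r:.2f}")
```

A coefficient close to 1 would suggest that the earlier results predict the later ones well; a coefficient close to 0 would suggest little predictive validity for that particular use.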

2.2 Reliability: this is the consistency with which an instrument measures a given variable.

Reliability is always connected with a particular type of consistency: the consistency of the results in time; consistency of results according to the questions; consistency of the results according to the examiners.

Reliability is a necessary but not a sufficient condition for validity. In other words, valid results are necessarily reliable, but reliable results are not necessarily valid. Consequently, results that are not very reliable affect the degree of validity. Unlike validity, reliability is a strictly statistical concept and is expressed by means of a reliability coefficient or through the standard error of the measurements made.

Reliability can therefore be defined as the degree of confidence that can be placed in the results of an examination. It is the consistency with which a test gives the results expected.
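The link between the reliability coefficient and the standard error of measurement mentioned above can be illustrated with the usual textbook relation SEM = SD x sqrt(1 - r). This is a sketch under that assumption; the function name and the figures are hypothetical and do not come from the handbook.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement from the score standard deviation
    and the reliability coefficient (classical relation, assumed here)."""
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)

# Illustrative figures: scores with a standard deviation of 10 points
# and a reliability coefficient of 0.84 give an SEM of about 4 points.
sem = standard_error_of_measurement(sd=10.0, reliability=0.84)
print(f"standard error of measurement: {sem:.1f} points")
```

The higher the reliability coefficient, the smaller the standard error, and the more confidence can be placed in an individual student's score.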

2.3 Objectivity: this is the extent to which several independent and competent examiners agree on what constitutes an acceptable level of performance.
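Agreement between examiners can be checked numerically. The sketch below assumes two examiners who each give a pass/fail decision for the same candidates, and computes simple percent agreement together with Cohen's kappa (agreement corrected for chance). These indices are one common choice rather than a method taken from the handbook; the function names and the decisions listed are invented for illustration.

```python
def percent_agreement(a, b):
    """Proportion of candidates on whom two examiners give the same decision."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty decision lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    categories = set(a) | set(b)
    # Chance agreement: product of each examiner's rate for every category.
    p_chance = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)

# Invented decisions by two examiners on the same eight candidates.
examiner_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
examiner_2 = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]
print(f"percent agreement: {percent_agreement(examiner_1, examiner_2):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(examiner_1, examiner_2):.2f}")
```

Here the examiners agree on 6 of the 8 candidates, but the kappa value is noticeably lower than the raw agreement because part of that agreement would be expected by chance alone.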

2.4 Practicability depends upon the time required to construct an examination, to administer and score it, and to interpret the results, and on its overall simplicity of use. It should never take precedence over the validity of the test.

3. Other qualities of a measuring instrument

3.1 Relevance: the degree to which the criteria established for selecting questions (items), so that they conform to the aims of the measuring instrument, are respected. This notion is almost identical to that of content validity, and the two qualities are established in a similar manner.

3.2 Equilibrium: achievement of the correct proportion among questions allocated to each of the objectives.

3.3 Equity: extent to which the questions set in the examination correspond to the teaching content.

3.4 Specificity: quality of a measuring instrument whereby an intelligent student who has not followed the teaching on the basis of which the instrument has been constructed will obtain a result equivalent to that expected by pure chance.
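The "result expected by pure chance" can be made concrete for one simple case. The sketch below assumes a multiple-choice test in which every item has the same number of options, exactly one of them correct, and a student who guesses blindly; the function name and figures are hypothetical.

```python
def expected_chance_score(n_items, n_options):
    """Expected number of correct answers from blind guessing on
    multiple-choice items with one correct option each (assumed format)."""
    if n_items < 0 or n_options < 1:
        raise ValueError("need a non-negative item count and at least one option")
    return n_items / n_options

# Illustrative case: 40 items with 5 options each.
chance_score = expected_chance_score(n_items=40, n_options=5)
print(f"score expected by pure chance: {chance_score:.0f} out of 40")
```

On a specific test, an intelligent student who has not followed the teaching should score close to this figure and not markedly above it.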

3.5 Discrimination: quality of each element of a measuring instrument which makes it possible to distinguish between good and poor students in relation to a given variable.

3.6 Efficiency: quality of a measuring instrument which ensures the greatest possible number of independent answers per unit of time.1

1 This definition of efficiency has a narrower meaning than the one given in the glossary (p. 6.05); it applies only to evaluation instruments (pp. 2.33 - 2.37).

3.7 Time: it is well known that a measuring instrument will be less reliable if it leads to the introduction of non-relevant factors (guessing, taking risks or chances, etc.) because the time allowed is too short.

3.8 Length: the reliability of a measuring instrument can be increased progressively towards its upper limit (Spearman-Brown formula) by the addition of new questions equivalent to those constituting the original instrument.
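The Spearman-Brown formula referred to under "Length" predicts the reliability of a test lengthened k times with equivalent questions: r_k = k r / (1 + (k - 1) r), where r is the reliability of the original test. The sketch below simply evaluates that formula; the function name and the starting value of 0.60 are chosen for illustration only.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`
    using questions equivalent to the original ones."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

# Illustrative case: an original test with reliability 0.60.
for k in (1, 2, 4, 8):
    print(f"length x{k}: predicted reliability {spearman_brown(0.60, k):.2f}")
```

Doubling the test raises the predicted reliability from 0.60 to about 0.75; each further doubling brings a smaller gain, and the value approaches but never exceeds 1.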


Validity: The extent to which the instrument really measures what it is intended to measure.

Reliability: The consistency with which an instrument measures a given variable.

Objectivity: The extent to which several independent and competent examiners agree on what constitutes an acceptable level of performance.

Practicability: The overall simplicity of use of a test, both for test constructor and for students.


Relationships between the characteristics of an examination


The diagram on the next page, suggested by G. Cormier, represents an attempt to sum up the concepts of testing worked out by a number of authors. However, no diagram can give a perfect representation of reality and the purpose of the following lines is to explain rather than justify the diagram.

A very good treatment of all these concepts will be found in the book by Robert Ebel entitled Measuring Educational Achievement (Prentice Hall, 1965).

Validity and reliability

Ebel shows that “to be valid a measuring instrument (test) must be both relevant and reliable.” This assertion justifies the initial dichotomy of the diagram. It is, moreover, generally agreed that “a test can often, if not always, be made more valid if its reliability is increased.”

Validity and relevance

According to Ebel's comments, it seems that the concept of relevance corresponds more or less to that of validity of content. In any case, both are established in a similar manner (by consensus).

By definition, a question is relevant if it adds to the validity of the instrument, and an instrument is relevant if it respects the specifications (objectives and taxonomic levels) established during its preparation.

Relevance and equilibrium

It seems, moreover, that the concept of equilibrium is only a sub-category of the concept of relevance and that is why the diagram shows it as such.

Relevance and equity

It seems evident that if the instrument is constructed on the basis of a content itself determined by objectives, then it will be relevant by definition. If this is not done, then the instrument will not be relevant and consequently not valid. It is equitable in the first case and non-equitable in the second. However, an examination can be equitable without being relevant (or valid) when, although it corresponds well to the teaching content, the latter is not adequately derived from the objectives.

Equity, specificity and reliability

The diagram reflects the following implicit relationship: a test cannot be equitable if it is not first specific. Specificity, just like equity and for similar reasons, will affect the reliability of the results.

Reliability, discrimination, length, homogeneity (of questions) and heterogeneity (of students)

According to Ebel, reliability is influenced by the extent to which the questions (items) clearly distinguish competent from incompetent students, the number of items, the similarity of the items as regards their power to measure a given skill and the extent to which students are dissimilar with respect to that skill. The discriminating power of a question is directly influenced (see pages 4.73 - 4.75) by its level of difficulty. The mean discrimination index of an instrument will also be affected by the homogeneity of the questions and the heterogeneity of the students. From the comments made above it can be seen how equity and specificity will also influence the discriminating power of the instrument.
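The relation between the difficulty of a question and its discriminating power can be explored with a simple item analysis. The sketch below uses a widely used upper-group/lower-group convention, which is an assumption here and not necessarily the exact procedure described on the pages cited above: difficulty is taken as the proportion of students answering correctly, and discrimination as the difference in correct answers between the best and weakest groups divided by the group size. The function name, the 27% group fraction and the data are illustrative.

```python
def item_analysis(total_scores, item_correct, group_fraction=0.27):
    """Return (difficulty, discrimination) for one question.

    total_scores: each student's total score on the whole test
    item_correct: 1 or 0 per student for this question, in the same order
    """
    n = len(total_scores)
    if n != len(item_correct) or n < 2:
        raise ValueError("need matching lists for at least two students")
    if not 0.0 < group_fraction <= 0.5:
        raise ValueError("group fraction must be above 0 and at most 0.5")
    group_size = max(1, int(n * group_fraction))
    # Rank students from highest to lowest total score.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    upper_correct = sum(item_correct[i] for i in upper)
    lower_correct = sum(item_correct[i] for i in lower)
    difficulty = sum(item_correct) / n
    discrimination = (upper_correct - lower_correct) / group_size
    return difficulty, discrimination

# Invented results for ten students on one question.
totals = [95, 90, 85, 80, 70, 65, 60, 50, 45, 40]
answers = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
difficulty, discrimination = item_analysis(totals, answers)
print(f"difficulty: {difficulty:.2f}, discrimination: {discrimination:.2f}")
```

A question that every student answers correctly, or that every student fails, gives a discrimination of zero under this convention, which is one way to see why the level of difficulty directly influences discriminating power.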


Try to answer questions 22 - 25 on p. 2.47 and check your answers on p. 2.48.

Relationships between characteristics of an examination1

1 As proposed by G. Cormier, Université Laval, Quebec.

N.B. Additional relationships to those suggested in this diagram can be established. The number of links has been kept to a minimum for the sake of clarity and to give a basic idea of the concept as a whole.




For each of the educational objectives you defined on page 1.68, describe two methods of evaluation that seem suitable to you for informing yourself and the student on the extent to which that objective has been achieved. Compare the two methods on the basis of the three criteria shown in the table below.

Examples of methods of evaluation for a class of 200 students

Objective: Make a differential diagnosis of anaemia based on the detailed haematological picture described in the patient's medical record.

Method 1: Modified essay question. A series of 10 short questions based on the patient's record as supplied to the student (1 hour).

Method 2: Student given the patient's record (10 min.), followed by a 15 min. oral examination.




Methods of evaluation for a class of... students1

1 Choose a number of students that is realistic in your situation.


The student should be able to:






Check the meaning of the words validity, objectivity and practicability in the glossary, page 6.01.

For evaluation, the essential quality is validity, but don't forget that for an educational system considered as a whole it is its relevance that is of primary importance.