Educational Handbook for Health Personnel (WHO, 1998, 392 p.)
Chapter 2: Evaluation planning

Contents:
(introduction...)
What is evaluation?
Continuous evaluation: formative and certifying evaluation
Aims of student evaluation
Common methodology for student evaluation
Comparison of advantages and disadvantages of different types of test
Evaluation in education: qualities of a measuring instrument
Evaluation is a matter for teamwork



Evaluation planning


This second chapter presents basic concepts in the field of educational evaluation. It stresses the very close relationship between evaluation and definition of educational objectives; and the primary role of any evaluation, which is to facilitate decision-making by those responsible for an educational system. It defines the subject, the purpose, the goals and the stages of evaluation and highlights the concepts of validity and relevance.

Those who would like to learn more about these problems should consult the following publications:

Development of educational programmes for the health professions. WHO, 1973 (Public Health Papers No. 52).

Evaluation of school performance, educational documentation and information. Bulletin of the International Bureau of Education, No. 184, third quarter 1972, 84 pages.

After studying this chapter and the reference documents mentioned, you should be able to:

1. Draw a diagram showing the relationship between evaluation and the other parts of the educational process.

2. Define the principal role of evaluation, its purpose and its aims.

3. Describe the difference between formative and certifying evaluation.

4. List the good and bad features of a test.

5. Compare the advantages and disadvantages of tests in current use.

6. Define the following terms: validity, reliability, objectivity, and describe the relationship that exists between them.

7. Choose an appropriate evaluation method (questionnaire, written examination, “objective” test [MCQ or short-answer questions] or essay question, oral examination, direct observation, etc.) for measuring the students' attainment of a specific educational objective. Compare the alternatives in a specification table.

8. Define (in the form of an organizational diagram) the organization of an evaluation system suitable for your establishment, and list the stages involved, specifying:


(a) the most important educational decisions you have to take;

(b) the data to be collected to provide a basis for those decisions;

(c) the aims of the system and sub-systems in terms of decisions to be taken and the object of each decision (teachers, students, programmes).

9. Identify obstacles to and strategies for improvement of a system of evaluating students, teachers and programmes.

To change curricula or instructional methods without changing examinations would achieve nothing!

Changing the examination system without changing the curriculum had a much more profound impact upon the nature of learning than changing the curriculum without altering the examination system.

G.E. Miller1


1 International Medical Symposium No. 2. Rome. 23-26 March 1977.

What is evaluation?


An analysis of educational innovations all over the world confirms G. Miller's opinion. In this second chapter, therefore, you are invited to plan a system of evaluation that can be used as a basis for preparation and implementation of a programme. The process is already under way, for the formulation of specific educational objectives requires definition of criteria indicating the minimum level of performance expected from the student. Educational decisions have to be made frequently during preparation and implementation of a programme; and the main purpose of evaluation is in fact to provide a basis for value judgements that permit better educational decision-making.

First of all you must decide what you want to evaluate: students, teachers and/or programmes. In each case you must determine what important educational decisions you will be expected to make in your capacity as teacher or administrator, for the instruments and mechanisms of evaluation providing data for value judgements will be developed and used according to the type of decision required.

A general methodology of evaluation and corresponding techniques do exist. Some are simple; others very complex and costly in time and money. Here again you will make your choice according to criteria that will ensure an adequate level of security. As in every educational process, you will have to shape all the consequences of your decisions into a coherent and logical whole. You are therefore invited to read the next pages before doing the exercise on p. 2.09.

The person who sets the examination controls the programme.

Education by objectives is not possible unless examinations are constructed to measure attainment of those objectives.

The educational planning spiral


The evaluation process provides a basis for value judgements that permit better educational decision-making


Notice to all teachers

You are reminded that evaluation of education must begin with a clear and meaningful definition of its objectives, as derived from the priority health problems and the professional profile




Evaluation... of whom? of what?

· Students
· Teachers
· Programmes and courses

... in relation to what?

· In relation to educational objectives.

(They are the common denominator.)



Answer question 2 on p. 2.45.
Check your answer on p. 2.48.



Before starting to define the organization, stages or methods of an evaluation system suitable for the establishment in which you are teaching, it would be useful to state:

What important educational decisions* you think you and your colleagues will be taking over the next three years.

* Examples of educational decisions:

- to decide which students will be allowed to move up from the first to the second year

- or to decide to purchase an overhead projector rather than a blackboard

- or to decide to appoint Mr X full professor

- or to decide:


You and your colleagues will have to make value judgements as a basis for each decision. It will therefore be useful to plan the construction and use of “instruments of evaluation” that will enable you to collect the data needed for making those value judgements (see pp. 2.40 and 2.41).

Personal notes


Evaluation - a few assumptions1

1 Adapted from Downie, N.M. Fundamentals of Measurement: Techniques and Practices. New York, Oxford University Press, 1967.


Education is a process, the chief goal of which is to bring about changes in human behaviour.

The sorts of behavioural changes that the school attempts to bring about constitute its objectives.

Evaluation consists of finding out the extent to which each and every one of these objectives has been attained, and determining the quality of the teaching techniques used and of the teachers.

Assumptions underlying basic educational measurement and evaluation1


1 See footnote to page 2.11.

Human behaviour is so complex that it cannot be described or summarized in a single score.

The manner in which an individual organizes his behaviour patterns is an important aspect to be appraised. Information gathered as a result of measurement or evaluation activities must be interpreted as a part of the whole. Interpretation of small bits of behaviour as they stand alone is of little real meaning.

The techniques of measurement and evaluation are not limited to the usual paper-and-pencil tests. Any bit of valid evidence that helps a professor or counsellor to understand a student better, and that leads to helping the student understand himself better, is to be considered worthwhile.

Attempts should be made to obtain all such evidence by any means that seem to work.

The nature of the measurement and appraisal techniques used influences the type of learning that goes on in a classroom. If students are constantly evaluated on knowledge of subject-matter content, they will tend to study this alone. Professors will also concentrate their teaching efforts upon this. A wide range of evaluation activities covering various objectives of a course will lead to varied learning and teaching experiences within a course.

The development of any evaluation programme is the responsibility of the professors, the school administrators, and the students. Maximum value can be derived from the participation of all concerned.

The philosophy of evaluation1

1 See footnote to page 2.11.


1. Each individual should receive the education that most fully allows him to develop his potential.

2. Each individual should be so placed that he contributes to society and receives personal satisfaction in so doing.

3. Fullest development of the individual requires recognition of his essential individuality along with some rational appraisal by himself and others.

4. The judgements required in assessing an individual's potential are complex in their composition, difficult to make, and filled with error.

5. Such error can be reduced but never eliminated. Hence any evaluation can never be considered final.

6. Composite assessment by a group of individuals is much less likely to be in error than assessment made by a single person.

7. The efforts of a conscientious group of individuals to develop more reliable and valid appraisal methods lead to the clarification of the criteria for judgement and reduce the error and resulting wrongs.

8. Every form of appraisal will have critics, which is a spur to change and improvement.

The psychology of evaluation1

1 See footnote to page 2.11.


1. For evaluation activities to be most effective, they should consist of the best possible techniques, used in accordance with what we know to be the best and most effective psychological principles.

2. For many years readiness has been recognized as a very important prerequisite for learning. A student is ready when he understands and accepts the values and objectives involved.

3. It has long been known that people tend to carry on those activities which have success associated with their results. This has been known as Thorndike's Law of Effect. Students in any classroom soon come to realize that certain types of behaviour are associated with success - in this case, high marks on a test or grades in a course. Thus, if a certain teacher uses tests that demand rote memory, the students will become memorizers. If a test, on the other hand, requires students to apply principles, interpret data, or solve problems, the students will study with the idea of becoming best fitted to do well on these types of test items. In the long run, the type of evaluation device used determines, to a great extent, the type of learning activity in which students will engage in the classroom.

4. Early experiments in human learning showed that individuals learn better when they are “constantly” appraised in a meaningful manner as to how well they are doing.

5. The motivation of students is one of the most important - and sometimes the most difficult to handle - of all problems related to evaluation. It almost goes without saying that a person's performance on a test is directly related to his motivation. Research has shown that when a student is really motivated, his performance is much closer to his top performance than when motivation is lacking.

6. Learning is most efficient when there is activity on the part of the learner.


Try to answer question 3 on p. 2.45.
Check your answer on p. 2.48.

Evaluation is

a continuous process

based upon criteria

cooperatively developed

concerned with measurement of the performance of learners, the effectiveness of teachers and the quality of the programme1

1 This chapter is mainly concerned with the evaluation of students. Evaluation of programmes and teachers is dealt with in chapter 4.


Continuous evaluation: formative and certifying evaluation


You will find the following equivalents in the literature for these two expressions:

Formative evaluation = diagnostic evaluation
Certifying evaluation = summative evaluation

Evaluation of education must begin with a clear and meaningful definition of its objectives. We cannot measure something unless we have first defined what it is we wish to measure.

When this phase of evaluation (the definition of objectives) has been properly completed, the choice or development of suitable evaluation procedures is that much easier. Schematically represented, the educational planning spiral (p. 2.05) comprises the determination of objectives, the planning of an evaluation system, the development of teaching activities and the implementation of evaluation procedures with possible revision of objectives.

The role of evaluation should not be limited to one of penalization. It should not be just a series of only too frequent obstacles which the students are supposed to get over and which become their sole subject of concern, the actual instruction becoming quite secondary. Under these circumstances the student's only interest is how to obtain his diploma with least effort. It is the teacher's responsibility to convince the student that his education is directed towards wider aims than merely gaining a diploma and that helping him to do so is not the sole purpose of evaluation (see p. 2.18 and 2.19).

Evaluation should also be formative, providing the student with information on his progress. It must therefore be available to him continuously. This concept has often been misinterpreted, resulting in constant harassment of the student. There is a fundamental difference between formative and certifying evaluation; in both cases, however, the evaluation tools must have the same level of difficulty and discrimination (see pp. 4.77-4.81).

Strict Rule

Evaluation should in no way be used by the teacher against the student.

Formative evaluation1

1 Read the article by C. McGuire - Diagnostic examinations in medical education. In: Development of educational programmes for the health professions. Geneva, WHO, 1973 (Public Health Papers No. 52).


- is designed to inform the student about the amount he still has to learn before achieving his educational objectives;

- measures the progress or gains made by the student from the moment he begins a programme until the time he completes it;

- enables learning activities to be adjusted in accordance with progress made or lack of it; it is therefore a teaching method;

- is very useful in guiding the student in his own learning and prompting him to ask for help;

- is controlled in its use by the student (results should not appear in any official record);

- is carried out frequently - as often as the student feels necessary;

- should in no way be used by the teacher to make a certifying judgement; the anonymity of the student should be safeguarded by use of a code of his choice. A coding system makes it possible to follow the progress of individuals and groups while preserving anonymity;

- provides the teacher with qualitative and quantitative data for modification of his teaching (particularly contributory educational objectives) or otherwise.

Certifying evaluation

- is designed to protect society by preventing incompetent personnel from practising;

- is traditionally used for placing students in order of merit and justifying decisions as to whether they should move up to the next class or be awarded a diploma;

- is cumulative, and carried out less frequently than formative evaluation, but at least at the end of a unit or period of instruction.


Try to answer questions 4 - 8 on p. 2.45.
Check your answers on p. 2.48.

We don't care how hard the student tried, we don't care how close he got... until he can perform he must not be certified as being able to perform.

R.F. Mager

Continuous evaluation must pit the student against himself and his own lack of competence and not against other students.

Evaluation of what?


Elements needed for the construction of an evaluation system

Evaluation should be built into all phases of programme construction. The following elements should be taken into consideration: firstly, the context in which the programme is being prepared, then the various inputs to the programme and, finally, the educational process and the performance of the learners.

1. Planning the evaluation of situation analysis and the identification of priority health problems (context)

Evaluation of the context is concerned with the initial decisions of importance for the educational programme. It is linked to the situation analysis where all the information of importance for the programme is available. If the information available is not satisfactory, it may be necessary to collect further information in order to arrive at the right educational decisions. This may include analysis of factors in the learners' potential job environment, selection of various job descriptions and employers' opinions on the performance of earlier students in their jobs. The analysis made in chapter 1 could thus be part of a context evaluation. The “climate” that exists in relation to the programme, the content, the methods, and resources used in the programme are all contextual aspects of importance for the planning stage.

2. Planning the evaluation of the human and material resources to be used and the elements to be included in the programme (the inputs)

At all stages of the learning process there are educational decisions to be taken by teachers. It is therefore important to make sure that teachers are competent and comfortable with the teaching methodology to be used (e.g. problem-based education) and, if not, that they are given the training required; some kind of evaluation must also be planned to discourage teachers from putting students in a passive learning situation; and the programme itself must be subjected to careful scrutiny before it is actually implemented.

3. Planning the monitoring of implementation (the educational process)

An evaluation system must also plan how the implementation of the programme is to be monitored. This should detect the need for modification or replacement of any of the teaching/learning activities in the programme.

4. Planning the evaluation of learners (the output)

The central component of an evaluation system is the evaluation of the learners' performance. At this stage of planning, decisions must be made on the establishment of an evaluation committee, identification of persons to prepare instruments of evaluation, and the various administrative arrangements to be made for the evaluation of the learners' performance.

As this element is of paramount importance, we shall examine it next.

Student evaluation: what for?


The numbers on the left refer to the exercise on this page and the questions on p. 2.46.

9. Incentive to learn (motivation)
10. Feedback to student
11. Modification of learning activities
12. Selection of students
13. Success or failure
14. Feedback to teacher
15. School public relations
16. Protection of society (certification of competence)

Each of these aims calls for appropriate measuring techniques.

Now try it ... indicate for each of the aims of evaluation (numbered 9-16) whether the measurement technique will be of the certifying evaluation type (C) or both certifying and formative evaluation (CF). Check your answers on p. 2.48.

Aims of student evaluation1

1 Adapted from Downie, N.M. Fundamentals of Measurement: Techniques and Practices. New York, Oxford University Press, 1967.


1. To determine success or failure on the part of the student. This is the conventional role of examinations (certifying evaluation).

2. To provide “feedback” for the student: to keep him constantly informed about the instruction he is receiving; to tell him what level he has reached; and to make him aware through the examination of what parts of the course he has not understood (formative and certifying evaluation).

3. To provide “feedback” for the teacher: to inform him whether a group of students has not understood what he has been trying to explain. This enables him to modify his teaching where necessary to ensure that what he wishes to communicate to the students is correctly understood (formative and certifying evaluation).

4. The “reputation of the school” is a factor whose importance is not always evident, at least in European schools, whose reputation is often based not on an examination system but on long-standing traditions. North American schools, on the other hand, customarily publish the percentage of students who have passed, for example, national examinations (formative and certifying evaluation).

Why does an educational programme fail?


To begin instruction before a proper system of evaluation has been constructed is likely to be a waste of effort, time and other resources. All educational programmes will experience failures and problems at some time. Without proper evaluation of all its elements for formative purposes, you might have difficulty in understanding why the programme has failed. But one of the advantages of a system of continuous evaluation is that you will usually be able to prevent failures. Romiszowski (1984)1 has pointed out that “promising new instructional systems have been known to fail because no account has been taken of this simple principle (formative evaluation). Once the initial field-testing stage has come to a close, yielding excellent results, a project enters its final phase of regular, large-scale use and, slowly, a form of “drift” takes place, carrying it further and further away from the changing reality in which it was implanted. Thus, as in the case of an alien organ implanted without due care in a living organism, a rejection phase is reached and the new instructional system is eliminated, killed off by the “antibodies” in its environment. The way to avoid rejection of an implanted sub-system is to maintain a high level of compatibility between the new system and older, more established systems in its environment. As these are in constant change, the new system must also constantly adapt itself”.

1 See footnote to page 1.72.

Four steps in student evaluation


1. Ensure the quality of the criteria (acceptable level of performance) of the educational objectives.

2. Develop and use measuring instruments.

3. Interpret measurement data.

4. Formulate judgements and take appropriate action.

Common methodology for student evaluation1

1 See also Rezler, A.G. The assessment of attitudes. In: Development of educational programmes for the health professions. Geneva, WHO, 1973 (Public Health Papers No. 52), pp. 70-83.


Evaluation of practical skills
Evaluation of communication skills
Evaluation of knowledge and intellectual skills

1. Make a list of observable types of behaviour showing that the objective pursued has been reached.

2. Make a list of observable types of behaviour showing that the objective pursued has not been reached.

3. Determine the essential features of behaviour in both lists.

4. Assign a positive or negative weight to the items on both lists.

5. Decide on the acceptable performance score.

* For the last three stages obtain the agreement of several experts.

Example. Objective: Reassure the mother of a child admitted to hospital

Item observed: explains clearly what has been done to the child

Rating scale (from lowest to highest):

1. often uses medical terms and never explains what they mean
2. often uses medical terms and rarely explains what they mean
3. rarely uses medical terms and sometimes explains what they mean
4. rarely uses medical terms and always explains what they mean
5. uses only terms suited to the mother's vocabulary

etc. See the complete table on p. 4.32.

Minimum performance score: the student should score n marks out of 10 on the rating scale.
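The five-step method above can be sketched as a short scoring routine. This is only an illustration: the behaviour items, weights and pass mark below are hypothetical, and, as the text notes, the last three steps (selecting essential behaviours, weighting them, and fixing the acceptable performance score) require the agreement of several experts.

```python
# Illustrative sketch of the five-step scoring method.
# The items, weights and pass mark are hypothetical examples only.

# Steps 1-4: observable behaviours showing that the objective has
# (positive weight) or has not (negative weight) been reached.
ITEM_WEIGHTS = {
    "explains what has been done in the mother's vocabulary": +3,
    "answers the mother's questions": +2,
    "uses unexplained medical terms": -2,
    "interrupts the mother": -1,
}

# Step 5: acceptable performance score (here, a minimum of 6 marks).
PASS_MARK = 6

def score(observed_behaviours):
    """Sum the weights of the behaviours actually observed."""
    return sum(ITEM_WEIGHTS[b] for b in observed_behaviours)

def has_reached_objective(observed_behaviours):
    """Compare the weighted score with the agreed pass mark."""
    return score(observed_behaviours) >= PASS_MARK

observed = [
    "explains what has been done in the mother's vocabulary",
    "answers the mother's questions",
    "interrupts the mother",
]
print(score(observed))                  # prints 4 (3 + 2 - 1)
print(has_reached_objective(observed))  # prints False (4 < 6)
```

The same routine serves both formative and certifying use: for formative purposes the student sees which negatively weighted behaviours lowered his score; for certifying purposes only the comparison with the pass mark matters.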



Try to answer questions 17 - 20 on p. 2.46 and check your answers on p. 2.48.

Evaluation methodology according to domains to be evaluated




For each of the educational objectives you have already defined (pp. 1.68, 1.69), choose from among the methods of evaluation set out on p. 2.22 the one you think most suitable for informing you and the student on the extent to which the objective has been achieved.


Objective        Method of evaluation   Instrument of evaluation
1 (page 1.68)    Indirect method        Short, open-answer question based on the patient's record
2 (page 1.68)    Indirect method        
3 (page 1.68)    Direct observation     Practical examination


For the purposes of this exercise the total number of students to be considered should be fixed: e.g., 100, or any other number that is realistic in your situation.

Personal notes


General remarks concerning examinations


Analysis of the most commonly used tests shows that the questions set are sometimes, indeed often, ambiguous, unclear, disputable, esoteric or trivial. It is essential for anyone constructing an examination, whether of the traditional written type, an objective test or a practical test, to submit it to his colleagues for criticism, to make sure: that its content is relevant (related to an educational objective) and of general interest, not exclusively reflecting a special interest or taste of the author; that the subject is interesting and real for the general practitioner or for physicians with a specialty different from that of the author; and that the questions (and, in the case of multiple-choice questions, the answers) are so formulated that experts can agree on the correct response. A critical analysis along these lines would clearly avoid the oversimplification of many tests which only too often justifies the conclusion: “the more you know about a question the lower will be your score”.

The author of a test is not the best judge of its clarity, precision, relevance and interest. Critical review of the test by colleagues is consequently essential for its sound construction.

Moreover, an examination must take the factor of practicability into account. This will be governed by the time necessary for its construction and administration, scoring and interpretation of the results, and by its general ease of use.

If the examination methods employed become a burden on the teacher because of their impractical nature he will tend not to assign to the measuring instrument the importance it deserves.

A discussion is not always pertinent to the problem at hand, but one learns to allow for some rambling. It seems to help people realize that they normally use quite a few fuzzies during what they consider “technical discussions”; it helps them realize that they don't really know what they are talking about... a little rambling helps clear the air. Asking someone to define his goal in terms of performance is a little like asking someone to take his clothes off in public - if he hasn't done it before, he may need time to get used to the idea.

R.F. Mager

Qualities of a test


Directly related to educational objectives
Realistic and practical
Concerned with important and useful matters
Comprehensive but brief
Precise and clear

Judge the consequences of the student's not achieving the objective by answering such questions as: “If he cannot perform the objective when he leaves my instruction he is likely to.....”. The answer should help you decide how much energy to put into constructing a valid evaluation system to find out whether the objective is achieved as written.

R.F. Mager

Considerations of the type of competence a test purports to measure


No test format (objective, essay or oral) has a monopoly on the measurement of the higher and more complex intellectual processes. Studies of various types of tests support the view that the essay and the oral examination, as commonly employed, test predominantly simple recall and, like the objective tests in current use, rarely require the student to engage in reasoning and problem-solving. In short, the form of a question does not necessarily determine the nature of the intellectual process required to answer it.

Second, there is often a tendency to confuse the difficulty of a question with the complexity of the intellectual process measured by it. A question requiring simple recall may be very “difficult” because of the esoteric nature of the information demanded; conversely, a question requiring interpretation of data or application of principles may be quite “easy” because the principles of interpretation are so familiar and the data to be analysed so simple. In short, the difficulty of a question is not necessarily related to the complexity of the intellectual process being tested.

Third, there is often a strong inclination to assume that any question which includes data about a specific case necessarily involves problem-solving, whereas, in fact, “data” are often merely “window dressing” when the question is really addressed to a general condition and can be answered equally well without reference to the data. Or, the data furnished about a “specific case” may constitute a “cut-and-dried”, classical textbook picture that, for example, simply requires the student to recall symptoms associated with a specific diagnosis. It is interesting to note that questions of this type can readily be converted into problems that do require interpretation of data and evaluation, simply by making the case material conform more closely to the kind of reality that an actual case, rather than a textbook, presents.

In short, just as each patient in the ward or outpatient department represents a unique configuration of findings that must be analysed, a test that purports to measure the student's clinical judgement and his ability to solve clinical problems must simulate reality as closely as possible by presenting him with specific constellations of data that are in some respects unique and, in that sense, new to him. Do not try to use an MCQ or a SOAQ to find out whether the student is able to communicate orally with a patient!

However reliable or objective a test may be, it is of no value if it does not measure ability to perform the tasks expected of a health worker in his/her professional capacity.

Common defects of examinations (domain of intellectual skills)


A review of examinations currently in use strongly suggests that the most common defects of testing are:

- the triviality of the questions asked, which is all the more serious in that examination questions can only represent a small sample of all those that could be asked; consequently it is essential for each question to be important and useful;

- outright error in phrasing the question (or, in the case of multiple-choice questions, in phrasing the distractors and the correct response);

- ambiguity in the use of language, which may lead the student to spend more time in trying to understand the question than in answering it, in addition to the risk of his giving an irrelevant answer;

- forcing the student to answer in terms of the outmoded ideas of the examiner, a bias which is well known and often aggravated by the teaching methods themselves (particularly traditional lectures);

- requesting the student to answer in terms of the personal preferences of the examiner when several equally correct options are available;

- complexity or ambiguity of the subject matter taught, so that the search for the correct answer is more difficult than was anticipated;

- unintended cues in the formulation of the questions that make the correct answer obvious; this fault, which is often found in multiple-choice questions, is just as frequent in oral examinations.

Outside factors to be avoided


In constructing an examination, outside factors must not be allowed to interfere with the factor to be measured.

Complicated instructions (ability to understand instructions)

In some tests, the instructions for students on how to solve the problems are so complicated that what is really evaluated is the students' aptitude to understand the question rather than their actual knowledge and ability to use it. This criticism is often made of multiple-choice examinations in which the instructions appear too complicated. The complexity is often more apparent than real and disturbs the teacher rather than the student.

Over-elaborate style (ability to use words)

The student may disguise his lack of knowledge in such elegant prose that he succeeds in influencing the corrector, who judges the words and style rather than the student's knowledge.

Personal attraction (ability to charm the examiner)

This type of interference does not depend on the measuring instrument but on the examiner who, during an examination, may allow himself to be influenced by the candidate's appearance, sex, etc. Some candidates are more or less skilled at playing on these tendencies.

Trap questions (ability to avoid traps)

This is a criticism generally made of multiple-choice examinations, although it may in fact be applied to other forms of evaluation; trap questions may even reflect sadistic tendencies on the part of the examiner. In oral and written examinations, students develop a sixth sense, often based on statistical analysis of past questions, which enables them somehow to predict the questions that will be set.

Comparison of advantages and disadvantages of different types of test


Oral examinations



Advantages

1. Provide direct personal contact with candidates.
2. Provide opportunity to take mitigating circumstances into account.
3. Provide flexibility in moving from candidate's strong points to weak areas.
4. Require the candidate to formulate his own replies without cues.
5. Provide opportunity to question the candidate about how he arrived at an answer.
6. Provide opportunity for simultaneous assessment by two examiners.

Disadvantages

1. Lack standardization.
2. Lack objectivity and reproducibility of results.
3. Permit favouritism and possible abuse of the personal contact.
4. Suffer from undue influence of irrelevant factors.
5. Suffer from shortage of trained examiners to administer the examination.
6. Are excessively costly in terms of professional time in relation to the limited value of the information yielded.

Unfortunately, these advantages are rarely exploited in practice.

Practical examinations, projects



Advantages

1. Provide opportunity to test in a realistic setting skills involving all the senses while the examiner observes and checks performance.
2. Provide opportunity to confront the candidate with problems he has not met before both in the laboratory and at the bedside, to test his investigative ability as opposed to his ability to apply ready-made “recipes”.
3. Provide opportunity to observe and test attitudes and responsiveness to a complex situation (videotape recording).
4. Provide opportunity to test the ability to communicate under pressure, to discriminate between important and trivial issues, to arrange the data in a final form.

Disadvantages

1. Lack standardized conditions in laboratory experiments using animals, in surveys in the community or in bedside examinations with patients of varying degrees of cooperativeness1.
2. Lack objectivity and suffer from intrusion of irrelevant factors.
3. Are of limited feasibility for large groups.
4. Entail difficulties in arranging for examiners to observe candidates demonstrating the skills to be tested.

Essay examinations



Advantages

1. Provide the candidate with an opportunity to demonstrate his knowledge and his ability to organize ideas and express them effectively.

Disadvantages

1. Limit severely the area of the student's total work that can be sampled.
2. Lack objectivity.
3. Provide little useful feedback.
4. Take a long time to score.

Multiple-choice questions



Advantages

1. Ensure objectivity, reliability and validity; preparation of questions with colleagues provides constructive criticism.
2. Increase significantly the range and variety of facts that can be sampled in a given time.
3. Provide precise and unambiguous measurement of the higher intellectual processes.
4. Provide detailed feedback for both student and teachers.
5. Are easy and rapid to score.

Disadvantages

1. Take a long time to construct in order to avoid arbitrary and ambiguous questions.
2. Also require careful preparation to avoid a preponderance of questions testing only recall.
3. Provide cues that do not exist in practice.
4. Are “costly” where the number of students is small.

1 Standardized practical tests can be constructed; see McGuire, C.H. & Wezeman, F.H. Simulation in instruction and evaluation in medicine. In: Miller, G.E. & Fülöp, T., eds., Educational strategies for the health professions. Geneva, WHO, 1974 (Public Health Papers No. 61).

It is a highly questionable practice to label someone as having achieved a goal when you don't even know what you would take as evidence of achievement.

R.F. Mager

Personal notes


Evaluation in education: qualities of a measuring instrument


1. Some definitions

1.1 Education is defined as a process developed for bringing about changes in the student's behaviour. At the end of a given learning period there should be a greater probability that types of behaviour regarded as desirable will appear; other types of behaviour regarded as undesirable should disappear.

1.2 The educational objectives define the desired types of behaviour taken as a whole; the teacher should provide a suitable environment for the student's acquisition of them.

1.3 Evaluation in education is a systematic process which enables the extent to which the student has attained the educational objective to be measured. Evaluation always includes measurements (quantitative or qualitative) plus a value judgement.

1.4 To make measurements, measuring instruments must be available which satisfy certain requirements so that the results mean something to the teacher himself, the school, the student and society which, in the last analysis, has set up the educational structure.

1.5 In education, measuring instruments are generally referred to as “tests”.

2. Qualities of a measuring instrument

Among the qualities of a test, whatever its nature, four are essential, namely, validity, reliability, objectivity and practicability. Others are also important, but they contribute in some degree to the qualities of validity and reliability.

2.1 Validity: the extent to which the test used really measures what it is intended to measure. No outside factors should be allowed to interfere with the manner in which the evaluation is carried out. For instance, in measuring the ability to synthesize, other factors such as style should not compete with the element to be measured so that what is finally measured is style rather than the ability to synthesize.

The notion of validity is a very relative one. It implies a concept of degree, i.e., one may speak of very valid, moderately valid or not very valid results.

The concept of validity is always specific for a particular subject. For example, results of a test on public health administration may be of very high validity for identification of the needs of the country and of little validity for a cost/benefit or cost/efficiency analysis.

Content validity is determined by the following question: will this test measure, or has it measured, the matter and the behaviour that it is intended to measure?

Predictive validity is determined by questions such as the following when the results of a test are to be used for predicting the performance of a student in another domain or in another situation:

To what extent do the results obtained in physiology help to predict performance in pathology?

To what extent do the results obtained during the pre-clinical years help in predicting the success of students during the clinical years?

2.2 Reliability: this is the consistency with which an instrument measures a given variable.

Reliability is always connected with a particular type of consistency: consistency of results over time; consistency of results across questions; consistency of results across examiners.

Reliability is a necessary but not a sufficient condition for validity. In other words, valid results are necessarily reliable, but reliable results are not necessarily valid. Consequently, results that are not very reliable affect the degree of validity. Unlike validity, reliability is a strictly statistical concept and is expressed by means of a reliability coefficient or through the standard error of the measurements made.

Reliability can therefore be defined as the degree of confidence that can be placed in the results of an examination. It is the consistency with which a test gives the results expected.
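As an illustration of how such a reliability coefficient can be obtained (the specific formula below is a common textbook choice, not one prescribed by this handbook), the Kuder-Richardson formula 20 (KR-20) estimates the reliability of a test whose questions are scored right/wrong:

```python
# Sketch of the Kuder-Richardson formula 20 (KR-20), one widely used
# reliability coefficient for tests scored dichotomously (1 = correct,
# 0 = incorrect):
#   KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance)
# where k is the number of questions, p_i the proportion of students
# answering question i correctly, q_i = 1 - p_i, and "variance" is the
# variance of the students' total scores.

def kr20(scores):
    """scores: one list of 0/1 question scores per student."""
    n_students = len(scores)
    k = len(scores[0])                      # number of questions
    totals = [sum(row) for row in scores]   # total score per student
    mean = sum(totals) / n_students
    variance = sum((t - mean) ** 2 for t in totals) / n_students
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n_students
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / variance)

# Four students, three questions (an invented example):
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(scores))  # → 0.75
```

A coefficient near 1 indicates highly consistent results; a low coefficient warns that little confidence can be placed in the scores.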

2.3 Objectivity: this is the extent to which several independent and competent examiners agree on what constitutes an acceptable level of performance.

2.4 Practicability depends upon the time required to construct an examination, to administer and score it, and to interpret the results, and on its overall simplicity of use. It should never take precedence over the validity of the test.

3. Other qualities of a measuring instrument

3.1 Relevance: this is the degree to which the criteria established for selecting questions (items), so that they conform to the aims of the measuring instrument, are respected. This notion is almost identical to that of content validity; the two qualities are established in a similar manner.

3.2 Equilibrium: achievement of the correct proportion among questions allocated to each of the objectives.

3.3 Equity: extent to which the questions set in the examination correspond to the teaching content.

3.4 Specificity: quality of a measuring instrument whereby an intelligent student who has not followed the teaching on the basis of which the instrument has been constructed will obtain a result equivalent to that expected by pure chance.

3.5 Discrimination: quality of each element of a measuring instrument which makes it possible to distinguish between good and poor students in relation to a given variable.

3.6 Efficiency: quality of a measuring instrument which ensures the greatest possible number of independent answers per unit of time.1

1 This definition of efficiency has a narrower meaning than the one given in the glossary (p. 6.05); it applies only to evaluation instruments (pp. 2.33 - 2.37).

3.7 Time: it is well known that a measuring instrument will be less reliable if it leads to the introduction of non-relevant factors (guessing, taking risks or chances, etc.) because the time allowed is too short.

3.8 Length: the reliability of a measuring instrument can be increased almost indefinitely (Spearman-Brown formula) by the addition of new questions equivalent to those constituting the original instrument.
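The Spearman-Brown formula mentioned in 3.8 predicts the reliability of a lengthened test from the reliability of the current one. A minimal sketch:

```python
# Spearman-Brown prophecy formula: predicted reliability r_k of a test
# whose length is multiplied by k through the addition of questions
# equivalent to the originals, given the current reliability r.
#   r_k = k * r / (1 + (k - 1) * r)
def spearman_brown(r, k):
    return (k * r) / (1 + (k - 1) * r)

# Doubling a test of reliability 0.60:
print(spearman_brown(0.60, 2))   # → 0.75
# Reliability approaches (but never reaches) 1 as the test grows:
print(spearman_brown(0.60, 10))  # → 0.9375
```

This shows why "almost indefinitely" is the right qualifier: each doubling yields a smaller gain, and the coefficient only tends towards 1.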


Validity: the extent to which the instrument really measures what it is intended to measure.


Reliability: the consistency with which an instrument measures a given variable.


Objectivity: the extent to which several independent and competent examiners agree on what constitutes an acceptable level of performance.


Practicability: the overall simplicity of use of a test, both for the test constructor and for students.


Relationships between the characteristics of an examination


The diagram on the next page, suggested by G. Cormier, represents an attempt to sum up the concepts of testing worked out by a number of authors. However, no diagram can give a perfect representation of reality and the purpose of the following lines is to explain rather than justify the diagram.

A very good treatment of all these concepts will be found in the book by Robert Ebel entitled Measuring Educational Achievement (Prentice Hall, 1965).

Validity and reliability

Ebel shows that “to be valid a measuring instrument (test) must be both relevant and reliable.” This assertion justifies the initial dichotomy of the diagram. It is, moreover, generally agreed that “a test can often, if not always, be made more valid if its reliability is increased.”

Validity and relevance

According to Ebel's comments, it seems that the concept of relevance corresponds more or less to that of validity of content. In any case, both are established in a similar manner (by consensus).

By definition, a question is relevant if it adds to the validity of the instrument, and an instrument is relevant if it respects the specifications (objectives and taxonomic levels) established during its preparation.

Relevance and equilibrium

It seems, moreover, that the concept of equilibrium is only a sub-category of the concept of relevance and that is why the diagram shows it as such.

Relevance and equity

It seems evident that if the instrument is constructed on the basis of a content itself determined by objectives, then it will be relevant by definition. If this is not done, then the instrument will not be relevant and consequently not valid. It is equitable in the first case and non-equitable in the second. However, an examination can be equitable without being relevant (or valid) when, although it corresponds well to the teaching content, the latter is not adequately derived from the objectives.

Equity, specificity and reliability

The diagram reflects the following implicit relationship: a test cannot be equitable if it is not first specific. Specificity, just like equity and for similar reasons, will affect the reliability of the results.

Reliability, discrimination, length, homogeneity (of questions) and heterogeneity (of students)

According to Ebel, reliability is influenced by the extent to which the questions (items) clearly distinguish competent from incompetent students, the number of items, the similarity of the items as regards their power to measure a given skill and the extent to which students are dissimilar with respect to that skill. The discriminating power of a question is directly influenced (see pages 4.73 - 4.75) by its level of difficulty. The mean discrimination index of an instrument will also be affected by the homogeneity of the questions and the heterogeneity of the students. From the comments made above it can be seen how equity and specificity will also influence the discriminating power of the instrument.
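The discrimination index referred to above is commonly computed by comparing an upper and a lower group of students ranked on total score (the upper/lower-group convention below is one common choice, assumed here rather than taken from this handbook):

```python
# One common convention for the discrimination index D of a question:
#   D = (U - L) / n
# where U and L are the numbers of correct answers in the upper and
# lower scoring groups (often the top and bottom 27% of students) and
# n is the size of one group. D ranges from -1 to +1; a question with
# D near 0 (or negative) fails to distinguish good from poor students.
def discrimination_index(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

# 10 students in each group: 8 of the strongest answered correctly,
# but only 3 of the weakest did.
print(discrimination_index(8, 3, 10))  # → 0.5
```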


Try to answer questions 22 - 25 on p. 2.47 and check your answers on p. 2.48.

Relationships between characteristics of an examination1

1 As proposed by G. Cormier, Université Laval, Quebec.

N.B. Additional relationships to those suggested in this diagram can be established. The number of links has been kept to a minimum for the sake of clarity and to give a basic idea of the concept as a whole.




For each of the educational objectives you defined on page 1.68, describe two methods of evaluation that seem suitable to you for informing yourself and the student on the extent to which that objective has been achieved. Compare the two methods on the basis of the three criteria shown in the table below.

Examples of methods of evaluation for a class of 200 students


Make a differential diagnosis of anaemia based on the detailed haematological picture described in the patient's medical record.





Modified essay question. A series of 10 short questions based on patient's record as supplied to student (1 hour).





Student given patient's record (10 mins.) followed by 15 min. oral examination.




Methods of evaluation for a class of... students1

1 Choose a number of students that is realistic in your situation.


The student should be able to:






Check the meaning of the words validity, objectivity and practicability in the glossary, page 6.01.

For evaluation the essential quality is validity

but don't forget that for an educational system considered as a whole it is its relevance that is of primary importance


Evaluation is a matter for teamwork


The planning of an evaluation system is obviously not simple. It is a serious matter, for the quality of health care will partly depend on it. It has been stressed many times, moreover, that it should be a group activity. We have stated in the preceding pages that evaluation must be planned jointly; that implementation of any evaluation programme is the responsibility of the teachers, in collaboration with students and the administration; that evaluation carried out jointly by a group of teachers is less likely to be erroneous than when carried out by one person; and, finally, that critical analysis of a test by colleagues is essential to its sound construction.

This work performed jointly by a group of teachers calls for a coordinating mechanism. The terms of reference of each group and group member must be defined explicitly and known to all. The institution's higher authorities must provide the working groups and their members with the powers corresponding to the task to be accomplished.

The diagram on page 2.44 shows one type of organization that meets the needs of a given institution. Other types of organization can be envisaged, according to existing structures and local traditions. Now construct the type of organization that will be needed by your institution.

It will obviously be best if you can discuss the following exercises (pp. 2.41 and 2.42) with some of your colleagues. To help you to complete these exercises, take them in the following stages:

1. Read the instructions on objectives 8 and 9 on page 2.02. Then study pages 2.02 to 2.16. If you are taking part in a training workshop, ask the facilitator for some explanations, if necessary.

2. Do the exercise on page 2.09:

- if possible, exchange your proposals with some of your colleagues.

3. For each decision, select the most appropriate “means” of obtaining the information you need to make your decision:

- make a list of these “means” (page 2.41);

- if possible, exchange your proposals with some of your colleagues;

- if you are taking part in a workshop, draw up a joint list of “means”.

4. Specify the type of human resources needed to prepare and use these means:

- read pages 2.17 to 2.19;

- if possible, exchange your proposals with some of your colleagues;

- draw a diagram (page 2.42) which includes the terms of reference for each component element (who does what);

- do the exercise on page 2.43, on the basis of your diagram.

5. If possible, discuss your diagram with a few colleagues to make sure it has every chance of being suitable for use and used in your institution.



1. Draw up a list of the “means” which you think should be included in an evaluation system.

2. Show which of these are in practice already included in the evaluation of the educational programme in which you are involved.

Evaluation system

List of elements

Elements included



Do not change anything that works satisfactorily... but what is satisfactory to some is not necessarily good enough for others. Teaching is a matter for teamwork.



Taking the list of means you have drawn up on the previous page for an evaluation system, draw a diagram to show the type of organization (commissions, committees, boards, etc., with a description of their functions) which would seem desirable (in the establishment where you are working) for introducing (or improving) an evaluation system capable of providing the data needed to assure you that the training establishment in which you are working is functioning efficiently.


Compare your diagram with the diagram on page 2.44.



Describe the obstacles you are liable to encounter in applying the organizational plan you have imagined on the previous page and indicate tactics for overcoming each of these obstacles.



Organizational diagram showing relationships between curriculum committee, evaluation committee and teaching units




(Check your answers on p. 2.48.)

Question 1

The main role of evaluation is: ________________

Question 2

The purpose of evaluation is to make a value judgement concerning:

A. Students and programmes.
B. Students and teachers.
C. Programmes and teachers.
D. Students.
E. None of the above.

Question 3

Thorndike's “Law of Effect” is based on the fact that:

A. Students learn better when they are motivated.
B. Students learn better when they play an active role.
C. Students are receptive when they understand the educational objectives which have been defined.
D. Students tend to engage in activities which have success associated with their results.
E. Students work better if the teacher makes an impression on them.

Questions 4 to 8

Indicate to which of the following each question refers:

A. Formative evaluation
B. Certifying evaluation
C. Both
D. Neither

Question 4

Its main aim is to inform the student on his/her progress.

Question 5

Does not preserve anonymity.

Question 6

Enables the teacher to decide to replace one programme by another.

Question 7

Justifies the decision to let a student move up from the second to the third year.

Question 8

Permits rank-ordering of students.

Questions 9 to 16

For each of the aims of student evaluation (list numbered 9 to 16, p. 2.18), indicate whether the appropriate measuring instrument will be of the certifying evaluation type (C) or both certifying and formative evaluation (CF).

Question 17

The four steps of the process of student evaluation are as follows:

1. ___________________________________

2. ___________________________________

3. ___________________________________

4. ___________________________________

Question 18

All the following steps except one are essential in constructing any measuring instrument.

A. Precise definition of all aspects of the type of competence to be measured.

B. Obtaining reliability and validity indices for the proposed instrument.

C. Making sure that the type of instrument chosen corresponds to the type of competence to be measured.

D. Making sure, by an explicit description of the acceptable level of performance, that the use of the measuring instrument will ensure objectivity.

E. Determination of the particular behaviour expected from individuals who have or have not acquired the specified competence.

Question 19

When evaluating communication skills (domain of interpersonal relationships), all the following steps should be taken except one:

A. Describe specific types of behaviour showing a given affective level.

B. Describe explicit types of behaviour showing the absence of a given affective level.

C. Observe students in real situations enabling them to manifest the types of behaviour envisaged.

D. Obtain the agreement of a group of experts on the relationship between explicit types of behaviour and the affective level envisaged.

E. Obtain the students' opinions on the way in which they would behave in specific situations.

Question 20

The essential variable to be considered in evaluating the results of teaching is:

A. The student's performance.
B. The opinion of the teacher and his colleagues.
C. The opinion of the student regarding his performance.
D. The satisfaction of the teacher and the students.
E. The teacher's performance.

Question 21

Which of the following is not suitable for measurement by written examinations of the “objective” type:

A. Ability to recall precise facts.
B. Ability to solve problems.
C. Ability to make decisions.
D. Ability to communicate with the patient.
E. Ability to interpret data.

Questions 22 and 23

If the following qualities can be attributed to an examination:

A = Validity

B = Objectivity

C = Reliability

D = Specificity

E = Relevance

Question 22

What quality is obtained if a group of experts agree on what constitute good answers to a test?

Question 23

What quality implies that a test consistently measures the same thing?

Question 24

The following factors, except one, generally affect the reliability of a test:

A. Its objectivity.
B. The mean discrimination index of the test questions.
C. The homogeneity of the test.
D. The relevance of the test questions.
E. The number of questions in the test.

Question 25

Which of the following test criteria is influenced by all the others?

A. Reliability.
B. Validity.
C. Objectivity.
D. Specificity.
E. Relevance.

Suggested answers for the exercise on pages 2.45 - 2.47.

If you did not find the correct answer, consult the following pages again:

2.02, 2.06
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.15 to 2.19
2.07, 2.21, 2.35
2.08, 2.21
2.27, 2.30, 2.31
2.27, 2.30, 2.31
2.33 to 2.35
2.36, 2.37
2.36, 2.37

Personal notes