The Problem of Assessment

Even directly observed work-place assessment is not completely valid. The fact of being observed can make a difference to performance, for example. Anyone whose teaching has been observed can testify to this—even if you do nothing differently, the students may be inhibited, or "play up", or try to be "good". And the further one goes from the practical situation, the more problematic assessment becomes. 

Let us assume that out of a given group of students, it is possible to tell who is competent in a given area and who is not. This is an entirely idealistic construction, of course, because whatever basis we construct for making the judgement, it will itself be a form of assessment. (Note, too, that "competence" as used here, is merely a general term covering knowledge as well as practice, and has no necessary implications of a competence-based assessment scheme.)

In practice, when setting an examination in a "hard" subject (as opposed to a "soft" one; the STEM disciplines, for example, are hard in this sense, and indeed in others, too), we decide in advance how difficult to make it, often with a view to how many "ought" to pass: that is, norm-referencing. When we set a coursework essay, of course, we often make up our minds how strictly to mark only when we sit down actually to do it. It has been argued that such a norm-referenced approach underlies "grade inflation" in school examinations in the UK, and the discrediting of GCSE and A level qualifications.

An idealistic construction of this kind, in which we simply know who is competent and who is not, is known as a "gold standard". In the case in the diagram, about 80% are competent (or "deserve to pass") and about 20% aren't.

[Diagram: the group of students, about 80% competent and about 20% not]

Let us further assume that we have an assessment scheme which is highly valid. It comes to the "right" answer about 80% of the time. Again, this is idealistic, because we rarely have a clue as to the quantifiable validity of such a scheme.

[Diagram: an assessment scheme that comes to the "right" answer about 80% of the time]

What happens when we use this scheme with our group of students?

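We end up with true positives (competent students who pass: 80% of the 80%, or 64% of the group) and true negatives (students who are not competent and who fail: 80% of the 20%, or 16%),

but also with false negatives (competent students who fail: 20% of the 80%, or 16%) and false positives (students who are not competent but who pass: 20% of the 20%, or 4%).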
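
The four possible outcomes (true and false positives, true and false negatives) can be tallied with a little arithmetic, sketched here in Python. It is only an illustration under the assumptions already made: 80% of the group are competent, and the scheme comes to the "right" answer 80% of the time. It also assumes, purely to keep the example simple, that the scheme is equally accurate for competent and non-competent students. The name `outcome_shares` is invented for the example.

```python
from fractions import Fraction


def outcome_shares(competent_share, accuracy):
    """Split a group into the four possible assessment outcomes.

    Assumes the scheme is equally accurate for competent and
    non-competent students (a simplification for illustration).
    """
    not_competent_share = 1 - competent_share
    return {
        # competent and passed
        "true_positive": competent_share * accuracy,
        # competent but failed
        "false_negative": competent_share * (1 - accuracy),
        # not competent and failed
        "true_negative": not_competent_share * accuracy,
        # not competent but passed
        "false_positive": not_competent_share * (1 - accuracy),
    }


# Figures from the text: 80% competent, scheme "right" 80% of the time.
shares = outcome_shares(Fraction(80, 100), Fraction(80, 100))
for name, share in shares.items():
    print(f"{name}: {float(share):.0%}")
```

On these figures the correct judgements (64% and 16%) add up to the scheme's 80%, and the incorrect ones (16% and 4%) to the remaining 20%.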

False positives are also known as "Type I" errors (finding something when it is not really there), in contrast to "Type II" errors (failing to find something which is there). The most extensive literature on this comes from the field of medical testing and diagnosis; there is plenty of formal explanation on the web, but it can be very confusing.

This is both unfair, and a potentially serious technical problem. I would not be happy if I felt that my doctor, or the pilot flying my airliner, had qualified as a False Positive! 

There is of course a solution: raise the "pass" threshold. Unfortunately, it's wrong. Whichever way you look at it, the validity of the scheme remains at 80%: you haven't made it any higher. And because the number of inaccurate judgements was a relatively small proportion to begin with, a large shift in the proportion of overall passes and fails has only a tiny effect on the "false" outcomes. And raising the threshold in practice often means taking into account factors which are not strictly relevant or valid, such as the quality of presentation of assignments, or whether the views expressed accord with those of the assessor: there is a high probability that this may operate selectively to the disadvantage of "non-standard" students, and violate equal opportunities.

See here for more on equal opportunities (or "diversity and inclusivity") issues

What's the answer?

There isn't a complete one. We have to live with it, and make strenuous efforts to improve validity. In particular, do not rely on a single assessment exercise: use a range of approaches emphasising different components of the taught material. Bloom's taxonomy is a useful tool to help you analyse your taught material and to match up potential assessment approaches. You can then undertake a more detailed evaluation of the strengths and weaknesses of your repertoire of techniques, and uncover where your errors are tending to occur.
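
As a rough illustration of why a range of assessments can help, the sketch below works out how often a combined judgement would be right. It is a sketch only, and rests on two assumptions that do not come from the argument above: that each assessment is independently "right" 80% of the time, and that the final judgement simply goes with the majority. The name `majority_accuracy` is invented for the example.

```python
from math import comb


def majority_accuracy(n_assessments, accuracy):
    """Chance that a majority of independent assessments is right.

    Assumes every assessment has the same accuracy and that their
    errors are unrelated (both are idealisations).
    """
    needed = n_assessments // 2 + 1  # smallest number that forms a majority
    return sum(
        comb(n_assessments, k)
        * accuracy**k
        * (1 - accuracy) ** (n_assessments - k)
        for k in range(needed, n_assessments + 1)
    )


# Use odd numbers of assessments so that there is always a majority.
for n in (1, 3, 5):
    print(f"{n} assessment(s): {majority_accuracy(n, 0.8):.1%}")
```

On these assumptions three assessments are right about 90% of the time and five about 94%. Real assessments tend to share some of their biases, so the gain in practice is likely to be smaller.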

The business of false negatives and false positives goes much further than this, of course. It is at the root of Bayes' (or Laplace's) theorem about making predictions on the basis of imperfect knowledge, and the uncertainties of, for example, diagnostic tests in medicine as well as educational assessment.
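
To make the connection with Bayes' theorem concrete, here is a minimal sketch in Python using the illustrative figures from above (80% of the group competent, scheme "right" 80% of the time, and, as a simplifying assumption, equally accurate for both groups). The function name `posterior` and the variable names are invented for the example; the figures are not real data.

```python
from fractions import Fraction

# Assumed figures from the example in the text (not real data).
p_competent = Fraction(80, 100)  # prior: share of the group who are competent
accuracy = Fraction(80, 100)     # scheme gives the "right" answer this often


def posterior(prior, p_result_if_true, p_result_if_false):
    """Bayes' theorem: probability of a hypothesis given a result."""
    evidence = prior * p_result_if_true + (1 - prior) * p_result_if_false
    return prior * p_result_if_true / evidence


# P(competent | pass): a competent student passes with probability
# `accuracy`; a non-competent one passes with probability 1 - accuracy.
p_competent_given_pass = posterior(p_competent, accuracy, 1 - accuracy)

# P(not competent | fail): the same formula with the roles swapped.
p_not_competent_given_fail = posterior(1 - p_competent, accuracy, 1 - accuracy)

print(f"P(competent | pass)     = {float(p_competent_given_pass):.1%}")
print(f"P(not competent | fail) = {float(p_not_competent_given_fail):.1%}")
```

On these assumed figures, about 94% of those who pass really are competent, but only half of those who fail really are not.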


Forms of Assessment

Revised and up-dated 18 March 2013,
with thanks to Chris Copley for pointing out errors in the earlier version

To reference this page copy and paste the text below:

Atherton J S (2013) Learning and Teaching; [On-line: UK] retrieved from

Original material by James Atherton: last up-dated overall 10 February 2013

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License.

This site is independent and self-funded, although the contribution of the Higher Education Academy to its development via the award of a National Teaching Fellowship, in 2004 has been greatly appreciated. The site does not accept advertising or sponsorship (apart from what I am lumbered with on the reports from the site Search facility above), and invitations/proposals/demands will be ignored, as will SEO spam. I am of course not responsible for the content of any external links; any endorsement is on the basis only of my quixotic judgement. Suggestions for new pages and corrections of errors or reasonable disagreements are of course always welcome. I am not on FaceBook or LinkedIn.
