"Closer to buying car insurance than taking an exam." Credit: Guy Smallman/Getty

August 18, 2020

Of all the surprises 2020 has thrown at us, I certainly didn’t expect to see teenagers with placards taking to the streets to shout “Fuck The Algorithm!” For me, an extra twist of irony was that they appeared to be outside the building where I took some of my Statistics exams as an Open University student, sitting at a tiny desk for three hours with pencils and calculator, wishing I had done more work before it was too late.

Exams are a blunt instrument. They assess performance on the day, not ability. But when they were cancelled, they left a gaping hole in an education system that depends on the grades they spit out. If only we had an oracle that could see into the mind of each student and judge them: a statistical model, objective, fair, and well-fed on data from every student in the country. So that’s what Ofqual built.

Then, around 40% of the A-level grades awarded by the algorithm fell below the teacher predictions for the student in that subject. Cue teenage demonstrations and widespread political recrimination. But, contrary to what students may have expected, those teacher predictions were never the starting point for the awarded grades. In many cases, they didn’t even form part of the calculation.

Instead, the system was designed to give an overall distribution of grades that looked similar to previous years, with similar numbers of A*, A, and all the other grades for each subject — though they did allow more A and A* grades than usual. Ofqual even went so far as to check that the proportion of grades handed out to different subpopulations (by gender, ethnicity and deprivation, for example) would look similar to recent years. If your definition of fairness is that boys, or claimants of Free School Meals, won’t do demonstrably worse than last year, you should be happy.

Ofqual’s Direct Centre Performance model is based on the record of each centre (school or college) in the subject being assessed. Whatever the range and distribution of grades achieved by previous students over the last three years, that is the range of grades allocated to the class who would have taken A-levels in 2020. There was some adjustment, if your class has shown better (or worse) performance than its predecessors in GCSEs or other previous assessments, or if other changes would leave the national distribution of grades looking too different from previous years.

Meanwhile, each teacher was asked to rank each class from highest to lowest in expected achievement. That ranking, not predicted grades, was used to slot each student into the predetermined range of grades from A* to U. The exception to this was small groups, fewer than 15 in most cases, where Ofqual did resort to teachers’ predicted grades. That is why less popular subjects and smaller schools have seen less marking-down from expected grades.
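To make that slotting step concrete, here is a minimal sketch in Python. It is an illustration of the idea as described above, not Ofqual’s actual code: the function name `allocate_grades`, the student labels and the example shares are all invented for the purpose, and the adjustments for prior attainment, small groups and the national distribution are left out.

```python
from typing import Dict, List

# Grades from best to worst, as in the A-level scale.
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]


def allocate_grades(ranked_students: List[str],
                    historical_share: Dict[str, float]) -> Dict[str, str]:
    """Slot students (ranked best first) into grades.

    historical_share is a hypothetical input: the fraction of a centre's
    past students who achieved each grade in this subject.
    """
    n = len(ranked_students)
    result: Dict[str, str] = {}
    cumulative = 0.0
    idx = 0
    for grade in GRADES:
        cumulative += historical_share.get(grade, 0.0)
        # Number of students who should hold this grade or better.
        cutoff = round(cumulative * n)
        while idx < cutoff and idx < n:
            result[ranked_students[idx]] = grade
            idx += 1
    # Anyone left over because of rounding receives the lowest grade.
    for student in ranked_students[idx:]:
        result[student] = GRADES[-1]
    return result


# A class of ten at a centre whose past students got 20% A, 50% B, 30% C.
ranking = [f"student_{i}" for i in range(1, 11)]
grades = allocate_grades(ranking, {"A": 0.2, "B": 0.5, "C": 0.3})
```

Nothing about an individual student enters the function except their position in the ranking. In this invented class nobody can receive an A*, however able, because no predecessor at the centre got one.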

Any one individual’s achievements so far, or their potential in the view of teachers who know them, had less influence on their eventual results than the attainments of others who attended the same school in past years. It’s closer to buying car insurance than taking an exam for which you have worked for nearly two years. Just enter postcode, make and model and we will predict your likelihood of making a claim, and hence your premium.

This wasn’t inevitable. “Any statistical algorithm embeds a range of judgments and choices; it is not simply a technically obvious and neutral procedure,” wrote the Royal Statistical Society (RSS) in a scathing statement published on 6 August. “Calibrating this year’s estimated grades to previous years’ exam results is one such choice. How to take account of evidence of individual students’ prior attainment is another. How to take account of uncertainty is another.”

I asked Professor Guy Nason, Chair in Statistics at Imperial College London, and fellow of the RSS, what Ofqual could have done differently. He was surprised that the UK had not attempted any kind of socially-distanced, in-person assessments, as some other European countries had done. But, given that some kind of statistical grade allocation was needed, Nason pointed out some specific pitfalls that could have been avoided, if a wider range of experts had been involved at an earlier stage.

“Overall, I think they ignored, or in some cases, underestimated uncertainty in many steps of their process. For example, the teacher-provided rankings for students within a subject were treated as if they were correct, when, in all likelihood, they are subject to considerable uncertainty. So, for example, tied rankings were not permitted, which might have resulted in students being assigned different grades even though their Centre thought that they were indistinguishable.” He also thought it was unfair to assess students by a completely different method if they happened to be part of a small class.

Nason had serious concerns about how Ofqual tested the predictive ability of their algorithm. “You’ve got to run a fair test, one that runs under the same conditions as the real deal.”

The normal way to test a predictive algorithm is to see how good it is at predicting the past. That is, you run the program for the previous year and see how well its predictions match what happened in real life. Ofqual did that for 2019, but because teachers in previous years were not asked to rank students, Ofqual could not use 2019’s teacher-generated rank orders for a test run.

Instead, it used the rank order that emerged from the 2019 exam results. Which is like showing you can predict the results of a horse race by including data about the order in which the horses crossed the finish line in that same race. “If a test uses aspects of the same data that it is trying to predict, then it results in a false sense of security,” says Nason.
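The horse-race problem can be shown with a toy simulation. The sketch below is purely illustrative and uses made-up numbers: the grade boundaries, the score distribution and the amount of noise in the teacher’s ranking are all assumptions, and the function names are hypothetical. To isolate the effect of the ranking, it also hands the model the true number of each grade, which Ofqual’s test did not.

```python
import random
from collections import Counter

GRADES = ["A", "B", "C", "D", "E"]  # best to worst


def grade_from_score(score: float) -> str:
    # Hypothetical grade boundaries, chosen only for illustration.
    for grade, boundary in (("A", 80), ("B", 70), ("C", 60), ("D", 50)):
        if score >= boundary:
            return grade
    return "E"


def allocate(ranking, counts):
    """Give the top counts['A'] students an A, the next counts['B'] a B, etc."""
    result, start = {}, 0
    for grade in GRADES:
        k = counts.get(grade, 0)
        for student in ranking[start:start + k]:
            result[student] = grade
        start += k
    return result


def accuracy(predicted, actual):
    return sum(predicted[s] == actual[s] for s in actual) / len(actual)


def run_backtest(n_students=200, noise_sd=10.0, seed=0):
    rng = random.Random(seed)
    # Simulated exam marks and the grades they would really have earned.
    scores = {s: rng.gauss(60, 15) for s in range(n_students)}
    actual = {s: grade_from_score(x) for s, x in scores.items()}
    counts = Counter(actual.values())

    # "Leaked" ranking: the order the students finished in the real exam.
    leaked_rank = sorted(scores, key=scores.get, reverse=True)

    # Teacher ranking: made in advance, so an imperfect view of the marks.
    teacher_view = {s: x + rng.gauss(0, noise_sd) for s, x in scores.items()}
    teacher_rank = sorted(teacher_view, key=teacher_view.get, reverse=True)

    return (accuracy(allocate(leaked_rank, counts), actual),
            accuracy(allocate(teacher_rank, counts), actual))


leaked_accuracy, teacher_accuracy = run_backtest()
```

In this toy setup the leaked ranking reproduces every grade exactly, because the ranking already contains the answer. A ranking made in advance can only match or fall short of that, which is why a test built on the finishing order flatters the model.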

Even by including some of the data they were trying to predict, Ofqual found their accuracy in predicting exact grades ranged from two thirds for History to one in four for Italian. For most non-language subjects, over nine in 10 students would be within one grade of the true result, but 3% of Maths students (for example) missing a fair result by two grades or more adds up to a lot of teenagers. Over 10% of Further Maths students, ironically the only ones who can understand the tortuous workings of the algorithm that betrayed them, would be over a grade away from a fair result.

“Their algorithm’s predictability is, especially for some subjects, not that good anyway, but if you then realise that they are over-optimistic and cannot be trusted, then one has to really question whether the algorithm is fit for purpose,” says Nason.

This isn’t just hindsight talking. The RSS offered to nominate two distinguished experts to the Ofqual technical advisory group in March. Nason was one of them. But Ofqual wanted to impose a Non-Disclosure Agreement that would bar them from public comment on the model for five years, in direct contradiction of the Society’s commitment to transparency and public trust.

Ofqual also ignored the House of Commons Education Select Committee’s call to publish details of their methods before releasing the results. They might have been spared some of the post-hoc dissection of their work, before public outcry and political pain caused Monday’s abandonment of the algorithm.

Because the grading algorithm has been withdrawn for political, not statistical, reasons.

Its workings seem to have hit harder the very students who already felt the cards were stacked against them, and the communities to whom this government promised “levelling up”. Students in state schools and FE colleges, especially, and more deprived students, saw their awarded grades fall well short of their teacher-predicted grades.

It’s no surprise that small teaching groups, and less popular A-levels like Law, Ancient Greek and Music, are more common at private schools, which insulated those students from being marked down. Nor that previous results in selective and private schools would have been higher, bequeathing a higher range of grades to this year’s cohort.

But the picture is messier than that. Historically, teachers in large state schools, and of more deprived students, have been more likely to over-predict. There may be good reasons for this. Teachers may consciously give a student the benefit of the doubt, figuring that it at least gives them a shot at a good university place. If they fall short, they can haggle later. If they exceed more honest expectations, it might be too late for them to raise their sights.

And in a high-attaining school where students routinely get A and A* exam results, there is not much headroom for over-optimism, unlike schools whose students walk away with the full range from A* to U.

Whatever the causes, that over-prediction means that every year state school students are more likely to find their exam results lower than their predicted grades. Universities are often flexible, recognising that it’s easier to get good grades in a good school, and that students who fought harder for OK results often do better at university than the ones who got good results in easier circumstances.

It’s bitter to be disappointed with your exam results. Perhaps, like me in those Stats exams, you turned over the paper and finally acknowledged, too late, how poorly your work matched the standards expected for the subject. But even if you were unlucky on the day with what was on the paper, or with your own state of mind, you still had your chance to do the best you could.

To find that a faceless system has allocated you to a lower grade, simply because your school hasn’t previously achieved much in this subject, looks like the epitome of systematic unfairness. Why did you bother to put in all that work, only to be pre-judged on the assumption that you’re homogeneous with your older schoolmates?

That’s the embarrassing truth about algorithms. They are prejudice engines. Whenever an algorithm turns data from the past into a model, and projects that model into the future to be used for prediction, it is working on a number of assumptions. One of the more basic assumptions is that the future will look like the past and the present, in significant ways.

You may think you can beat the odds stacked against you by your low-attaining school, and your lack of extra-curricular extras, and your having to do homework perched on your bed in a shared bedroom, but the algorithm thinks otherwise. Isn’t it strange that we are repelled by prejudice in other contexts, but accept it when it’s automated?

Until now. Now school students shouting “Fuck The Algorithm” have forced a Government U-turn. Some of them seem to think the whole business was an elaborate ploy to punish the poor, instead of a clumsy attempt at automated fairness on a population scale. But some of them must be wondering what other algorithms are ignoring their human agency and excluding them from options in life because of what others did before them: Car insurance? Job adverts? Dating apps? Mortgage offers?

Some disgruntled Further Maths students will no doubt go on to write better algorithms, but that won’t solve the problem. As the RSS wrote to the Office for Statistics Regulation, “‘Fairness’ is not a statistical concept. Different and reasonable people will have different judgments about what is ‘fair’, both in general and about this particular issue.”

You don’t need Maths, or Further Maths, or even a 2:2 in Maths and Statistics, to question what assumptions are being designed into mathematical models that will affect your chances in life. Anyone can argue for their idea of what fairness means. Algorithms, and what we let them decide, are too important to be left to statisticians.

Timandra Harkness presents the BBC Radio 4 series, FutureProofing and How To Disagree. Her book, Big Data: Does Size Matter? is published by Bloomsbury Sigma.