September 19, 2019

The most important sentence in The Art of Statistics: Learning from Data, a new book by David Spiegelhalter, is buried in a footnote on page 358. When it comes to statistical stories in the news, he says, he applies what he calls “the Groucho principle”, after Groucho Marx’s dictum that he would never join a club that would accept him as a member.

That’s because any statistic that is interesting enough to have become a news story, he claims, has probably passed through “so many filters that encourage distortion and selection”, that “the very fact that I am hearing a claim based on statistics is reason to disbelieve it”.

I have a similar rule of thumb, which I’ve said here before: if a statistic is interesting, it’s probably not true.

The trouble is that we can’t do without statistics. There are simply too many people and too many things in the world to understand what’s going on without simplifying and aggregating them into numbers; we can’t think about every individual NHS patient, so we need to think in terms of average survival rates or waiting times. We have to treat people like numbers if we want to do the best for those people, so we need to learn to be better at interpreting those numbers.

But statistics can be misleading – in some fascinating ways. In his book, Spiegelhalter, a Cambridge statistician and former president of the Royal Statistical Society, points some out: Simpson’s paradox, for instance. In 1996, the admissions data for five STEM subjects at Cambridge showed that men were more likely to be admitted: 24% of men’s applications were accepted, compared to 23% of women’s.

Is this evidence of bias? Well: possibly. But not, interestingly, of discrimination against women. Across the five subjects – economics, engineering, computer science, medicine, and veterinary medicine – women were actually more likely to be accepted in four (in the fifth, medicine, the rates were equal). Sometimes, that difference was quite stark – 16% of women compared to 12% of men were accepted onto the veterinary course; 32% of women against 26% of men into engineering.

How is this possible? Because medicine and veterinary medicine were much more popular, so your chances of being accepted were much lower; and because women were much more likely to apply for medical or veterinary degrees. By choosing the harder-to-access courses, women were less likely to be accepted overall, even though, individually, they were more likely to be accepted than a man applying for the same course. Similar findings have been seen elsewhere.
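To see how that can happen, here is a minimal sketch in Python. The applicant counts are invented for illustration and are not the Cambridge figures; only the per-course acceptance rates (32% against 26%, 16% against 12%) echo the ones quoted above, and the two course names are just labels.

```python
# Hypothetical applicant counts, invented for illustration only.
# These are NOT the 1996 Cambridge figures.
# course -> {sex: (applicants, accepted)}
applications = {
    "engineering": {"women": (100, 32), "men": (400, 104)},
    "veterinary": {"women": (400, 64), "men": (100, 12)},
}


def rate(applicants, accepted):
    """Acceptance rate for one group on one course."""
    return accepted / applicants


def overall_rate(sex):
    """Acceptance rate for one sex, pooled across every course."""
    applicants = sum(course[sex][0] for course in applications.values())
    accepted = sum(course[sex][1] for course in applications.values())
    return accepted / applicants


for name, course in applications.items():
    print(name, {sex: f"{rate(*counts):.0%}" for sex, counts in course.items()})

print("overall", {sex: f"{overall_rate(sex):.0%}" for sex in ("women", "men")})
```

With these made-up counts, women do better than men on each course but worse once the courses are pooled, purely because most of the women applied to the course that rejects more people.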

Another example is the “prosecutor’s fallacy”. Imagine a suspect is in court. If they’re innocent, there is only a one in a billion chance that their DNA would match the DNA found at the crime scene, and yet it does. Does this mean that there is only a one in a billion chance that they’re innocent? No; far from it. Nonetheless, this fallacy has led to real women being wrongly convicted of child murder, after several of their children died of sudden infant death syndrome.

An analogous problem is cancer testing. Say your new cancer test is 95% accurate; it will answer the question “do you have cancer” correctly 95% of the time. If you test positive, do you have a 95% chance of having cancer?

Not at all! Imagine that 1% of people have cancer. You run your test on 10,000 people. Of the 100 people who do have cancer, it rightly tells 95 of them that they do have cancer. Of the 9,900 people who don’t have cancer, it rightly tells 9,405 that they don’t have cancer. But this means that it has wrongly told 495 people that they do have cancer. If you test positive for cancer, your chances of having cancer are not 95%, but 95 in 590, or about 16%. The DNA-test situation is exactly the same: your test is “accurate” 999,999,999 times out of a billion, but will still regularly be wrong, because the thing it is looking for (the specific guilty person) is very rare.
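The arithmetic above can be checked in a few lines of Python. This is only a sketch: the function name is made up, and it assumes, as the example does, that the test is exactly as accurate for people with cancer as for people without it.

```python
def chance_of_cancer_given_positive(prevalence, accuracy, population=10_000):
    """Share of positive results that belong to people who really have cancer.

    Assumption: the test is right `accuracy` of the time for both sick
    and healthy people, as in the worked example in the text.
    """
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * accuracy            # sick people told they are sick
    false_positives = healthy * (1 - accuracy)  # healthy people told they are sick
    return true_positives / (true_positives + false_positives)


# 1% of people have cancer, and the test is 95% accurate.
print(f"{chance_of_cancer_given_positive(0.01, 0.95):.0%}")
```

Lowering the prevalence in the first argument shows the effect getting worse: the rarer the thing you are testing for, the larger the share of positives that are false alarms.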

These are just a couple of the reasons why even straightforward-sounding statistics can be misleading. They’re not even the big reasons – they’re just a couple that I thought you might not have heard of.

Things like statistical bias, small sample sizes, or multiple analyses of the same data will have more of an impact, as will the Groucho principle that the most dramatic findings are both more likely to be well-publicised and more likely to be false. But the two examples above illustrate the fact that you need to be very careful of almost any statistical story you read.

At this point, I should admit that I have a dirty little secret, which is that I am not very good at maths. I mean, I’m fine. I’m OK. I got an A at GCSE, in 1997. But I’m not good. This is a problem, because I write an awful lot about statistics. Spiegelhalter tries to make the case that you don’t need to be very good at maths to be good at statistics, and he sort of succeeds – quite a lot of it you can grasp via the concepts, or through visualisations. But still, as an indifferent mathematician, I often found it hard to follow him.

As a non-mathematician, I have a few shortcuts for working out whether a statistic is worth believing, which seem to have done all right for me so far. One, which Spiegelhalter stresses, is that often the best statistical analysis you can do is simply visualising the data. There was a bit of a recent kerfuffle about suicides among girls and young women going up 83% since 2012; but simply looking at the ONS chart showed that the numbers were small, the data was noisy, and the only way you got the 83% figure was by choosing the lowest year on record. (It’s an old trick.)

Another useful hack is to ask “is this a big number?” Lots of statistics are presented as bald numbers: “X people have died”. But out of how many? If 25 people have died of cancer out of a population of 50, that’s awful; if 25 people have died of cancer out of a population of 50 million, that’s a statistical miracle. That’s one reason why I’m always wary of stories that say, for instance, “10 or more black transgender women have been killed this year” without telling me how many black transgender women there actually are, and whether that is more or less than we should expect.

Similarly, if a story tells you that something has “gone up by 25%”, or that something “increases your risk by 90%”, without telling you what it’s increased it from and to – if it gives you the “relative risk” but not the “absolute risk”, in technical terms – that’s an indicator that it may be worth ignoring. If something doubles your risk from one in 100 million to one in 50 million, you probably don’t care.
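The gap between the two ways of reporting the same change is easy to see with the figures in that last example. The sketch below is illustrative only; the function name is hypothetical.

```python
from fractions import Fraction


def describe_risk(before, after):
    """Return (relative increase, absolute increase) between two risks."""
    relative_increase = (after - before) / before
    absolute_increase = after - before
    return relative_increase, absolute_increase


# Risk goes from one in 100 million to one in 50 million.
relative, absolute = describe_risk(Fraction(1, 100_000_000), Fraction(1, 50_000_000))

print(f"relative increase: {float(relative):.0%}")
print(f"absolute increase: {float(absolute * 1_000_000_000):.0f} extra cases per billion people")
```

A headline could truthfully say the risk has risen by 100%, yet the absolute change is ten extra cases for every billion people.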

A third is asking “can I be sure what’s causing this?” Spiegelhalter dedicates a chapter to how you establish causality. It is amazingly hard to do this in the absence of large-scale randomised controlled trials. Austin Bradford Hill, one of the scientists – with Richard Doll – who first established that smoking causes lung cancer, suggested ways of doing so: looking for plausible mechanisms, making sure that the supposed effect follows the supposed cause, seeing if there’s a “dose response” – that is, the more of the cause there is, the more of the effect you see – and, importantly, making sure that the effect is sufficiently large that it couldn’t be a coincidence or some statistical confounding.

If you’re looking to see if something (say, a rise in teen suicides) is caused by something else (say, social media) then it’s worth looking to see if any rise in teen suicides follows the growth in social media, and whether each doubling in social media use leads to a consistent increase in suicide rate.

But also, it’s important to see whether the suicide rate tends to jump around a lot anyway. If it does, then perhaps there’s nothing to explain; it just happened to be low one year and high another. (Spiegelhalter has useful tricks for checking whether the variation you see is the variation you’d expect, such as using the average count to set up a “Poisson distribution” and seeing whether the observed numbers fall within the range it predicts, but just looking at a graph and seeing whether it bounces around a lot is a good start.)
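As a rough sketch of that kind of check, the Python below asks what range of yearly counts a Poisson distribution would produce by chance alone. It is a simplification and not necessarily the method in the book: the yearly counts are invented, the function names are hypothetical, and it assumes the underlying rate is constant from year to year.

```python
import math


def poisson_pmf(k, mean):
    """Probability of seeing exactly k events when `mean` are expected."""
    # Work in logs so large counts do not overflow.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def poisson_interval(mean, coverage=0.95):
    """Range of counts covering roughly the central `coverage` of Poisson(mean)."""
    tail = (1 - coverage) / 2
    cumulative, k, low = 0.0, 0, None
    while True:
        cumulative += poisson_pmf(k, mean)
        if low is None and cumulative > tail:
            low = k
        if cumulative >= 1 - tail:
            return low, k
        k += 1


# Hypothetical yearly counts, invented for illustration only.
counts = [12, 9, 15, 11, 8, 14, 10]
mean = sum(counts) / len(counts)
low, high = poisson_interval(mean)

print(f"mean {mean:.1f}, chance alone would give roughly {low} to {high}")
print("years outside that range:", [c for c in counts if not low <= c <= high])
```

Every one of these invented counts sits inside the expected range, even though the jump from the lowest year (8) to the highest (15) is a rise of nearly 90%. That is the sort of noise a dramatic percentage can hide.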

The statistician George Box said: “All models are wrong, but some are useful.” When you’re reducing millions of people and events to a few numbers, you’re building a model of the world, and that model must be simpler than the world, so it will be wrong in some way. You’re always making assumptions and drawing arbitrary lines.

For instance, we talk about the “murder rate” in Britain; statisticians treat a murder like an event that happens to people at random, like the decay of a subatomic particle, with a 0.0008% or so chance of it happening to any given person per year. But that’s not how it works. Each murder is the result of real people making real decisions, not some lightning bolt out of a clear blue sky; it just happens that in a large enough population, non-random events can be treated as though they were random.

We can’t do it any other way. We can’t look at every person, and we can’t possibly predict every individual event, whether it’s the outcomes of heart operations or teen suicides or murders. We have to aggregate them, round off the corners, make simplifying assumptions; we have to treat people like numbers. We’ll get it wrong a lot of the time, and we have to be immensely cautious about any conclusions we draw.

There is one simple step, though, which will help a lot of the time. It’s a bit nihilistic, perhaps, but it’s powerful. And that is the Groucho principle that we discussed at the beginning: if the statistic makes a headline, it’s probably wrong.