A 'correlation' is not quite what it seems...
Another day, another exciting-looking correlation in the world of Covid-19! Researchers at Yale have published a preprint looking at the correlation between the amount of novel coronavirus in sewage and the number of hospital admissions three days later, and they found an almost perfect match.
Imagine you’re measuring two numbers and you want to see how closely one tracks the other. Let’s say height and weight. So you go and ask random people in the street how tall they are and how heavy they are. You’ll notice that, on average, tall people are heavier. But sometimes, you get tall skinny people or short fat people, so the correlation isn’t perfect.
In statistics, correlations are measured in a number called R. (Not that one.) If the correlation is perfect, so that the points fall exactly on a straight line and knowing someone’s height tells you their weight precisely, then R=1; if there is no correlation, and height and weight vary totally randomly with no link to each other, then R=0. (It can also be negatively correlated: if one goes up, the other goes down, and R drops below zero, as far as -1.) You can get a sense of what different R values look like by playing this game.
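If it helps to see the arithmetic, here is a minimal sketch of how R (the Pearson correlation coefficient) is calculated. The heights and weights are made-up illustrative numbers, not real data, and `pearson_r` is just a name chosen for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together...
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ...scaled by how much each one moves on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Invented heights in cm (illustrative only).
heights = [150, 160, 170, 180, 190]

# Weight sits exactly on a straight line of height: a perfect correlation.
weights_on_a_line = [0.9 * h - 90 for h in heights]

# A quantity that falls exactly as height rises: a perfect negative correlation.
falls_with_height = [200 - h for h in heights]

# Invented, messier weights: taller is heavier on average, but not always.
weights_messy = [52, 50, 70, 66, 85]

r_perfect = pearson_r(heights, weights_on_a_line)
r_negative = pearson_r(heights, falls_with_height)
r_messy = pearson_r(heights, weights_messy)

print(r_perfect, r_negative, r_messy)
```

The first two results come out at 1 and -1 (give or take floating-point rounding), and the messy one lands somewhere in between. What pushes R towards 1 is how tightly the points hug a straight line, not how steep that line is.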
The Yale researchers found that coronavirus in poo on day 0 correlated with hospital admissions on day 3 with an R=0.99. That is ridiculous. It is literally saying “if coronavirus levels in sewage triple on Sunday, you should see almost exactly three times as many people admitted to hospital on Wednesday.”
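For what a "day 0 versus day 3" correlation means in practice, here is a rough sketch. The series are randomly generated toy numbers, not the Yale data, and `lagged_r` is a hypothetical helper written for this example: it pairs each day's sewage reading with the admissions figure a fixed number of days later.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


def lagged_r(earlier, later, lag):
    """Correlate earlier[t] with later[t + lag] for every day t that has a partner."""
    if lag < 0 or lag >= len(earlier):
        raise ValueError("lag must be between 0 and len(earlier) - 1")
    if lag == 0:
        return pearson_r(earlier, later)
    return pearson_r(earlier[:-lag], later[lag:])


rng = random.Random(1)

# Toy sewage readings for 30 days (arbitrary units, invented).
sewage = [rng.uniform(10, 100) for _ in range(30)]

# Toy admissions: exactly half the sewage reading from three days earlier.
# Real admissions would never follow sewage this obediently.
admissions = [0.0, 0.0, 0.0] + [0.5 * v for v in sewage[:-3]]

r_lag3 = lagged_r(sewage, admissions, 3)
r_lag0 = lagged_r(sewage, admissions, 0)

print(r_lag3, r_lag0)
```

Only because the toy admissions are a rigid, delayed copy of the sewage numbers does the three-day figure come out at 1. An R of 0.99 on real hospital data would mean the real world was behaving almost as neatly as this artificial one.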
It got a lot of attention, but in any human-behaviour-related study, a correlation of 0.99 is frankly unbelievable. As Alex Danvers says in a good blog post (which annoyingly scooped me as I was thinking about writing this), in psychology an R of 0.1 is pretty good. Even height and weight probably only correlate at about 0.7.
Here’s what went wrong, as pointed out (and explained to me) by the indefatigable bad-science-debunker Nick Brown. They weren’t checking the correlation between the raw numbers — they were checking the correlation between correlations.
Brown uses an analogy: imagine instead of measuring height vs weight, we’d measured height vs the last digit on your National Insurance number. We’d find no correlation: R=0.
Then imagine we measured weight vs the fourth digit on your NI number. We’d find no correlation: R=0.
But then imagine we checked the correlation between those two correlations. We’d find two flat lines! They correlate perfectly! R=1!
This isn’t exactly what’s gone on, but it illustrates it. They’re comparing the correlation of virus-in-poo to time with the correlation of hospital admissions to time, rather than virus-in-poo to hospital admissions directly. That smooths the curves and makes it look like a closer correlation.
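To see how smoothing can flatter a correlation, here is a small simulation. Everything in it is an assumption made for illustration: the two series are invented (a shared gentle upward trend plus a lot of independent day-to-day noise), and a simple moving average stands in for whatever smoothing the preprint actually used.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


def moving_average(values, window):
    """Average each run of `window` consecutive values (output is shorter than input)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


rng = random.Random(0)
days = 300

# A gentle trend over time that both toy series share.
trend = [d / days for d in range(days)]

# Each series is the trend plus its own large, independent daily noise.
virus = [t + rng.gauss(0, 0.5) for t in trend]
admissions = [t + rng.gauss(0, 0.5) for t in trend]

# Correlation of the raw daily numbers.
raw_r = pearson_r(virus, admissions)

# Correlation after smoothing each series with a 15-day moving average.
smoothed_r = pearson_r(moving_average(virus, 15), moving_average(admissions, 15))

print(raw_r, smoothed_r)
```

Smoothing averages away most of the daily noise and leaves the shared trend, so the two smoothed curves line up far better than the raw numbers do. The link between the raw measurements is no stronger than it was before.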
(There are other problems but this post is too long already.)
Brown pointed all this out to the authors, and they’ve taken down one hugely viral tweet and are looking to correct the preprint. In his own look at the data he finds a correlation of between 0.14 and 0.4. That’s still important and useful, if it’s real! But you can’t say “if virus in poo goes up 4x today, you’ll see a fourfold increase in hospital admissions in three days’ time.”
Addendum: The Guardian raises some very serious concerns with the Lancet study into hydroxychloroquine that I mentioned in my last piece. It’s worth noting because while it undermines my specific point about hydroxychloroquine, it very much supports my case for being wary of fast science in the pandemic.