July 9, 2020 - 11:06am

There’s a new paper out in Nature, led by Ben Goldacre of Bad Science fame, which is really interesting. It’s been available in preprint for a month, but it has just been properly published.

For starters, it’s interesting because it looks at risk factors for death among Covid patients, and confirms, for instance, higher rates of death among black and Asian patients that can’t be fully explained by pre-existing conditions. It also found a muddy and confusing picture when it comes to smoking and Covid. Current smokers seem to do better than non-smokers, but this is complicated by the fact that some non-smokers will be former smokers who have quit because of smoking-related disease.

When the researchers accounted for that, the protective effect seemed to disappear.

(This is a really interesting problem, usually called overadjustment and closely related to collider bias. The study controlled for lung disease: it compared smokers with lung disease to never-smokers and former smokers with lung disease, and smokers without lung disease to never-smokers and former smokers without it. But of course smoking causes lung disease! So by controlling for it, you’re removing the exact thing you’re looking for. An analogy: if you looked at whether obesity shortens life, but controlled for whether someone has diabetes, you might find that it doesn’t very much, because obesity often kills people by giving them diabetes.)
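To see how controlling for lung disease can hide the harm from smoking, here is a toy simulation in Python. Every number in it is invented for illustration and none comes from the study: it assumes 30% of people smoke, that smoking raises the chance of lung disease from 5% to 40%, and that lung disease (and only lung disease) raises the chance of death from 2% to 12%. In other words, it is the extreme case in which smoking does all of its damage by way of lung disease.

```python
import random

def simulate(n=200_000, seed=0):
    """Generate fake people as (smoker, lung_disease, died) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smoker = rng.random() < 0.30                              # assumed prevalence
        lung_disease = rng.random() < (0.40 if smoker else 0.05)  # smoking -> lung disease
        died = rng.random() < (0.12 if lung_disease else 0.02)    # lung disease -> death
        rows.append((smoker, lung_disease, died))
    return rows

def death_rate(rows):
    return sum(died for _, _, died in rows) / len(rows)

def risk_difference(rows):
    """Death rate among smokers minus death rate among non-smokers."""
    smokers = [r for r in rows if r[0]]
    non_smokers = [r for r in rows if not r[0]]
    return death_rate(smokers) - death_rate(non_smokers)

rows = simulate()

# Crude comparison: all smokers against all non-smokers.
crude = risk_difference(rows)

# "Controlling for" lung disease: compare only within each lung-disease group.
with_disease = risk_difference([r for r in rows if r[1]])
without_disease = risk_difference([r for r in rows if not r[1]])

print(f"crude difference:            {crude:+.3f}")
print(f"among those with disease:    {with_disease:+.3f}")
print(f"among those without disease: {without_disease:+.3f}")
```

Under these made-up assumptions, the crude comparison should show smokers dying more often (by roughly three and a half percentage points), while the two within-group comparisons should come out close to zero, even though smoking is doing all the harm.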

But what makes it really interesting is not so much its findings as its size (17 million subjects), which is possible because of its innovative methods. In Britain, we have an incredible scientific resource: a centrally controlled health system which in theory gives researchers access to data on the entire population.

In reality, it’s not so easy, because a lot of that data, even if pseudonymised, can be used to identify patients; so there are lots of vitally necessary safeguards which make it hard to share it for research purposes.

The Goldacre et al. study gets around that by letting the data stay exactly where it is. They developed software which could be given to the local NHS trusts that hold the data; the analysis was then done in-house. The raw, identifiable data stays with the trusts; the more high-level summaries are available for other researchers. The more abstract the summary, the less sensitive it is, and the more widely it’s made available.
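The general idea can be sketched in a few lines of Python. This is not the OpenSAFELY code or its API: the function name `run_inside_data_holder`, the record fields and the `MIN_CELL` threshold of 5 are all assumptions made up for illustration, and the patient records are fake.

```python
from collections import Counter

MIN_CELL = 5  # assumed threshold: smaller counts are withheld as potentially identifying

def run_inside_data_holder(records, group_key):
    """Runs where the raw records live and returns only aggregate counts."""
    patients = Counter()
    deaths = Counter()
    for rec in records:
        group = rec[group_key]
        patients[group] += 1
        deaths[group] += rec["died"]

    summary = {}
    for group, n in patients.items():
        d = deaths[group]
        if n >= MIN_CELL and d >= MIN_CELL:
            summary[group] = {"patients": n, "deaths": d}
        else:
            summary[group] = "suppressed"  # too few people to release safely
    return summary

# Fake raw data that never leaves the data holder.
records = [
    {
        "patient_id": i,
        "age_band": "80+" if i < 20 else "under 80",
        "died": 1 if i < 8 or i in (20, 21) else 0,
    }
    for i in range(40)
]

# Only this summary is handed back to the researcher.
summary = run_inside_data_holder(records, "age_band")
print(summary)
```

The researcher sees counts per group and nothing about any individual; a group with too few deaths to report safely is withheld altogether. A real system needs far more than this (audit logs, output checking, access controls), but the shape is the same: the code travels to the data, and only summaries travel back.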

It’s also all completely open-source, so anyone can download, use, or check the code behind the OpenSAFELY system that runs it; and, Goldacre points out to me, it “essentially forces” everyone who wants to use it for their own research to similarly share their code, which is sadly necessary sometimes.

As Goldacre says in this more in-depth examination, it would be completely unthinkable to share the data of roughly 40% of patients in England in the traditional, download-a-dataset-and-run-analyses way. But by breaking it apart like this, doing it in-house, and making the more abstracted versions available, he and his team were able to get access to 17 million people’s data without compromising their privacy. The findings haven’t shown us anything wildly surprising, but that’s not the point; the point is that this is a way of making full use of one of the British health system’s most powerful scientific advantages.

Tom Chivers is a science writer. His second book, How to Read Numbers, is out now.