Finally! A way to analyse NHS data from 17 million people

July 9 2020 - 11:06am

There’s a new paper out in Nature, led by Ben Goldacre of Bad Science fame, which is really interesting. It’s been available as a preprint for a month, but it has now been properly published.

For starters, it’s interesting because it looks at causes of death among Covid patients, and confirms, for instance, higher rates of death among black and Asian patients that can’t be fully explained by pre-existing conditions. It also found a muddy and confusing picture when it comes to smoking and Covid. Current smokers seem to do better than non-smokers, but this is complicated by the fact that some non-smokers will be former smokers who have quit because of smoking-related disease.

When the researchers accounted for that, the protective effect seemed to disappear.

(This is a really interesting problem called collider bias. The study controlled for lung disease: it compared smokers with lung disease to never-smokers and former smokers with lung disease, and smokers without lung disease to never-smokers and former smokers without it. But of course smoking causes lung disease! So by controlling for it, you’re removing the exact thing you’re looking for. An analogy: if you looked at whether obesity shortens life, but controlled for whether someone has diabetes, you might find that it doesn’t very much, because obesity often kills people by giving them diabetes.)
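The effect of controlling for lung disease can be seen in a toy simulation. This is only a sketch: every probability below is invented for illustration and none comes from the study, and the function names are hypothetical. In this made-up world, smoking raises the risk of death only by causing lung disease, so once you compare people within the same lung-disease group, the smoking effect vanishes.

```python
import random


def simulate(n=200_000, seed=0):
    """Generate (smoker, lung_disease, died) tuples for an invented population.

    Assumption: smoking affects death ONLY through lung disease.
    All probabilities are made up for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smoker = rng.random() < 0.30
        lung_disease = rng.random() < (0.40 if smoker else 0.05)
        died = rng.random() < (0.20 if lung_disease else 0.02)
        rows.append((smoker, lung_disease, died))
    return rows


def death_rate(rows, smoker, lung_disease=None):
    """Death rate among people with the given smoking status.

    If lung_disease is given, restrict to that lung-disease group too.
    """
    selected = [
        died
        for s, l, died in rows
        if s == smoker and (lung_disease is None or l == lung_disease)
    ]
    return sum(selected) / len(selected)


def risk_differences(rows):
    """Return the crude smoker-vs-non-smoker gap and the gap within each
    lung-disease group (which is what 'controlling for' it amounts to here)."""
    crude = death_rate(rows, True) - death_rate(rows, False)
    within = {
        lung: death_rate(rows, True, lung) - death_rate(rows, False, lung)
        for lung in (True, False)
    }
    return crude, within


if __name__ == "__main__":
    crude, within = risk_differences(simulate())
    print(f"crude difference in death rate: {crude:.3f}")
    print(f"difference among people with lung disease: {within[True]:.3f}")
    print(f"difference among people without lung disease: {within[False]:.3f}")
```

The crude comparison shows smokers dying more often, while the within-group comparisons hover around zero, even though smoking is the only cause of excess death in this toy population.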

But what makes it really interesting is not so much its findings as its size (17 million subjects), which is possible because of its innovative methods. In Britain, we have an incredible scientific resource: a centrally controlled health system which in theory gives researchers access to data of the entire population.

In reality, it’s not so easy, because a lot of that data, even if pseudonymised, can be used to identify patients; so there are lots of vitally necessary safeguards which make it hard to share it for research purposes.

The Goldacre et al. study gets around that by letting the data stay exactly where it is. They developed software which could be given to the local NHS trusts that kept the data; the analysis was then done in-house. The raw, identifiable data stays with the trusts; the more high-level summaries are available for other researchers. The more abstract the summary, the less sensitive it is, and the more widely it’s made available.
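The general idea of sending the analysis to the data, rather than the data to the analyst, can be sketched in a few lines. This is not the OpenSAFELY code or its API: `PatientRecord`, `DataHolder`, `deaths_by_age_band` and the small-count suppression rule are all hypothetical names and assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    """Hypothetical raw record; in this sketch it never leaves the DataHolder."""
    patient_id: str
    age_band: str
    died: bool


class DataHolder:
    """Hypothetical stand-in for the organisation that keeps the raw records."""

    def __init__(self, records, min_cell_count=5):
        self._records = list(records)
        # Assumption: counts below this threshold are withheld, since very
        # small groups could help identify individual patients.
        self._min_cell_count = min_cell_count

    def run(self, analysis):
        """Run the researcher's analysis in-house and release only a summary."""
        counts = analysis(self._records)
        return {
            key: (n if n >= self._min_cell_count else None)
            for key, n in counts.items()
        }


def deaths_by_age_band(records):
    """Researcher-supplied analysis: count deaths per age band."""
    return Counter(r.age_band for r in records if r.died)


if __name__ == "__main__":
    # Invented example records, not real data.
    records = (
        [PatientRecord(f"a{i}", "80+", True) for i in range(6)]
        + [PatientRecord(f"b{i}", "40-49", True) for i in range(2)]
        + [PatientRecord(f"c{i}", "40-49", False) for i in range(20)]
    )
    holder = DataHolder(records)
    print(holder.run(deaths_by_age_band))
```

The researcher only ever sees the dictionary returned by `run`, and groups too small to release safely come back as `None`. A real system would also need to run the submitted code in isolation, which this sketch does not attempt.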

It’s also all completely open-source, so anyone can download, use, or check the code behind the OpenSAFELY system that runs it; and, Goldacre points out to me, it “essentially forces” everyone who wants to use it for their own research to similarly share their code, which is sadly necessary sometimes.

As Goldacre says in this more in-depth examination, it would be completely unthinkable to share the data of 40% of the UK population in the traditional, download-a-dataset-and-run-analyses way. But by breaking it apart like this, doing it in-house, and making the more abstracted versions available, he and his team were able to get access to 17 million people’s data without compromising their privacy. The findings haven’t shown us anything wildly surprising, but that’s not the point; the point is that this is a way of making full use of one of the British health system’s most powerful scientific advantages.

Tom Chivers is a science writer. His second book, How to Read Numbers, is out now.

