Do people with severe depression have a right to accurate information about antidepressants? I suspect most people would answer “yes”. There is a general understanding that individuals who suffer from medical conditions are in a vulnerable position, making them susceptible to misinformation. There is also increased awareness of the influence that the profit motive can have on how medical research is funded, undertaken and communicated to the public.

But for some reason, this basic principle doesn’t seem to apply to the hyper-politicised subject of gender medicine. On one side, Republican states are attempting to ban youth gender medicine — and, in some cases, to dial back access to adult gender medicine. On the other, liberals maintain that there is solid evidence for these treatments, and that only an ignorant person could suggest otherwise.

Whether or not you agree with the GOP’s stance (I do not), the latter view is simply false. The trajectory of youth gender medicine in nations with nationalised healthcare systems has been relatively straightforward: these countries keep conducting careful reviews of the evidence for puberty blockers and hormones, and they keep finding that there is very little such evidence to speak of. That was the conclusion in Sweden, Finland, the UK, and, most recently, Norway. As a recent headline in The Economist had it: “The evidence to support medicalised gender transitions in adolescents is worryingly weak.”

Yet despite this evidentiary crisis in Europe, and despite multiple scandals vividly demonstrating the downside of administering these treatments in a careless way, liberal institutions in the US have only become more enthusiastic about them. In recent years, everyone from Jon Stewart and John Oliver to reporters and pundits at the New York Times, The Washington Post and NPR has exaggerated the evidence for these interventions.

The logic seems to be that if activists, doctors and journalists repeat “The evidence is great!” enough times, regardless of whether the evidence actually is great, the controversy will go away, as though the state of Arkansas could be shamed into reversing its policy on trans youth because Jon Stewart made fun of it. Meanwhile, as I can tell you from experience, if you openly question these treatments or highlight just how little we know about them, you’re going to have a bad time.

But look a little closer, and it swiftly becomes clear that the evidence for both adult and youth gender medicine is frequently drawn from alarmingly low-quality studies. Almost invariably, when you examine the latest study to go viral, there’s much less there than meets the eye — whether because of serious overhyping and questionable statistical choices on the part of the researchers, outright missing data, flawed survey instruments, more missing data, or just generally beyond-broken methods.

Since any individual study or group of studies can suffer from these issues, serious researchers know that you can’t just take a few that point in the right direction and herald them as evidence. Rather, you need to sum up the available evidence while also accounting for its quality. This is what European countries have done, and they have all come to roughly the same conclusion: the evidence supporting these treatments isn’t there.

But even at the level of sweeping summaries, America’s conclusions are often distorted. A prime example came in a recent New York Times column by Marci Bowers, a leading gender surgeon and the president of the World Professional Association for Transgender Health (WPATH). Bowers paints a very rosy picture of the evidence base:

“Decades of medical experience and research since has found that when patients are treated for gender dysphoria, their self-esteem grows and their stress, anxiety, substance use and suicidality decrease. In 2018, Cornell University’s Center for the Study of Inequality released a comprehensive literature review finding that gender transition, including hormones and surgery, ‘improves the well-being of transgender people’. Nathaniel Frank, the project’s director, said that ‘a consensus like this is rare in social science’.

“The Cornell review also found that regret… became even less common as surgical quality and social support improved. All procedures in medicine and surgery inspire some percentage of regret. But a study published in 2021 found that fewer than 1% of those who have received gender-affirming surgery say they regret their decision to do so… A separate analysis of a survey of more than 27,000 transgender and gender-diverse adults found that the vast majority of those who detransition from medical affirming treatment said they did so because of external factors (such as family pressure, financial reasons or a loss of access to care), not because they had been misdiagnosed or their gender identities had changed.”

Here we have a leading expert (Bowers) citing a leading institution (Cornell) and relating astonishing claims (what medical procedure has a 1% regret rate?). The case appears to be closed — until you actually click the links and read Bowers’s sources. (Bowers and WPATH did not return emailed interview requests.)

Let’s start with Cornell’s data. According to a summary at its “What We Know Project”:

“We conducted a systematic literature review of all peer-reviewed articles published in English between 1991 and June 2017 that assess the effect of gender transition on transgender well-being. We identified 55 studies that consist of primary research on this topic, of which 51 (93%) found that gender transition improves the overall well-being of transgender people, while 4 (7%) report mixed or null findings. We found no studies concluding that gender transition causes overall harm.”

If you are familiar with systematic literature reviews, you will find the above unusual. Researchers don’t generally ask whether a procedure works or not in such a vague manner, then tally up the results. To usefully gauge the level of evidence, a review has to carefully define its research questions, and factor in the potential biases of the existing studies. The Cornell project does none of this.

I emailed Gordon Guyatt, one of the godfathers of the so-called evidence-based medicine movement, to ask him whether he thought the Cornell project qualified as a systematic literature review. His response was: “It meets criteria for a profoundly flawed systematic review!” When we later spoke, he explained why he didn’t trust it. “Presumably, they are trying to make a causal connection between what the patients received and their outcomes,” he said. “That is not possible unless one has a comparator.” In other words, if you’re only tracking people who received a treatment, and don’t compare their outcomes to another group not receiving the treatment, you simply can’t learn that much. Guyatt offers the example of someone taking hormones and saying afterwards that they feel better. “That does not mean that the hormones have anything to do with your feeling good.” 

This is a very basic, very well-understood problem in both medical and social-scientific research. If all you have is before-and-after measurements of how someone who received a treatment changed over time, there are all sorts of potential confounds, from the placebo effect to regression towards the mean to the possibility that receiving the treatment coincided with some other salutary intervention, such as therapy, that wasn’t accounted for.
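
To see how regression towards the mean alone can produce apparent improvement, consider a minimal simulation sketch. Everything in it is hypothetical: the scores, the enrolment threshold and the function name are illustrative assumptions, not data from any study discussed here. People enrol when a noisy well-being measurement happens to be low, the treatment does nothing at all, and the group's average still rises at follow-up.

```python
import random
import statistics


def simulate_before_after(n_people=10_000, threshold=-1.0, seed=0):
    """Toy before-and-after study with NO real treatment effect.

    All numbers are hypothetical. Each person has a stable underlying
    well-being level; every measurement adds independent random noise.
    """
    rng = random.Random(seed)
    before, after = [], []
    for _ in range(n_people):
        stable = rng.gauss(0, 1)              # long-run well-being level
        baseline = stable + rng.gauss(0, 1)   # noisy measurement at intake
        # Assumption: people seek treatment only when they measure low.
        if baseline < threshold:
            # Follow-up measurement: same stable level, fresh noise,
            # and deliberately zero effect from the treatment.
            followup = stable + rng.gauss(0, 1)
            before.append(baseline)
            after.append(followup)
    return statistics.mean(before), statistics.mean(after), len(before)


if __name__ == "__main__":
    mean_before, mean_after, n_enrolled = simulate_before_after()
    print(f"enrolled: {n_enrolled}")
    print(f"mean score before: {mean_before:.2f}")
    print(f"mean score after:  {mean_after:.2f}")
```

In this toy setup the follow-up average comes out higher than the baseline average even though the treatment effect is set to zero, which is why a study without a comparison group cannot attribute the change to the treatment.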

Because the Cornell team made no effort to assess the risk of bias in the individual studies it included, the final product tells us very little. It’s roughly analogous to coming upon a pile of coins and trying to determine its worth simply by counting how many coins there are, rather than sorting the pile by denomination. When I raised this with Nathaniel Frank, the head of the Cornell project, he said via email that “we don’t publish traditional systematic reviews”, but rather web summaries of important research questions. So the first words of its overview might confuse readers: “We conducted a systematic literature review.”

If Bowers had wanted to cite a carefully conducted, peer-reviewed systematic review of the gender medicine literature, she actually had one at her fingertips: her own organisation, WPATH, funded one a few years ago. The results, published in the Journal of the Endocrine Society in 2021, revealed that there is almost no high-quality evidence in this field of medicine. After they summarised every study they could find that met certain quality criteria, and applied Cochrane guidelines to evaluate their quality, the authors could find only low-strength evidence to support the idea that hormones improve quality of life, depression, and anxiety for trans people. Low means, here, that the authors “have limited confidence that the estimate of effect lies close to the true effect for this outcome. The body of evidence has major or numerous deficiencies (or both).” Meanwhile, there wasn’t enough evidence to render any verdict on the quality of the evidence supporting the idea that hormones reduce the risk of death by suicide, which is an exceptionally common claim.

Oddly, though, the authors of this systematic review conclude by writing that the benefits of these treatments “make hormone therapy an essential component of care that promotes the health and well-being of transgender people”. That claim completely clashes with their substantive findings about the quality of the evidence. So, when Bowers cited the Cornell project, she was citing a review that is of very limited evidentiary value — while also ignoring a much more professionally conducted, and much more pessimistic, though strangely concluded, review that her own organisation paid for.

But what about the study which, she claims, “found that fewer than 1% of those who have received gender-affirming surgery say they regret their decision to do so”? Here’s where things get downright weird.

The study in question, published in 2021 in the journal Plastic and Reconstructive Surgery Global Open, has dozens of errors that its nine authors and editors have refused to correct. Indeed, it appears to have been executed and published to such an unprofessional standard that one might ask why it hasn’t been retracted entirely. 

Before we get into all that, though, it’s worth pointing out that even if it had been competently conducted, the review could not have provided us with a reliable estimate of the regret rate following gender-affirming surgery: the studies it meta-analyses are just too weak. Many of those included did not actually contact people who had undergone surgery to ask them if they regretted it; rather, the authors searched medical records for mentions of regret and/or for other evidence of surgical reversals. Yet this method is inevitably going to underestimate the number of regretters, because plenty of people regret a procedure without going through the trouble of either reversing it or informing the doctor who performed it. In one study of detransitioners — albeit one focusing on a fairly small and non-random online sample — three quarters of them said they did not inform their clinicians that they had detransitioned.

The studies included in this review also failed to follow up with a very large number of patients. The meta-analysis had a total sample size of about 5,600; the largest study, with a sample size of 2,627 — so a little under half the entire sample — had a loss-to-follow-up rate of 36%. If you’re losing track of a third of your patients, you obviously don’t really know how they’re doing and can’t make any strong claims about their regret rates. And yet, the authors don’t mention the loss-to-follow-up issue anywhere in their paper. No version of this meta-analysis, then, was likely to provide a reliable estimate of the regret rate for gender-affirming surgery.
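
A rough worst-case calculation shows why this matters. The sketch below takes the two figures reported above for the largest study (2,627 patients, 36% lost to follow-up) and, purely as an assumption for illustration, supposes that 1% of the patients who were followed expressed regret. It then asks how low or how high the true rate could be, depending on what happened to the patients who were lost. The function name and the 1% input are hypothetical.

```python
def regret_bounds(n_total, loss_rate, observed_rate):
    """Best-case and worst-case regret rates given loss to follow-up.

    observed_rate is the share of FOLLOWED patients who reported regret
    (a hypothetical input here). The lower bound assumes no lost patient
    regrets the procedure; the upper bound assumes every lost patient does.
    """
    n_lost = round(n_total * loss_rate)
    n_followed = n_total - n_lost
    n_regret_observed = round(n_followed * observed_rate)
    lower = n_regret_observed / n_total
    upper = (n_regret_observed + n_lost) / n_total
    return n_lost, n_followed, lower, upper


if __name__ == "__main__":
    # 2,627 and 36% come from the text; 1% is an illustrative assumption.
    n_lost, n_followed, lower, upper = regret_bounds(2627, 0.36, 0.01)
    print(f"lost to follow-up: {n_lost}, followed: {n_followed}")
    print(f"possible regret rate: {lower:.1%} to {upper:.1%}")
```

Under these assumptions the possible range runs from under 1% to more than a third of all patients. Neither extreme is likely to be the real figure, but the data cannot rule out anything in between.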

Even so, the version that was published was particularly disastrous. Independent researcher J.L. Cederblom summed it up: “What are these numbers? These are all wrong… And these weren’t even simple one-off errors — instead different tables disagreed with each other. The metaphor that comes to mind is drunk driving.”

To take one example, the authors initially reported that the aforementioned largest paper in their meta-analysis had a sample size of 4,863. But they misread it — the true figure was actually only 2,627. They also misstated other aspects of that report, such as how regret was investigated (they said it was via questionnaire but it was via medical records search) and the age of the sample (they said it included some juveniles, but it did not).

Not all the errors were significant, but they were remarkably numerous. And because of the abundance of issues, the paper attracted the attention of other researchers. “In light of these numerous issues affecting study quality and data analysis, [the authors’] conclusion that ‘our study has shown a very low percentage of regret in TGNB population after GAS’ is, in our opinion, unsupported and potentially inaccurate,” wrote two critics, Pablo Expósito-Campos and Roberto D’Angelo, in a letter to the editor that the journal subsequently published. In her own letter, the researcher Susan Bewley highlighted what appears to be an absence of vital information about the authors’ method of putting together the meta-analysis. 

The authors and the editors decided to simply not correct any of this. They did publish an erratum, in which they republished seven tables that still contained errors, while maintaining that all those errors had no impact on the paper’s takeaway findings. But the paper itself remains published, in its original form, complete with those 2,200 ghost-patients inflating the sample size.

Bewley and Cederblom have continued to ask the journal to reveal the process that led to the paper getting published, and to address why so many of the errors remain uncorrected. In an email in January to Bewley, Aaron Weinstein, its editorial director, claimed that because critical letters to the editor had been published, and because the corrected data was reanalysed by a statistical expert, “the Publisher and the ASPS [American Society of Plastic Surgeons] feel that PRS Global Open has done due diligence on this article and this case is closed”. He also claimed, curiously, that he had no power to force the authors to address the many serious remaining questions raised by the paper’s critics, saying “there is no precedent for an editorial office to do so”. Neither Weinstein nor the paper’s corresponding author, Oscar Manrique, responded to my emailed requests for comments.

Finally, there is Bowers’s claim that “a separate analysis of a survey of more than 27,000 transgender and gender-diverse adults found that the vast majority of those who detransition from medical affirming treatment said they did so because of external factors”. This is technically true, but is also rather misleading because the survey in question — the 2015 United States Transgender Survey (which has profound sampling issues) — was of currently transgender people. It says so in the first sentence of the executive summary. Research based on this survey obviously can’t provide us with any reliable information about why people detransition, because it is not a survey of detransitioners. If you want to know how often people detransition, you need to follow large groups of trans people over time and check in to see if they still identify that way later on — and we don’t have high-quality research on that front.

It’s also worth bearing in mind that the vast majority of studies being discussed here concern adults, while the legislative discussion mostly centres on adolescents. The most recent version of WPATH’s Standards of Care is very open about the lack of evidence when it comes to the latter: “Despite the slowly growing body of evidence supporting the effectiveness of early medical intervention, the number of studies is still low, and there are few outcome studies that follow youth into adulthood. Therefore, a systematic review regarding outcomes of treatment in adolescents is not possible.” Again, WPATH is Bowers’s own organisation — surely she is familiar with its output?

Despite the backbreaking errors of that nine-authored paper, the severe limitations of the Cornell review, and the near-utter-irrelevance of the United States Transgender Survey, all three are chronically trotted out as evidence that we know transgender medicine is profoundly helpful, or that detransition or regret are rare — or both. It’s frustrating enough that these lacklustre arguments are constantly made on social media, where all too many people get their scientific information. But what’s worse is that many journalists have perpetuated this sad state of affairs. A cursory Google search will reveal that these three works have been treated as solid evidence by the Associated Press, Slate, Slate again, The Daily Beast, Scientific American and other outlets. The NYT, meanwhile, further publicised Cornell’s half-baked systematic review by giving Nathaniel Frank a whole column to tout its misleading findings back in 2018.

Why does such low-quality work slip through? The answer is straightforward: because it appears, if you don’t read it too closely, or if you are unfamiliar with the basic concepts of evidence-based medicine, to support the liberal view that these treatments are wonderful and shouldn’t be questioned, let alone banned. That’s enough for most people, who are less concerned with whether what they are sharing is accurate than whether it can help with ongoing, high-stakes political fights. 

But you’re not being a good ally to trans people if you disseminate shoddy evidence about medicine they might seek. Whatever happens in the red states seeking to ban these treatments, transgender people need to make difficult healthcare choices, many of which can be ruinously expensive. And yet, if you call for the same standards to be applied to gender medicine that are applied to antidepressants, you’ll likely be told you don’t care about trans people.

As Gordon Guyatt, who has done an enormous amount to increase the evidentiary standards of the medical establishment, told me: “You’re doing harm to transgender people if you don’t question the evidence. I believe that people making any health decisions should know about what the best evidence is, and what the quality of evidence is. So by pretending things are not the way they are — I don’t see how you’re not harming people.”