The mad dash accelerated as quickly as the pandemic. Researchers sprinted to see whether artificial intelligence could unravel Covid-19’s many secrets — and for good reason. There was a shortage of tests and treatments for a skyrocketing number of patients. Maybe AI could detect the illness earlier on lung images, and predict which patients were most likely to become severely ill.
Hundreds of studies flooded onto preprint servers and into medical journals claiming to demonstrate AI’s ability to perform those tasks with high accuracy. It wasn’t until many months later that a research team from the University of Cambridge in England began examining the models — more than 400 in total — and reached a much different conclusion: Every single one was fatally flawed.
“It was a real eye-opener and quite surprising how many methodological flaws there have been,” said Ian Selby, a radiologist and member of the research team. The review found the algorithms were often trained on small, single-origin data samples with limited diversity; some even reused the same data for training and testing, a cardinal sin that can lead to misleadingly impressive performance. Selby, a believer in AI’s long-term potential, said the pervasiveness of errors and ambiguities makes it hard to have faith in published claims.
“You end up with this quite polluted area of research,” he said. “You read a lot of papers and your natural instinct is not to want to trust them.”
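To see why reusing the same data for training and testing is so misleading, consider a minimal sketch in Python. It is an illustration only, not a reconstruction of any reviewed study: the data are synthetic, the labels are coin flips, and the "model" is a simple nearest-neighbor lookup chosen because it memorizes its training set.

```python
import random

def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-nearest-neighbor)."""
    closest = min(train, key=lambda row: abs(row[0] - x))
    return closest[1]

def accuracy(train, test):
    """Fraction of test rows whose label the model predicts correctly."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)
# Synthetic data: one random feature and a label that is pure coin-flip noise,
# so no model can genuinely beat roughly 50% on patients it has never seen.
rows = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, held_out = rows[:200], rows[200:]

leaky_accuracy = accuracy(train, train)      # scored on the data it was trained on
honest_accuracy = accuracy(train, held_out)  # scored on unseen data

print(f"same data for training and testing: {leaky_accuracy:.2f}")
print(f"held-out test data:                 {honest_accuracy:.2f}")
```

Scored on its own training data, the model looks perfect because every case finds itself as its nearest neighbor. Scored on held-out cases, it falls back to roughly chance, which is all the noise labels allow.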
The problems are not limited to Covid-19 research. Machine learning, a subset of AI driving billions of dollars of investment in the field of medicine, is facing a credibility crisis. An ever-growing list of papers rely on limited or low-quality data, fail to specify their training approach and statistical methods, and don’t test whether they will work for people of different races, genders, ages, and geographies.
These shortcomings arise from an array of systemic challenges in machine learning research. Intense competition results in tighter publishing deadlines, and heavily cited preprint articles may not always undergo rigorous peer review. In some cases, as was the situation with Covid-19 models, the demand for speedy solutions may also limit the rigor of the experiments.
By far the biggest problem, and the trickiest to solve, points to machine learning’s Catch-22: There are few large, diverse data sets to train and validate a new tool on, and many of those that do exist are kept confidential for legal or business reasons. That means outside researchers have no data to turn to in order to test a paper’s claims or compare it to similar work, a key step in vetting any scientific research.
The failure to test AI models on data from different sources — a process known as external validation — is common in studies published on preprint servers and in leading medical journals. It often results in an algorithm that looks highly accurate in a study, but fails to perform at the same level when exposed to the variables of the real world, such as different types of patients or imaging scans obtained with different devices.
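The gap between internal and external performance can be sketched with a toy simulation. Everything here is an assumption made for illustration: each patient is reduced to a single image-derived score, `device_offset` is a made-up stand-in for scanner differences between hospitals, and the "model" is just a cutoff learned at one site.

```python
import random

def make_site(rng, n_per_class, device_offset):
    """Simulate one hospital as (score, label) pairs; label 1 means diseased.

    device_offset is a hypothetical stand-in for scanner-specific differences.
    """
    healthy = [(rng.gauss(0.0, 1.0) + device_offset, 0) for _ in range(n_per_class)]
    diseased = [(rng.gauss(3.0, 1.0) + device_offset, 1) for _ in range(n_per_class)]
    return healthy + diseased

def fit_threshold(rows):
    """'Train' a model: put the cutoff midway between the two class means."""
    healthy = [score for score, label in rows if label == 0]
    diseased = [score for score, label in rows if label == 1]
    return (sum(healthy) / len(healthy) + sum(diseased) / len(diseased)) / 2

def accuracy(threshold, rows):
    """Fraction of rows classified correctly by the rule 'score > threshold'."""
    return sum((score > threshold) == bool(label) for score, label in rows) / len(rows)

rng = random.Random(42)
site_a_train = make_site(rng, 500, device_offset=0.0)
site_a_test = make_site(rng, 500, device_offset=0.0)  # same hospital, unseen patients
site_b_test = make_site(rng, 500, device_offset=3.0)  # different hospital and scanner

threshold = fit_threshold(site_a_train)
internal_accuracy = accuracy(threshold, site_a_test)
external_accuracy = accuracy(threshold, site_b_test)

print(f"internal validation (site A): {internal_accuracy:.2f}")
print(f"external validation (site B): {external_accuracy:.2f}")
```

In this sketch the cutoff works well on new patients from the original hospital but misclassifies most healthy patients at the second one, because the scanner shift pushes their scores past a threshold tuned elsewhere. Only the test on site B data reveals that.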
“If the performance results are not reproduced in clinical care to the standard that was used during [a study], then we risk approving algorithms that we can’t trust,” said Matthew McDermott, a researcher at the Massachusetts Institute of Technology who co-authored a recent paper on these problems. “They may actually end up worsening patient care.”
This may already be happening with a wide array of products used to help treat serious illnesses such as heart disease and cancer. A recent STAT investigation found that only 73 of 161 AI products approved by the federal Food and Drug Administration publicly disclosed the amount of data used to validate the product, and just seven reported the racial makeup of their study populations. Even the sources of the data were almost never given.
Those findings were echoed in a paper by Stanford researchers who highlighted the lack of prospective studies, or studies that examine future outcomes, conducted on even higher-risk AI products cleared by the FDA. They also noted that most AI devices were evaluated at a small number of sites and that only a tiny fraction reported how the AI performed in different demographic groups.
“We would like the AI to work responsibly and reliably for different patients in different hospitals,” said James Zou, a professor of biomedical data science at Stanford and co-author of the paper. “So it’s especially important to be able to evaluate and test the algorithm across these diverse kinds of data.”
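Reporting performance separately for each site or demographic group is simple to do once the results are labeled by group. The sketch below uses invented records and hypothetical group names, purely to show how a respectable overall figure can hide a group in which the model fails.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return {group: accuracy} for (group, truth, prediction) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in records:
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical evaluation results: (group, true label, model prediction).
records = (
    [("hospital_1", 1, 1)] * 3 + [("hospital_1", 0, 0)] * 3
    + [("hospital_2", 1, 1)] * 1 + [("hospital_2", 1, 0)] * 3
)

overall_accuracy = sum(truth == pred for _, truth, pred in records) / len(records)
by_group = accuracy_by_group(records)

print(f"overall: {overall_accuracy:.2f}")
for group, value in by_group.items():
    print(f"{group}: {value:.2f}")
```

Here the pooled accuracy is 70%, but the breakdown shows the model is right every time at one hospital and only a quarter of the time at the other.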
The review conducted by the University of Cambridge found that many studies not only lacked external validation, but also neglected to specify the data sources used or details on how their AI models were trained. All but 62 of the more than 400 papers failed to pass an initial quality screening based on those omissions and other lapses.
Even those that survived the initial screening suffered from multiple shortcomings: 55 of those 62 papers were found to be at high risk of bias due to a variety of problems, including reliance on public datasets where many images suspected to represent Covid-19 are not confirmed to be positive cases. A few AI models trained to diagnose adult Covid-19 cases on chest X-rays were tested on images of pediatric patients with pneumonia.
“The [pediatric images] were often of children below the age of 5, who have massive anatomical differences compared to adults, so it is absolutely no surprise that these models had really good results in picking out Covid versus non-Covid,” said Selby. “The patients looked completely different on the chest X-ray regardless of Covid status.”
The researchers found significant flaws with papers published on preprint servers as well as those published in journals that impose more scrutiny through peer review. The peer-review process can fail for a variety of reasons, including reviewers who lack deep knowledge of machine learning methodology or who are biased toward prominent institutions or companies, which results in superficial reviews of their papers. A larger problem is a lack of consensus standards for evaluating machine learning research in medicine, although that is beginning to change. The University of Cambridge researchers used a methodology checklist known as CLAIM, which establishes a common set of criteria for authors and reviewers.
“We tried in our paper to point out the necessity of the checklists,” Selby said. “It makes people question, ‘Have we addressed this issue? Have we thought about that?’ They may realize themselves that they could build a better model with a bit more thought and time.”
Among the papers that Selby and his colleagues found to present a high risk of bias was one published in Nature from researchers at Icahn School of Medicine at Mount Sinai in New York.
The paper found that an AI model for diagnosing Covid-19 on chest CT scans performed well on a common accuracy measure, with an area under the curve of 0.92, and equaled the performance of a senior thoracic radiologist. A press release accompanying the paper’s publication said the tool “could help hospitals across the world quickly detect the virus, isolate patients, and prevent it from spreading during this pandemic.”
But the University of Cambridge researchers flagged the paper for a high risk of bias due to its small sample size of 424 Covid-positive patients spread across datasets used to train, tune, and test the AI. The data were obtained from 18 medical centers in China, but it was unclear which centers provided the data on the positive and negative cases. That raises the possibility that the AI could simply be detecting differences in scanning methods and equipment, rather than in the physiology of the patients. The Cambridge researchers also noted that performance was not tested on an independent dataset to verify its ability to reliably recognize the illness in different groups of patients.
The paper did acknowledge the study’s small sample size and the need for additional data to test the AI in different patient populations, but the research team did not respond to a request for additional comment.
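For readers unfamiliar with the measure, area under the curve can be read as the chance that a model gives a randomly chosen positive case a higher score than a randomly chosen negative one. The sketch below computes it that way by direct pairwise comparison. The scores are invented for illustration and have nothing to do with the Mount Sinai model.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. This pairwise form is equivalent to the area
    under the ROC curve, but it is quadratic in the number of cases.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical model scores for three Covid-positive and three negative scans.
covid_scores = [0.9, 0.8, 0.4]
non_covid_scores = [0.7, 0.3, 0.2]

example_auc = auc(covid_scores, non_covid_scores)
print(f"AUC: {example_auc:.3f}")
```

In this example eight of the nine positive-negative pairs are ranked correctly. A high value says only that the model separates the cases it was tested on, which is why the composition of the test set matters so much.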
Time constraints may explain, if not excuse, some of the problems found with AI models developed for Covid-19. But similar methodological flaws are common in a wide swath of machine learning research. Pointing out these lapses has become its own subgenre of medical research, with many papers and editorials calling for better evaluation models and urging researchers to be more transparent about their methods.
The inability to replicate findings is especially problematic, eroding trust in AI and undermining efforts to deploy it in clinical care.
A recent review of 511 machine learning studies across multiple fields found that the ones produced in health care were particularly hard to replicate, because the underlying code and datasets were seldom disclosed. The review, conducted by MIT researchers, found that only about 23% of machine learning studies in health care used multiple datasets to establish their results, compared to 80% in the adjacent field of computer vision, and 58% in natural language processing.
It is an understandable gap, given the privacy restrictions in health care and the difficulty of accessing data that spans multiple institutions. But it nonetheless makes it more difficult for AI developers in health care to obtain enough data to develop meaningful models in the first place, and makes it even harder for them to publicly disclose their sources so findings can be replicated.
Google recently announced an app that uses AI to analyze skin conditions, but declined to publicly disclose the sources of data used to create the model. A spokesperson explained that some of the datasets are licensed from third parties or donated by users, and that the company could not publish the data under the terms of its agreements.
McDermott, the MIT researcher, said these structural barriers must be overcome to ensure that the effects of these tools can be fully evaluated and understood. He noted a number of ways to share data without undermining privacy or intellectual property, such as use of a federated learning method in which institutions can jointly develop models without exchanging their data. Others are also using synthetic data — or data modeled on real patients — to help preserve privacy.
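One common form of federated learning is federated averaging, in which each institution trains on its own records and sends back only model parameters for a central server to combine. The sketch below is a deliberately tiny illustration of that idea, not a description of any system McDermott referred to: the hospitals, their data, and the one-parameter model are all hypothetical.

```python
def local_update(weight, site_data, learning_rate=0.05, steps=5):
    """Run a few gradient-descent steps on one site's private (x, y) pairs.

    Only the updated weight and the sample count leave the site.
    """
    for _ in range(steps):
        gradient = sum(2 * (weight * x - y) * x for x, y in site_data) / len(site_data)
        weight -= learning_rate * gradient
    return weight, len(site_data)

def federated_average(updates):
    """Combine site weights, weighting each by its number of samples."""
    total = sum(count for _, count in updates)
    return sum(weight * count for weight, count in updates) / total

# Hypothetical private datasets; every site happens to follow y = 2 * x.
sites = {
    "hospital_a": [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    "hospital_b": [(0.5, 1.0), (1.5, 3.0)],
    "hospital_c": [(2.0, 4.0), (2.5, 5.0), (3.0, 6.0), (3.5, 7.0)],
}

global_weight = 0.0
for _ in range(20):  # communication rounds between the server and the sites
    updates = [local_update(global_weight, data) for data in sites.values()]
    global_weight = federated_average(updates)

print(f"shared model weight after training: {global_weight:.4f}")
```

The shared weight converges toward 2, the relationship present in every site's data, even though the server never sees a single patient record. Real deployments add safeguards this sketch omits, since model updates themselves can leak information.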
McDermott said careful scrutiny of machine learning tools, and the data used to train them, is particularly important because they are making correlations that are hard, if not impossible, for humans to independently verify.
It is also important to consider the time-locked nature of AI models when they are evaluated. A model trained on one set of data that is then deployed in an ever-changing world is not guaranteed to work in the same way. The effects of diseases on patients can change, and so can the methods of treating them.
“We should inherently be more skeptical of any claims of long-term generalizability and stability of the results over time,” McDermott said. “A static regulatory paradigm where we say, ‘OK, this algorithm gets a stamp of approval and now you can go do what you want with it forever and ever’ — that feels dangerous to me.”