When scientists and regulators need clear answers to health risks—such as whether Tylenol causes autism (it doesn’t)—they typically turn to systematic reviews, widely regarded as medicine’s gold standard of evidence. Many aspects of our lives are governed by findings from these reviews: from the drugs prescribed by physicians, to vaccine mandates, to environmental policy. The problem is that conducting reviews is notoriously slow and labor-intensive. New artificial intelligence (AI) tools are poised to substantially accelerate this process, transforming how quickly scientific evidence filters into society. Handled responsibly, this could improve care and save lives—but only if it preserves the standards that make these reviews trustworthy.
Systematic reviews are a method of answering a scientific question by collecting and evaluating all the relevant studies. They’re called “systematic” because (in contrast to the single-author, state-of-the-field summaries they were designed to replace) there are strict standards for how to conduct them and how to report what the team finds. In principle, this makes the process transparent and reproducible: every step—from how studies are searched for and selected, to how their quality is assessed and their results synthesized—is clearly documented. This makes it harder to cherry-pick studies to support a preferred conclusion, and easier to identify bias.
In practice, this is an arduous task. Once the research team has settled on a precise question, a typical review requires two or three researchers to screen tens of thousands of papers by reading their titles and abstracts. Finding the relevant studies requires special expertise: reviewers must know where and how to search so that important findings are not missed. Papers that look promising are then set aside for reviewers to read in full and decide whether each should be included. Data from the selected papers are methodically extracted according to a plan determined beforehand, and the evidence is then synthesized.
In medicine, this process typically takes between ten and fourteen months, but it sometimes takes much longer: one reviewer told me she was helping with one that had been four years in the making. These delays come with costs: in the early stages of the Covid-19 pandemic, even accelerated reviews were frequently outdated by the time they were published, meaning that clinicians and policymakers were often making decisions based on evidence syntheses that lagged behind the rapidly changing science.
The problem is getting worse as the number of published studies increases; there are more studies to screen, assess, and analyze. Modern science has thus produced a kind of paradox: the more research exists on a topic, the harder it becomes to say what the research as a whole implies.
“How we do systematic reviews needs to change,” Ella Flemyng, Head of Editorial Policy and Research Integrity at Cochrane—one of the world’s leading evidence-synthesis organizations—told me. “It’s not sustainable going forward.” That’s where AI comes in.
The most time-consuming part of a review—screening many thousands of abstracts—is a prime candidate for automation. It is also the sort of task that AI models are, at least in principle, well suited to.
A fairly narrow kind of AI is already baked into the software reviewers use. Some tools, for instance, help reviewers by showing them the most likely relevant titles and abstracts first, and pushing the rest further down the stack. But the system doesn’t decide which studies are potentially relevant; it only sets the order of what the reviewer sees. The decision—and the reasoning behind it—still rests with the reviewer.
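For the curious, here is a minimal sketch of what that kind of relevance ranking can look like, written in Python with a generic text classifier. The abstracts, labels, and model choice are invented for illustration; no specific screening product necessarily works this way.

```python
# A minimal sketch of relevance ranking for abstract screening, assuming a
# reviewer has already labeled a handful of abstracts. All data here is
# invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Abstracts the reviewer has already screened (1 = relevant, 0 = not relevant)
labeled_abstracts = [
    "Randomized trial of acupuncture for chemotherapy-induced nausea",
    "Acupressure wristbands and nausea in cancer patients",
    "Soil microbiome diversity in temperate forests",
    "Survey of hospital staffing levels in rural clinics",
]
labels = [1, 1, 0, 0]

# Abstracts still waiting to be screened
unscreened = [
    "Pilot study of acupuncture for postoperative nausea",
    "Machine learning for crop yield prediction",
]

# Fit a simple text classifier on the already-screened examples
vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_abstracts)
classifier = LogisticRegression().fit(X_labeled, labels)

# Score the unscreened abstracts and show the most likely relevant ones first;
# the reviewer still makes every inclusion decision.
scores = classifier.predict_proba(vectorizer.transform(unscreened))[:, 1]
for score, abstract in sorted(zip(scores, unscreened), reverse=True):
    print(f"{score:.2f}  {abstract}")
```

The key design point is that the model only reorders the queue of papers; it never removes anything from it.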
A new wave of tools, based on generative AI, aims to go beyond sorting papers and to automate various stages of the reviewing process. Some products, such as Elicit and SciSpace, feel like the chatbots we are so accustomed to: users can type a question, and the system returns a summary of the research (with sources). Effectively, these tools are trying to handle all aspects of the review—the search, inclusion, and synthesis. Others, like Nested Knowledge, are more constrained, and look more like the specialized software reviewers already trust, just with AI features layered in. In both cases, the promise is that work that currently takes months could soon be done in minutes or hours.
Now, a process typically filled with red tape feels like a scientific wild west. Generative AI-based tools are being heavily marketed, while strict guidelines for how to integrate them into the review pipeline have lagged behind. “Everything is moving very, very fast,” said Kristen Scotti, STEM Librarian at Carnegie Mellon. “A lot of the recommendations are not out yet, so people are just kind of flopping around.”
An increasing number of reviews are being conducted with these new tools. So far, these haven’t been published in the most prestigious journals, where they are likely to make the most impact, partly because there have been no widely accepted standards for what responsible AI use looks like.
This has begun to change. In November 2025, the world’s four major evidence synthesis organizations—Cochrane, the Campbell Collaboration, JBI, and the Collaboration for Environmental Evidence—published a position statement called the Responsible Use of AI in Evidence Synthesis (RAISE) to help guide the use of AI in reviews. The tone is cautious, reminding reviewers that they remain responsible for the output and that any tool has to be rigorously tested before it becomes part of a review. Flemyng, who is also an author of the statement, told me that “a lot of these [AI] uses in reviews are still exploratory. Or if you’re using them, you need to validate their use in the specific review before you actually use them. We don’t have the evidence base for a blanket roll-out for any of these tools.”
The RAISE recommendations fall short of the kinds of procedural detail that reviewers are used to. “That guidance is very general: it tells people what to do, without telling them how to do it,” Farhad Shokraneh, a researcher specializing in systematic reviews at the universities of Oxford and Bristol, told me.
Still, this statement offers a green light for integrating AI in reviewing workflows. With the backing of the leading synthesis organizations, we are likely to soon see many more—and better—reviews being published that use these tools.
The benefits will not just be (much) faster reviews. Evidence could also be updated more quickly as new studies come in, turning today’s slow, static reviews into something closer to living, continually refreshed summaries. Already, some funders, like the London-based health research charity Wellcome Trust, are dreaming of “real-time aggregation of scientific data.”
Researchers may also be able to include studies in more languages. This is a hurdle for contemporary reviews, said Margaret Foster, Director of Evidence Synthesis Services at the Medical Sciences Library at Texas A&M University. “I’ll have students who want to look at acupuncture for nausea caused by chemotherapy,” she said. “And we know there’s a lot of research coming out of China. It would be great if we could access that research.”
In an optimistic scenario, anyone could run reliable systematic reviews on their computer by simply asking the model a question and letting it filter through the literature and summarize it succinctly. But the veteran reviewers I interviewed unanimously agreed: we’re far from that point. Many are skeptical we’ll ever get there.
For starters, many of the AI-based tools are not reproducible, lacking one of the fundamental characteristics of science. Multiple studies have found that prompting these models with the same query—say, “what is the effect of acupuncture on chemotherapy-induced nausea?”—at different times may result in the model selecting different studies and producing different results. Even small changes in the phrasing of the prompt can likewise yield dramatically different outputs.
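To make the concern concrete, here is a hedged sketch of how a reviewer might probe this for themselves: send the same screening question to a model several times and compare the answers. The OpenAI client and model name are assumptions chosen for illustration; the same check applies to any generative tool.

```python
# A sketch of a simple reproducibility check: ask a language model the same
# screening question several times and compare which studies it selects.
# The client library and model name below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

PROMPT = (
    "Which of these studies are relevant to the question 'What is the effect "
    "of acupuncture on chemotherapy-induced nausea?' Reply with study IDs only.\n"
    "1. Acupuncture vs sham for nausea in breast cancer patients\n"
    "2. Acupuncture for chronic lower back pain\n"
    "3. Antiemetic drugs during chemotherapy"
)

selections = []
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    selections.append(response.choices[0].message.content.strip())

# If the answers differ across runs, the selection step is not reproducible.
print(set(selections))
```

If the returned study lists vary from run to run, the output cannot serve as a reproducible record of how the evidence was selected.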
These systems are black boxes: the procedure that generated an output can be difficult (and sometimes practically impossible) to trace. Together, a lack of reproducibility and transparency is a problem for evidence-based policy. It could mean recommending a cancer treatment based on a review that no one can independently verify or fully understand.
Another fundamental problem is that “AI gives you a false sense that everything has been searched,” Foster told me. “But these AI tools don’t have access to all the databases.” They are often trained on only freely available scientific papers, which represent only a slice of the scientific literature. Having access to the best databases is essential for an accurate picture of the evidence, but it is also prohibitively expensive.
These concerns are even more pressing when public information itself is in flux. The Trump administration has been removing datasets and studies from the national libraries. If AI-based tools are trained on or search uneven scientific records, then we lose the original motivation for systematic reviews: synthesizing all the relevant evidence on a question.
There are also equity issues. Some tools are only available in certain countries, or behind paywalls, raising the prospect of a widening gap between high-income countries that can afford the newest systems and manage the largest databases, and low- and middle-income countries that cannot.
Implementing AI tools in reviews without addressing these risks could be disastrous, mixing an illusion of scientific credibility with poor-quality outputs. In turn, this could further erode public trust in scientists and scientific inquiry—including confidence in the safety of vaccine recommendations—where it’s already disturbingly low.
But there are ways (at least in theory) of mitigating the risks, even if doing so will require substantial work: the AI tools could be built to support a high level of reproducibility and transparency, and broader pushes for open databases could help ensure that everyone has access to the full evidential record.
I asked Shokraneh whether he thinks these tools will eventually replace him as a reviewer. “I hope so,” he said. “I may end up in the street because I will have no job. But that’s okay if that means my mom will live ten years longer.”
I hope for a future where good evidence is produced faster and more people have access to it—where I can ask a detailed question about health or economics to an app on my phone and trust that it returns a scientifically sound synthesis of the relevant available research. But we are not there yet. And until scientists have rigorously evaluated current tools, placing too much trust in their output isn’t a good idea.
Still, when I heard that Robert F. Kennedy Jr.—the nation’s Health and Human Services Secretary—claimed that the statement “Vaccines do not cause autism” is “not supported by science,” I decided to check with Elicit. In a long, scientific-looking report, it told me that studies “consistently found no association between vaccination and autism spectrum disorders.” I rephrased the question multiple times and the answer remained the same. Days later, on January 5, 2026, the federal childhood immunization schedule was revised, narrowing routine recommendations for several vaccines.
AI evidence synthesis tools still have many limitations. But maybe some policymakers could already benefit from using them.







