During my undergraduate days, the end of an essay signalled not relief, but the dreaded ritual of writing out the references. Each bracket, colon, and page number had to be meticulously transcribed directly from the article or a borrowed photocopy (in the days when students could get free copies).
Moving into post-graduate studies brought the luxury of a shared departmental computer. The clunky word processor meant I was no longer writing citations by hand, but I was still typing them. The potential to transpose digits or misspell an author’s name was high. However, any potential damage was contained. If a reference was wrong, it was usually a typo, unlikely to infect the literature as a whole, which was more of a disconnected body of knowledge linked through medical library collections and Index Medicus [1]. Equally, we tended to keep the number of citations to a minimum (to avoid all the typing). Journal referees reviewing your submission would be just as aware of the background literature as you, and grateful to you for identifying any new source they may have missed. So, woe betide you if you made one up.
Then came CD-ROMs, then online databases, then PC Reference Managers. We rejoiced. We could suddenly hold details of 1,000 manuscripts in our own database and import reference details into manuscripts (I have to admit that I never managed to do this successfully). The use of references in our manuscripts expanded [2][3][4]. But in that explosion of volume, the potential for error propagation exploded too. It became more challenging for a journal referee to verify every single reference, particularly in review articles that may cite over 50 articles. Citing ‘secondary sources’ (papers you haven't read but saw cited elsewhere, or whose abstract you merely scanned) entered common practice and served as a major vector for error propagation [5]. The data showed that even in 2015 the "total quotation error rate" hovered around 25.4%; effectively, one in four citations in medical journals had some form of error [5]. Most were minor, but nearly 12% were classified as "major errors," meaning the citation did not support the claim it was attached to. It was a wake-up call, but the potential impact felt small.
Nothing, absolutely nothing, prepared us for the coming tsunami.
The Lancet Verdict: A 12-Fold Surge
Fast forward to the present day, and the quaint ‘typo’ has evolved into what could reasonably be considered an existential threat to the evidence base. In a recent audit published in The Lancet, researchers reported on their analysis of 97.1 million references across 2.5 million papers. The results are shocking: fabricated citations have grown more than 12-fold since 2023 [6][7]. While the background rate of citation errors appears to have been steady for years, the curve hockey-sticks upward from mid-2024 onward. Perhaps we shouldn’t be surprised that this coincides with the mainstream adoption of generative AI writing tools like ChatGPT [7]. But this isn't an explosion in typos. This is wholesale hallucination. The study found that some papers had significant error rates; in one case, 18 of the 30 references listed were completely fabricated (we assume by LLMs) [6][7].
The concern is direct. Clinical guidelines and systematic reviews assume that the evidence cited in primary manuscripts is real. When an author uses AI to ‘find’ supporting literature for a new drug protocol or a guideline, and that paper doesn't exist, there is the potential for errors to pass directly into patient care. You would hope that authors preparing guidelines would conduct robust literature searches and validate any work that they cite (as per instructions provided in our Insider’s Insight [8]). However, in some cases these could be the same authors who failed to conduct appropriate checks when writing primary manuscripts.
…it gets worse
To compound the irony, the journals that publish this information (and stand as the bastions of science) take almost no responsibility. There are no definitive prepublication reference checks, and currently over 98% of papers identified with these hallucinated references remain uncorrected by publishers [7].
The most logical response would be to stop relying entirely on machines. Publishers should pay reviewers appropriately for the time-consuming task of acting as a referee, specifically for checking the validity of all references cited [9]. But, adding insult to injury, the advent of this pandemic of scientific hallucination occurred at the same time as a shortage of scientists willing and qualified to serve as referees, along with an increasing number of manuscripts requiring review [10][11]. The perfect storm.
Perhaps equally disturbing, but less well publicised than the Lancet observations, was the recent work describing the screening, sorting, and feedback cycles that imperil peer review [12]. The authors developed formal models describing feedback loops that can destabilise peer review. Their analysis argues that increasing submission volume is overloading shrinking reviewer pools, which reduces review quality and encourages even more speculative submissions, further amplifying the problem. Their formulation is notable because it frames peer review as a systems problem rather than merely an administrative inconvenience. Summarising their proposition:
- increasing submissions “overtax the pool of suitable peer reviewers,”
- forcing journals either to recruit less qualified reviewers or overburden existing reviewers,
- thereby reducing review accuracy and threatening the stability of the system.
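To make the loop concrete, here is a deliberately crude toy simulation. It is not the model from [12]; every name, parameter, and number below is invented purely for illustration. It assumes that review accuracy falls once demand for reviews exceeds reviewer capacity, that falling accuracy invites extra speculative submissions, and that overloaded reviewers gradually drop out.

```python
def simulate_review_load(steps=10, submissions=1000.0, reviewers=600.0,
                         reviews_per_paper=2, capacity_per_reviewer=4.0,
                         base_growth=0.05, speculative_gain=0.10,
                         burnout=0.05):
    """Toy feedback loop; all parameters are hypothetical, not from [12]."""
    history = []
    for _ in range(steps):
        demand = submissions * reviews_per_paper      # reviews needed this round
        capacity = reviewers * capacity_per_reviewer  # reviews the pool can deliver
        load = demand / capacity
        # Assumption: accuracy is perfect until the pool is overloaded,
        # then degrades in proportion to the overload.
        accuracy = min(1.0, 1.0 / load)
        history.append({"submissions": submissions,
                        "reviewers": reviewers,
                        "load": load,
                        "accuracy": accuracy})
        # Weaker screening encourages more speculative submissions.
        submissions *= 1.0 + base_growth + speculative_gain * (1.0 - accuracy)
        # Overburdened reviewers leave the pool.
        reviewers *= 1.0 - burnout * max(0.0, load - 1.0)
    return history


if __name__ == "__main__":
    for year, state in enumerate(simulate_review_load()):
        print(year, round(state["load"], 2), round(state["accuracy"], 2))
```

Under these made-up settings the system looks healthy for the first few rounds, then load passes capacity and each round makes the next one worse. The numbers mean nothing; the shape of the curve is the point.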
The toxic combination of drivers for our perfect storm is summarised in the table below. It’s so simple you can almost hear Homer Simpson saying, “doh!” However, the knee-jerk solution seems to be for publishers to ‘fight AI with AI’: using new algorithms to check the hallucinations of old algorithms. It is like asking a liar to check the homework of another liar. The most logical approach, treating scientists as partners in the scientific endeavour and properly recognising their contribution (to what is a multi-billion dollar Ponzi scheme [13]), would affect the bottom line of publishers. Better to just adopt the cheap route of automated checks and damn the future.
| Proposed driver | Mechanism |
|---|---|
| Explosion in manuscript submissions | More review requests per scientist |
| Publish-or-perish incentives | Encourages quantity over selectivity |
| Growth of journals and special issues | Expands review demand |
| Administrative overload on academics | Less time for unpaid reviewing |
| Lack of reviewer incentives | Reviewing competes with rewarded activities |
| Increasing specialization | Harder to find suitably expert reviewers |
| AI-assisted manuscript production | Lowers friction for manuscript generation |
| Reviewer fatigue | Repeated invitations to the same experts |
The Lazy Creature Hypothesis
Revisiting the curve describing the seemingly exponential increase in fabricated citations in the Lancet article, you have to wonder: is it tailing off? Large Language Models (LLMs) are getting marginally better at suppressing their ‘creativity.’ Will hallucinations continue to be a problem now that researchers know to validate any list of suggested references? Could the Lancet’s observation hide a deeper threat to the future of science: the widespread use of LLMs by authors to provide the supporting literature for their hypotheses, rather than assimilating the science captured within the literature and using it to formulate their own understanding?
We are all lazy creatures at heart, and LLMs are the ultimate temptation. Asking a chatbot to "find me five sources that support my theory" is frictionless. Actually reading a body of literature to identify five relevant sources, and appreciating what the data therein mean, is work. This is unsustainable. It’s like watching a hippo trying to edge along a thin branch to grab an apple: eventually something is going to break. In this case, it is the trust in the medical record and the ability of scientists to conduct research with integrity.
The concern is no longer merely operational (“editors are annoyed”) but epistemic: whether the scientific literature can continue to function reliably as its own quality-control system under current growth conditions. We are already seeing the increasing discussion of downstream effects on the functioning of science itself:
- delayed dissemination of valid findings,
- weaker filtering of flawed studies,
- increased publication of low-quality work,
- erosion of trust in peer review,
- and distortion of scientific incentives.
A Call for Professional Referees
The recommendations from the Lancet team are insightful but perhaps naive. They suggest publishers verify references pre-publication, indexing services add metadata on "citation validity," and databases track fabricated citations [6]. The problem is not dissimilar to that experienced in MedComms some decades ago. There, professional medical writers are expected to submit all source materials with supporting evidence marked up for confirmation by the client. This represents a simple ‘human-in-the-loop’ fix, but one journals refuse to face. The additional burden on reviewers would require a reward greater than the pathetic ‘recognition’ currently offered. The peer review system is in crisis; when submitting our work to journals we are regularly required to provide the details of a host of potential referees, and it takes over 35 invites to get two reviewers [9]. We simply cannot expect volunteers drowning in their own work to fact-check 50 references for free.
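A back-of-envelope sketch shows what those figures imply. The 35 invites, two reviewers, and 50 references come from the text above; the 10 minutes per reference is purely my assumption, and the function names are made up for illustration.

```python
def reviewer_acceptance_rate(invites_sent, reviewers_secured):
    """Fraction of review invitations that end in an accepted review."""
    return reviewers_secured / invites_sent


def reference_check_hours(n_references, minutes_per_reference):
    """Hours one reviewer would spend verifying every reference."""
    return n_references * minutes_per_reference / 60


# 35 invites for two reviewers, as cited in the text.
rate = reviewer_acceptance_rate(invites_sent=35, reviewers_secured=2)
print(f"{rate:.1%}")   # 5.7%

# 50 references at an ASSUMED 10 minutes each (locate, open, confirm the claim).
hours = reference_check_hours(n_references=50, minutes_per_reference=10)
print(f"{hours:.1f}")  # 8.3
```

On that assumption, a full reference check adds roughly a working day of unpaid effort per reviewer, on top of a system where fewer than one invitation in seventeen is accepted.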
I’ve said it before and I’ll say it again: it’s time for a radical shift. We need to stop the fantasy of AI checking AI’s homework. We need to fund professional referees: a return to genuine peer review, with scientists appropriately rewarded for verifying the integrity of the publication record [9][13]. Until then, treat every reference list you read with suspicion. The illusion of accuracy has shattered.
References
1. Hardman TC. Do you remember Index Medicus? Ramblings of an old fool. 2018.
2. Ucar I, López-Fernandino F, Rodriguez-Ulibarri P, et al. Growth in the number of references in engineering journal papers during the 1972–2013 period. Scientometrics. 2014;98:1855–1864.
3. Nicolaisen J, Frandsen TF. Number of references: a large-scale study of interval ratios. Scientometrics. 2021;126(1):259–285.
4. Sánchez-Gil S, Gorraiz J, Melero-Fuentes D. Reference density trends in the major disciplines. Journal of Informetrics. 2018;12(1):42–58.
5. Jergas H, Baethge C. Quotation accuracy in medical journal articles: a systematic review and meta-analysis. PeerJ. 2015;3:e1364.
6. Topaz M, et al. Fabricated references in biomedical literature: a 2023-2026 audit of PubMed Central. Lancet. 2026 May;387(10033):e21-e22.
7. Retraction Watch. One in 277 PubMed-indexed papers in 2026 shows fabricated references, says analysis. 2026 May 7.
8. Niche Science & Technology Ltd. An Insider’s Insight on Literature Searches. 2020.
9. Longrois D, Rozencwaig S, Guinot PG. Rethinking peer review in medicine: From trust to transformation. J Crit Care Med. 2025;11(3):205-207.
10. Albert AYK, Gow JL, Cobra A, et al. Is it becoming harder to secure reviewers for peer review? A test with data from five ecology journals. Res Integr Peer Rev. 2016;1:14.
11. Meyerson LA, Suzzi-Simmons A, Simberloff D. Quantifying reviewer declines in scientific publishing: twenty-one years of data from biological invasions 2002–2024. Biol Invasions. 2025;27:223.
12. Bergstrom CT, Gross K. Screening, sorting, and the feedback cycles that imperil peer review. PLoS Biol. 2026 Feb 24;24(2):e3003650.
13. Hardman TC. Are Science Publishers Monster Predators? 2024.