From Hand-Written Bibliographies to AI-Hallucinated Citations

May 19, 2026

During my undergraduate days, the end of an essay signalled not relief, but the dreaded ritual of writing out the references. Each bracket, colon, and page number had to be meticulously transcribed directly from the article or a borrowed photocopy (in the days when students could get free copies).

Moving into post-graduate studies brought the luxury of a shared departmental computer. The clunky word processor meant I was no longer writing citations by hand, but I was still typing them. The potential to transpose digits or misspell an author’s name was high. However, any potential damage was contained. If a reference was wrong, it was usually a typo, unlikely to infect the literature as a whole, which was more of a disconnected body of knowledge linked through medical library collections and Index Medicus [1]. Equally, we tended to keep the number of citations to a minimum (to avoid all the typing). Journal referees reviewing your submission would be just as aware of the background literature as you – and grateful to you for identifying any new source they may have missed. So, woe betide you if you made one up.

Then came CD-ROMs, then online databases, then PC reference managers. We rejoiced. We could suddenly hold details of 1,000 manuscripts in our own database and import reference details into manuscripts (I have to admit that I never managed to do this successfully). The use of references in our manuscripts expanded [2][3][4]. But in that explosion of volume, the potential for error propagation exploded too. It became more challenging for a journal referee to verify every single reference, particularly in review articles that may cite over 50 articles. Citing ‘secondary sources’ (papers you haven't read but saw cited elsewhere, or whose abstract you only scanned) entered common practice and served as a major vector for error propagation [5]. The data showed that even in 2015 the "total quotation error rate" hovered around 25.4%; effectively, one in four citations in medical journals had some form of error. Most were minor, but nearly 12% were classified as "major errors," meaning the citation did not support the claim it was attached to. It was a wake-up call, but the potential impact felt small.

Nothing, absolutely nothing, prepared us for the coming tsunami.

The Lancet Verdict: A 12-Fold Surge

Fast forward to the present day, and the quaint ‘typo’ has evolved into what could reasonably be considered an existential threat to the evidence base. In a recent audit published in The Lancet, researchers reported on their analysis of 97.1 million references across 2.5 million papers. The results are shocking: fabricated citations have grown more than 12-fold since 2023 [6][7]. While the background rate of citation errors appears to have been steady for years, the curve hockey-sticks upward from mid-2024 onward. Perhaps we shouldn’t be surprised that this coincides with the mainstream adoption of generative AI writing tools like ChatGPT [7]. But this isn't an explosion in typos. This is wholesale hallucination. The study found that some papers had significant error rates; in one case, out of 30 references listed, 18 were completely fabricated (we assume by LLMs) [6][7].

The concern is direct. Clinical guidelines and systematic reviews assume that the evidence cited in primary manuscripts is real. When an author uses AI to ‘find’ supporting literature for a new drug protocol or a guideline, and the paper it supplies doesn't exist, there is the potential for errors to pass directly into patient care. You would hope that authors preparing guidelines would conduct robust literature searches and validate any work that they cite (as per instructions provided in our Insider’s Insight [8]). However, in some cases these could be the same authors who failed to conduct appropriate checks when writing primary manuscripts.

…it gets worse

Be prepared to compound the irony: the journals providing this information (and standing as the bastions of science) take almost no responsibility. There are no definitive prepublication reference checks, and currently over 98% of papers identified with these hallucinated references remain uncorrected by publishers [7].

The most logical response would be to stop relying entirely on machines. Publishers should pay reviewers appropriately for the time-consuming task of acting as a referee, and specifically for checking the validity of all references cited [9]. But, adding insult to injury, this pandemic of scientific hallucination has arrived at the same time as a shortage of scientists willing and qualified to serve as referees, along with an increasing number of manuscripts requiring review [10][11]. The perfect storm.

Perhaps equally disturbing, but less well publicised than the Lancet observations, is the recent work describing the screening, sorting, and feedback cycles that imperil peer review [12]. The authors developed formal models describing feedback loops that can destabilise peer review. Their analysis argues that increasing submission volume is overloading shrinking reviewer pools, which reduces review quality and encourages even more speculative submissions, further amplifying the problem. Their formulation is notable because it frames peer review as a systems problem rather than merely an administrative inconvenience. Summarising their proposition:

increasing submissions “overtax the pool of suitable peer reviewers,”
forcing journals either to recruit less qualified reviewers or overburden existing reviewers,
thereby reducing review accuracy and threatening the stability of the system.
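
To make the loop concrete, here is a toy sketch in Python. It is my own illustration, not the authors' model [12]: the function name `simulate_review_load` and every parameter value (starting submissions, reviewer numbers, growth and attrition rates) are hypothetical, chosen only to show how the loop feeds itself.

```python
def simulate_review_load(years=10, submissions=1000.0, reviewers=500.0,
                         capacity_per_reviewer=4.0, reviews_per_paper=2.0,
                         base_growth=0.05, speculative_gain=0.10,
                         attrition=0.05):
    """Toy feedback loop; all defaults are illustrative assumptions."""
    history = []
    for year in range(years):
        demand = submissions * reviews_per_paper       # reviews needed this year
        capacity = reviewers * capacity_per_reviewer   # reviews the pool can deliver
        # Quality is 1.0 while the pool copes and falls once demand exceeds capacity.
        quality = min(1.0, capacity / demand)
        history.append((year, round(submissions), round(reviewers), round(quality, 2)))
        # Weaker filtering encourages more speculative submissions.
        submissions *= 1 + base_growth + speculative_gain * (1 - quality)
        # Overloaded reviewers drop out of the pool faster.
        reviewers *= 1 - attrition * (1 - quality)
    return history


if __name__ == "__main__":
    for year, subs, revs, quality in simulate_review_load():
        print(f"year {year}: submissions={subs} reviewers={revs} quality={quality}")
```

With these made-up numbers, review quality starts at 1.0 and falls in each subsequent year, because every drop in quality both accelerates submissions and thins the reviewer pool. The absolute values mean nothing; the point is only the direction of travel.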

The toxic combination of drivers for our perfect storm is summarised in the table below. It’s so simple you can almost hear Homer Simpson saying, “doh!” However, the knee-jerk solution seems to be for publishers to ‘fight AI with AI’: use new algorithms to check the hallucinations of old algorithms. It is like asking a liar to check the homework of another liar. The most logical approach, treating scientists as partners in the scientific endeavour and properly recognising their contribution (to what is a multi-billion dollar Ponzi scheme [13]), would affect the bottom line of publishers. Better to just adopt the cheap route of automated checks and damn the future.

Proposed driver	Mechanism
Explosion in manuscript submissions	More review requests per scientist
Publish-or-perish incentives	Encourages quantity over selectivity
Growth of journals and special issues	Expands review demand
Administrative overload on academics	Less time for unpaid reviewing
Lack of reviewer incentives	Reviewing competes with rewarded activities
Increasing specialization	Harder to find suitably expert reviewers
AI-assisted manuscript production	Lowers friction for manuscript generation
Reviewer fatigue	Repeated invitations to the same experts

The Lazy Creature Hypothesis

Revisiting the curve describing the seemingly exponential increase in fabricated citations in the Lancet article, you have to wonder: is it tailing off? Large Language Models (LLMs) are getting marginally better at suppressing their ‘creativity.’ Will hallucinations continue to be a problem now that researchers know to validate any list of suggested references? Could the Lancet’s observation hide a deeper threat to the future of science: the widespread use of LLMs by authors to supply the supporting literature for their hypotheses, rather than assimilating the science captured within the literature and using it to formulate their own understanding?

We are all lazy creatures at heart, and LLMs are the ultimate temptation. Asking a chatbot to "find me five sources that support my theory" is frictionless. Actually reading a body of literature to identify five relevant sources, and appreciating what the data therein mean, is work. This is unsustainable. It’s like watching a hippo trying to edge along a thin branch to grab an apple: eventually something is going to break. In this case, it is the trust in the medical record and the ability of scientists to conduct research with integrity.

The concern is no longer merely operational (“editors are annoyed”) but epistemic: whether the scientific literature can continue to function reliably as its own quality-control system under current growth conditions. We are already seeing increasing discussion of the downstream effects on the functioning of science itself:

delayed dissemination of valid findings,
weaker filtering of flawed studies,
increased publication of low-quality work,
erosion of trust in peer review,
and distortion of scientific incentives.

A Call for Professional Referees

The recommendations from the Lancet team are insightful but perhaps naive. They suggest publishers verify references pre-publication, indexing services add metadata on "citation validity," and databases track fabricated citations [6]. The problem is not dissimilar to that experienced in MedComms some decades ago. There, professional medical writers are expected to submit all source materials with the supporting evidence marked up for confirmation by the client. This represents a simple ‘human-in-the-loop’ fix, but one journals refuse to face. The additional burden on reviewers would require a reward greater than the pathetic ‘recognition’ currently offered. The peer review system is in crisis: when submitting our work to journals we are regularly required to provide the details of a host of potential referees, and it takes over 35 invites to get two reviewers [9]. We simply cannot expect volunteers drowning in their own work to fact-check 50 references for free.

I’ve said it before and I’ll say it again: it’s time for a radical shift. We need to stop the fantasy of AI checking AI’s homework. We need to fund professional referees: a return to peer review, with scientists appropriately rewarded for verifying the integrity of the publication record [9][13]. Until then, treat every reference list you read with suspicion. The illusion of accuracy has shattered.

References

About the author

Tim Hardman

Managing Director

Dr Tim Hardman is the Founder and Managing Director of Niche Science & Technology Ltd., the UK-based CRO he established in 1998 to deliver tailored, science-driven support to pharmaceutical and biotech companies. With 25+ years’ experience in clinical research, he has grown Niche from a specialist consultancy into a trusted early-phase development partner, helping both start-ups and established firms navigate complex clinical programmes with agility and confidence.

Tim is a prominent leader in the early development community. He serves as Chairman of the Association of Human Pharmacology in the Pharmaceutical Industry (AHPPI), championing best practice and strong industry–regulator dialogue in early-phase research. He is also a Board member of the European Federation for Exploratory Medicines Development (EUFEMED) and served as its President from 2021 to 2023, promoting collaboration and harmonisation across Europe.

A scientist and entrepreneur at heart, Tim is an active commentator on regulatory innovation, AI in clinical research, and strategic outsourcing. He contributes to the Pharmaceutical Contract Management Group (PCMG) committee and holds an honorary fellowship at St George’s Medical School.

Throughout his career, Tim has combined scientific rigour with entrepreneurial drive—accelerating the journey from discovery to patient benefit.