Startup investigation reveals 50 peer-reviewed papers contained hallucinated citations

GPTZero found “obvious” instances of false, LLM-generated references in academic papers.

GPTZero, the startup behind an artificial intelligence (AI) detector that checks for large language model (LLM)-generated content, has found that 50 peer-reviewed submissions to the International Conference on Learning Representations (ICLR) contain at least one obvious hallucinated citation—meaning a citation that was dreamed up by AI. ICLR is one of the leading academic conferences focused on the deep-learning branch of AI.

“Let’s use AI, but then let’s make sure we’re also holding those things it produces up to a higher standard.”

Alex Cui,
GPTZero

The three authors behind the investigation, all based in Toronto, used their Hallucination Check tool on 300 papers submitted to the conference. According to the report, they found that 50 submissions included at least one “obvious” hallucination. Each submission had been reviewed by three to five peer experts, “most of whom missed the fake citations.” Some of the citations credited non-existent authors, were attributed to the wrong journals, or had no matching publication at all.
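GPTZero has not published the inner workings of its Hallucination Check, but the general idea behind automated citation verification is straightforward: look each reference up in an open bibliographic index and see whether anything plausibly matches. The following is a minimal sketch, assuming Crossref’s public REST API and simple fuzzy title matching; it is an illustration of the general approach, not GPTZero’s method.

```python
# A rough illustration of automated citation checking, not GPTZero's actual
# methodology. It queries Crossref's public API for a cited title and checks
# whether any indexed work closely matches, which catches only the most
# obvious fabrications (titles with no real-world counterpart at all).
import difflib
import requests

def citation_resolves(cited_title: str, threshold: float = 0.85) -> bool:
    """Return True if a cited title closely matches a real indexed work."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": cited_title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for title in item.get("title", []):
            ratio = difflib.SequenceMatcher(
                None, cited_title.lower(), title.lower()
            ).ratio()
            if ratio >= threshold:
                return True  # a plausibly matching record exists
    return False  # no indexed work resembles the citation

# Example: a made-up title should fail to resolve.
# print(citation_resolves("Deep Gradient Alchemy for Self-Citing Transformers"))
```

A check like this only flags the most blatant inventions; citations that reuse a real title but attach the wrong authors or venue would require comparing the returned metadata field by field.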

The report notes that without intervention, the papers were rated highly enough that they “would almost certainly have been published.”

“We’re pretty surprised,” Alex Cui, co-founder and CTO of GPTZero, told BetaKit. “We just struck gold but kind of in the wrong way,” he said. “There’s probably a lot more where that came from.”

GPTZero’s investigation notes that the authors have been “collaborating with the ICLR program chairs” on their findings. Cui confirmed that they have been working with ICLR to determine whether other papers contain hallucinated citations by running the check on all 20,000 papers submitted to ICLR 2026. “The deadline is coming up in a month to announce acceptances,” Cui said. “So we’re under a bit of a time crunch, but I think we can get it done.”

Colin Raffel, associate professor at the University of Toronto and a program chair at ICLR, told BetaKit he and his colleagues “are continuing to flag and desk reject submissions that violate our policies.” 

Cui said that GPTZero is planning to expand the use of its tool to other conferences, and hopes to eventually apply its model to other kinds of scientific review.

The company, founded by Cui and Edward Tian, began life as a web app in December 2022 and quickly garnered 30,000 users. After an official launch in January 2023, the user base grew to four million in 2024 and the company snagged a $10-million “preemptive” Series A funding round from Footwork co-founder Nikhil Basu Trivedi. At time of writing, the company boasted approximately 10 million users, including organizations from Purdue University to UCLA.

Blair Attard-Frost, assistant professor of political science at the University of Alberta and fellow at the Alberta Machine Intelligence Institute, has been studying AI governance policy and ethics for nearly 10 years. She said that GPTZero’s findings are unsurprising, given the broad rise of AI usage in academic work.

“You have this giant influx of even more papers that puts more strain on peer-review processes, to journals that are trying to address that influx of new papers,” Attard-Frost explained. “You also have a situation where a lot of people who work in academia are already really strained… and have really limited capacity to take on peer review.”

A promising but still-flawed tool

As the public conversation around LLMs has grown, their use in academia has been studied both in the authoring of academic papers and in the review process. One article published in September 2023 in the Yale Journal of Biology and Medicine found that OpenAI’s ChatGPT “showed immense promise” in the journal reviewing process. When compared to human counterparts, the LLM proved capable of “identifying methodological flaws,” among other benefits.

But further research has identified crucial flaws in LLMs’ ability to cite sources accurately in an academic context. One paper found that LLMs used in reviews of ICLR 2024 submissions gave authors “inflated” scores and ultimately “boosted acceptance rates” for papers submitted to the conference. Another article, published in April of this year by Kevin Wu et al., examined how reliably LLMs cite medical references and found that “between 50 percent and 90 percent of LLM responses are not fully supported, and sometimes contradicted, by the sources they cite.”


James Zou, associate professor of biomedical data science at Stanford University, noted in November 2024 that up to 17 percent of peer reviews have been written by AI, and advocated for guidelines on AI use in the academic process.

False citations generated by LLMs haven’t just touched the academic sphere. In November, the online outlet The Independent broke a story about a $1.6-million Deloitte report, commissioned by the Newfoundland and Labrador government, that contained incorrect citations likely generated by AI. Soon after the story broke, Newfoundland and Labrador Minister of Government Services Mike Goosney was instructed to review AI guidelines for government-commissioned reports. Deloitte had previously been caught including citations to non-existent academic papers, also likely generated by AI, in a report commissioned by the Australian government.

Earlier this year, Canada’s Minister of AI and Digital Innovation Evan Solomon made headlines when he said the federal government’s priority is the economic benefits of AI, rather than “over-indexing” on regulation. Since then, the minister has hinted at AI legislation that would regulate deepfakes and data privacy. In an interview with Christopher Guly of University Affairs, Solomon spoke generally about the relationship between the AI industry and Canadian universities. “Universities are now no longer just places of pure academic research,” Solomon said.

“A right and a wrong way” to peer review with LLMs

Attard-Frost doesn’t see the federal government’s approach as a solution to the rise of LLM usage in academic contexts. “I don’t think academics should be waiting for the federal government to be doing something about this,” she said. She described the underlying problem as academic workers being overworked and under-supported, and noted that peer review is “essentially free service work.”

Concerning the climate around AI usage in reports and academic papers in Canada, Cui stressed that proper implementation is crucial. “There’s a right and a wrong way to do it,” he said. Cui also said disavowing AI usage entirely is not helpful. “Let’s use AI, but then let’s make sure we’re also holding those things it produces up to a higher standard.”

Attard-Frost has a more skeptical view of using AI models to catch LLM-generated false citations. She pointed to the 99 percent accuracy GPTZero touts for its Hallucination Detector. That success rate is impressive, but even a one percent error rate becomes a problem when applied to ICLR’s 20,000 submissions. “They’re going to falsely flag 200 submissions as potentially AI-generated, which could create academic integrity concerns for the authors of 200 papers who didn’t use AI at all in their work.”
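Her concern is a base-rate argument. A rough sketch of the arithmetic, assuming the advertised 99 percent accuracy translates directly into a one percent misclassification rate (the report does not spell out how that figure was measured):

```python
# Back-of-envelope arithmetic behind Attard-Frost's concern: a detector that
# is right 99 percent of the time will still misfire on roughly 1 percent of
# a large submission pool. Illustrative only; the real false-positive rate
# depends on how "99 percent accuracy" was actually evaluated.
submissions = 20_000               # ICLR 2026 submissions being screened
error_rate = 1 - 0.99              # assumed misclassification rate
expected_misclassifications = submissions * error_rate
print(expected_misclassifications)  # 200.0
```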

Attard-Frost suggested other measures that could be put in place to deter LLM-generated false citations, such as a “compounding fee submission model.” Under this model, first authors would get one free submission, with escalating fees for each additional paper on which they are first author, dissuading cavalier use of LLMs in the process. Another is an “endorsement model,” in which three different verified humans would vouch for those submitting papers to conferences. However, Attard-Frost stressed that there isn’t one fix for the issue. “It’s really application by application,” she said. “There’s not really a clear solution.”

Cui, speaking for GPTZero, said the company hopes its findings show that public accountability is possible amid the rise in false citations. “It’s not like a lost cause, we actually can create tools to do this.”

Feature image courtesy Unsplash. Photo by Sarah Elizabeth.
