The head of Cohere’s research division is concerned that alleged unreliability in the rankings of a popular chatbot leaderboard amounts to a “crisis” in artificial intelligence (AI) development.
A new study co-authored by Sara Hooker, head of Cohere Labs, along with researchers at Cohere and leading universities, claims that large AI companies have been “gaming” the crowd-sourced chatbot ranking platform LM Arena to boost the ranking of their large language models (LLMs).
“One of the benchmarks that is most widely used, most highly visible, has shown a clear pattern of unreliability in rankings,” Hooker, who is also vice-president of research at Cohere, said in an interview with BetaKit.
Hooker and her co-authors are trying to highlight what she says is a lack of transparency and trustworthiness that is eroding the value of an AI model leaderboard widely used by academia, enterprises, and the public.
The research paper, titled “The Leaderboard Illusion,” was written by researchers from Cohere, Cohere Labs, Stanford University, Princeton University, the University of Waterloo, the University of Washington, MIT, and Ai2. It was published on the open-access platform arXiv and has not yet been peer reviewed.
LM Arena’s “Chatbot Arena” has become a leading public metric for ranking LLMs. It was spun out from a research project at the University of California, Berkeley. The “arena” gimmick comes from users comparing the performance of two chatbots side-by-side in a “battle” and voting for the winner.
The paper’s authors accuse LM Arena of allowing leading AI developers—such as Meta, Google, and OpenAI—to conduct extensive pre-release private testing and to retract scores for models that did not perform well. The paper also claims these developers get more testing opportunities, or more “battles,” giving them access to more data compared to open-source providers. The authors say this results in “preferential treatment” at the expense of competitors.
RELATED: Did Cohere give Canada its DeepSeek moment?
The paper claims that LM Arena only made it clear to some model providers that they could run multiple pre-release tests at once. According to the analysis, Meta privately tested 27 versions of Llama-4 before releasing its final model, which ranked high on the leaderboard when it debuted.
Meta had already been caught uploading a different version of Llama-4 to Chatbot Arena that was optimized for human preference. In response, LM Arena updated its policy and stated that Meta’s conduct “did not match what we expect from model providers.”
In a post on X, LM Arena denied the notion that some model providers are treated unfairly and listed a number of “factual errors” in the paper. The organization said its policy, which allows model providers to run pre-release testing, had been public for a while.
Cohere’s Hooker directly responded on X to some of the critiques LM Arena raised about the paper, and thanked LM Arena for its engagement.
“I’m hoping we have a more substantial conversation. We want changes to the arena,” Hooker told BetaKit. “This was an uncomfortable paper to write.”
The paper’s authors called on LM Arena to cap the number of private variant models that a model provider can submit. They also called for a ban on retracting submitted scores, improvements to sampling fairness, and greater transparency surrounding which models are removed and when.
“I hope that there is an active discussion and a sense of integrity that maybe this paper has spurred,” Hooker said. “It’s so critical that we acknowledge that this is just bad science.”
RELATED: At World Summit AI, cautious tone of researchers drowned out by cutthroat adoption race
Cohere Labs is the non-profit research lab run out of Toronto-based LLM developer Cohere. Though Cohere is Canada’s largest entrant in the uber-competitive, capital-intensive AI race, its models have not soared to the top of the Chatbot Arena leaderboard. Its Command A model, which the company claims outperformed OpenAI’s GPT-4o from last November as well as DeepSeek’s v3, is ranked 19th.
Cohere caters exclusively to enterprise clients, a narrower focus than some of its competitors, though most tech giants are also making enterprise plays as businesses feel pressure to integrate AI into their workflows.
Alternative benchmarking
Deval Pandya, VP of engineering at the Toronto-based not-for-profit Vector Institute, told BetaKit that this discourse highlights a need to continue improving AI model evaluations.
The Vector Institute is the brainchild of Waabi CEO Raquel Urtasun, Deep Genomics CEO Brendan Frey, and Canadian Nobel Prize winner Geoffrey Hinton. The non-profit research institute recently released its own comprehensive evaluation of 11 leading AI models.
Vector’s AI model leaderboard is not dynamic or crowd-sourced. Instead, it displays scores for nearly a dozen scientific benchmarks, from mathematical reasoning tasks to code generation.
To Pandya, an evaluation like Vector’s serves a different but still important purpose from Chatbot Arena’s. He argued that consumers can benefit from crowd-sourced data based on human preference, while enterprises might want something more granular if they are looking to mix and match different AI models for a business use case.
AI companies have an incentive to self-report only the best progress they have made, especially when they are public companies, Pandya said. The challenge is to pursue objective model evaluations to make sure the claims companies make are true. And he said there’s a need for more independent projects like Vector’s to help evaluate all available models, not just those in the limelight.
“The goal is to democratize how we think about evaluations,” Pandya said.
Feature image courtesy World Economic Forum, CC BY-NC-SA 2.0, via Flickr.