Academics, nonprofits caught in middle of data consent fight as AI companies push for access to copyrighted works

The open web is closing as copyright holders bar companies like OpenAI and Cohere from using their work.

As major artificial intelligence (AI) companies like OpenAI, Cohere, and Anthropic seek more copyrighted works to train their models, it is academics, nonprofits, and early-stage AI startups who are struggling the most to find material they are allowed to use, experts told BetaKit.

AI companies are running out of data sources as they seek to continue scaling and advancing their large language models (LLMs), which are trained on trillions of words of human-generated text from around the web. A June paper by research institute Epoch AI estimated the stock of human-generated public text at around 300 trillion “tokens,” a supply LLMs could fully exhaust between 2026 and 2032, or even earlier if models are “intensely over-trained.”

Notably, Epoch didn’t define public texts beyond citing instant messages as an example of non-public text, suggesting its estimate refers to what is available rather than what anyone has a guaranteed right to use.

Restrictions are quickly being added to, or enforced on, a significant chunk of the data that companies use to train their models. OpenAI, Microsoft, Stability AI, Anthropic, Udio, and Suno are facing copyright lawsuits from newspapers, authors, and some of the world’s largest record labels. And a growing number of web publishers are attempting to bar AI web crawlers from scraping their content.

Between April 2023 and April 2024, AI-specific restrictions were added on five percent of the data in three widely used datasets—C4, RefinedWeb and Dolma—and on 25 percent of the most critical sources, according to a July study by the Data Provenance Initiative, a volunteer collective of global AI researchers that audits AI-training datasets.  

The study examined 14,000 domains and found a “proliferation of AI-specific clauses to limit use” in websites’ robots.txt files, which tell web crawlers what they are and aren’t allowed to take. There’s no legal requirement for crawlers to respect a website’s robots.txt file.
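For illustration, a compliant crawler consults those rules before fetching anything. The following is a minimal sketch using Python’s standard-library robots.txt parser; the rules themselves are hypothetical, though GPTBot and CCBot are real user-agent tokens (OpenAI’s and Common Crawl’s, respectively).

```python
# Minimal sketch: how a compliant crawler consults robots.txt before
# scraping. The rules below are hypothetical; GPTBot (OpenAI) and
# CCBot (Common Crawl) are real user-agent tokens.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "CCBot", "SomeSearchBot"):
    verdict = "allowed" if parser.can_fetch(agent, "https://example.com/articles/") else "blocked"
    print(f"{agent}: {verdict}")
# Prints: GPTBot: blocked, CCBot: blocked, SomeSearchBot: allowed
```

Nothing compels a crawler to run a check like this; honouring the file is voluntary, which is why the restrictions the study counted are preference signals rather than legal barriers.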

Shayne Longpre, lead of the Data Provenance Initiative and a co-author of the July report, said the percentage of data now under restrictions is “significantly higher” when factoring in websites’ terms of service documents. Between 45 and 55 percent of websites in the three datasets now have data use restrictions in their terms of service, which are legally enforceable. The paper estimated the amount of restrictions will only continue to grow.

“What this means for the organizations and AI developers that do respect robots.txt preferences … the quality of models they can produce will be worse,” said Longpre, who is also a PhD candidate at the Massachusetts Institute of Technology researching the responsible training, evaluation, and governance of general-purpose AI systems. 

The quantity and the quality of data are both important to large language models and other foundation models, he said, and these restrictions hit both. Sources of high-quality data, including news websites, periodicals, and social media platforms, were the most likely to have implemented restrictions.

OpenAI web crawlers were the most heavily restricted (25.9 percent of the data), followed by Anthropic (13.3 percent), and Google’s AI crawler (9.8 percent). Toronto-based Cohere was the least restricted, at 4.9 percent.

Caught up in a copyright fight

Major AI developers have said that they need more data, and particularly copyrighted data, to improve the quality of their models. In a submission to the British House of Lords in January, OpenAI argued it should be allowed free access to copyrighted data because it would otherwise be “impossible” to continue training ChatGPT.

However, Roxana Sultan, chief data officer and vice-president of health at the Vector Institute, told BetaKit that AI models don’t necessarily need copyrighted data to continue improving. 

“Right now there’s a lot of research happening specifically to address that question,” she said. Companies can use a mix of open-source and synthetic data—data that’s artificially generated specifically to train AI models—and modify it to increase the diversity and quantity of the training sample. 

“The key to improving the quality of a gen AI model is ensuring the quality and diversity of the data. That doesn’t necessarily mean data forever, data ad infinitum,” she said. 

Longpre said the foreclosure of parts of the open web—the parts of the web that are accessible without a password or paywall—has unintended consequences. The Internet Archive and Common Crawl, which crawl the web for archival purposes, are heavily used by academics and journalists, and “tens of thousands of academic articles have been written off the back of that research,” he said. “But at the same time, they can also be used for nonprofit and corporate AI, so there’s a conflation of uses that makes it hard right now for websites and data creators.” 

Having to ban crawlers with public-interest functions in order to block scraping by AI startups is one example of the challenges web publishers face when trying to express preferences for how their content may be used, Longpre said. They have been forced to enumerate every possible AI-related crawler and disallow each one in their robots.txt file, rather than being able to express preferences such as allowing their content to be used only for non-commercial purposes, or only with attribution. Recent reporting by 404 Media has detailed the difficulties publishers have faced in attempting to stop Anthropic from scraping their content.
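In practice, that enumeration looks something like the hypothetical robots.txt below, assembled from crawler tokens that appear on commonly published blocklists; any crawler that isn’t named, or that simply ignores the file, is unaffected.

```
# Hypothetical robots.txt: the publisher must name each AI crawler.
# Grouped user-agents share the single rule that follows them.
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: cohere-ai
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```

The protocol offers only allow and disallow, per named agent; there is no directive for “non-commercial use only” or “use with attribution,” which is why blocking an AI startup’s crawler can also mean blocking archival crawlers like Common Crawl’s that serve public-interest research.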

“The largest AI companies can likely afford to license a lot of the highest-quality data, and some are already doing that. It’s also the case that they have really strong crawling technology, and if robots.txt files are not enforced, depending on the outcome of these lawsuits in the U.S., they may keep fighting to get access to this data anyway,” he said. “The organizations that are most hard-hit by this are smaller companies that can’t afford to spin up more complex crawlers…as well as web archives, academics, and nonprofits in the field of AI and outside of it.” 

Sultan told BetaKit that the Vector Institute has seen those growing limitations. 

“The restrictions and things like that are certainly very visible to us,” she said. “Certainly for academic organizations and organizations that do research, and small and medium enterprises in the AI and machine learning space, if they’re early in their journey and still in the R&D phase and doing a lot of work similar to an academic institution, they’ll get hit with even more restrictions because there are greater restrictions on anything construed as commercial.” 

Consultation, negotiation, litigation

As data consent becomes a growing issue, Cohere, Google, and Microsoft have asked the Canadian federal government for a legal exemption to the Copyright Act that would allow them to use copyrighted materials to train their models without being required to pay rights holders or obtain their permission.

Innovation, Science and Economic Development Canada launched a public consultation last fall asking for input on whether to amend the Copyright Act in response to generative AI systems. One question asked whether AI firms should license copyrighted material when training commercial models, or whether to offer an exemption similar to the Act’s fair dealing provision, which allows the use of copyrighted works for purposes such as education, criticism and review, and satire and parody.

All three companies argued that text and data mining for AI training isn’t copyright infringement, claiming an AI model consumes data only to learn concepts, facts, and patterns, not for its expressive content. Being required to pay for copyrighted content could jeopardize Canada’s lead in AI development, they argued, and further contribute to the country’s productivity crisis.

Microsoft noted in its submission that there is a “shortfall in the data that companies and organizations have access to in order to benefit from the promise of AI” that access to copyrighted works could help to address. “More open access to data is needed to help organizations of all types take advantage of AI.”

Both Microsoft and Google Canada declined BetaKit’s interview requests. Cohere did not respond to multiple emails.

Cohere said in its submission that rights holders who don’t want their work used for AI training can prevent it with technological protection measures such as passwords, paywalls, captchas, or encryption. Microsoft said it has introduced options for creators to request that their works not surface in outputs from the company’s Bing generative AI chatbot, and that Bing will also refuse copyright-infringing outputs, such as providing song lyrics or excerpts of novels on request; it said both changes were made after discussions with creators.

The companies said they shouldn’t be required to disclose the data in their training sets, arguing that LLMs ingest far too much data for disclosure to be practical, and that the datasets change as a foundation model is continually developed.

Paul Banwatt, a partner at Gilbert’s LLP in Toronto whose practice includes high-tech intellectual property litigation, emerging technologies, and music and entertainment law, said disclosure and attribution of data sources “is not foreign” to a fair dealing-type exemption, if the government were to ultimately go that route. But he added that even if AI companies were given that exemption, rights holders are likely to continue restricting access to their works through their terms of service.

“[Companies] may be in a position where, even if you had the right under the copyright act to train your model, you have to disclose that you did so in an unlawful way via terms of use [violations],” he said.

Numerous groups representing artists, musicians, authors, and other copyright holders urged the government to reject those proposals. Music Publishers Canada, in its submission, noted AI developers themselves consider copyrighted works to be quality sources of data, and argued an “opt-out” copyright system and lack of requirement to disclose sources would kill the emerging market for licensing music to AI companies. 

“The false narrative is that licensing will stifle innovation, and we know that’s absolutely not true,” said Margaret McGuffin, MPC’s chief executive officer. “What we find from big tech companies is a resistance to admit liability and sit down and negotiate.” 

McGuffin said she wants to see the government require companies to disclose the data used to train their models, similar to what the European Union has done, which she said will “help in…compelling companies to come to the table or face litigation.”

Sultan said that while she can’t speak for other organizations’ data management practices, it is possible, with effort and resources, to properly attribute and disclose the data sources that were used to train a model. 

Vector has a five-step process for evaluating prospective datasets, including searching them for copyrighted material with terms too restrictive to allow use, identifying any ethical concerns with the data, and obtaining any permissions necessary for promising datasets. In its submission to the consultation, the institute noted there are numerous Creative Commons licenses for datasets that allow non-commercial, and even commercial, use with attribution. The institute has also been building a tool that could flow attribution requirements in the training dataset through to the final product, said Jessica Blackman, Vector’s director of health research operations.
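As a rough illustration of what license-aware screening with attribution flow-through can look like, consider the sketch below; the license lists and record format are hypothetical, not Vector’s actual tooling.

```python
# Hypothetical sketch of license-aware dataset screening, with
# attribution carried through to the released model's documentation.
from dataclasses import dataclass

# Creative Commons licenses permitting commercial use; all but CC0
# also require attribution.
COMMERCIAL_OK = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}
ATTRIBUTION_REQUIRED = {"CC-BY-4.0", "CC-BY-SA-4.0"}

@dataclass
class Dataset:
    name: str
    license: str
    source_url: str

def screen(candidates: list[Dataset], commercial: bool) -> tuple[list[Dataset], list[str]]:
    """Return usable datasets plus the attribution lines that must
    flow through to the final model's documentation."""
    kept, credits = [], []
    for ds in candidates:
        if commercial and ds.license not in COMMERCIAL_OK:
            continue  # terms too restrictive for a commercial model
        kept.append(ds)
        if ds.license in ATTRIBUTION_REQUIRED:
            credits.append(f"{ds.name} ({ds.license}): {ds.source_url}")
    return kept, credits

usable, credits = screen(
    [Dataset("open-notes", "CC-BY-4.0", "https://example.org/open-notes"),
     Dataset("nc-corpus", "CC-BY-NC-4.0", "https://example.org/nc-corpus")],
    commercial=True,
)
# "usable" keeps only open-notes; "credits" records its required attribution.
```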

“Ultimately the quality of your model, the types of models we’re trying to create and put out there, that really is very heavily dependent on good quality data. Good quality data usually does come with some degree of parameters,” Sultan said. “It does translate to a return on that investment to seek out the top quality and ensure compliance with their parameters.”

Feature image courtesy Patrick Tomasso via Unsplash.
