Meta’s AI memorised books verbatim - that could cost it billions

In April, book writers and publishers protested Meta’s use of copyrighted books to train AI

VUK VALCIC/ALAMY LIVE NEWS

Billions of dollars are at stake as courts in the US and UK decide whether tech companies can legally train their artificial intelligence models on copyrighted books. Authors and publishers have brought several lawsuits over this issue, and researchers have now shown that at least one AI model has not only used popular books in its training data, but has also memorised their contents verbatim.

Many of the disputes revolve around whether AI developers have the legal right to use copyrighted works without first asking permission. Previous research found that many of the large language models (LLMs) behind popular AI chatbots and other generative AI programs were trained on the “Books3” dataset, which contains nearly 200,000 copyrighted books. The AI developers who trained their models on this material have argued that they did not break the law because an LLM produces fresh combinations of words based on its training, transforming rather than replicating the copyrighted work.

But now researchers have tested several models to see how much of that training data they can spit back verbatim. They found that many models do not retain the exact text of the books in their training data – but one of Meta’s models has memorised nearly all of certain books. If a judge rules against the company, the researchers estimate this could make Meta liable for at least $1 billion in damages.

“It means, on the one hand, that AI models are not just ‘plagiarism machines’, as some have claimed, but it also means that they do more than just learn general relationships between words,” says Mark Lemley at Stanford University in California. “And the fact that the answer differs from model to model and from book to book means it is very difficult to set a clear legal rule that works across all cases.”

Lemley formerly defended Meta in a generative AI copyright case called Kadrey v Meta Platforms. Authors whose books had been used to train Meta’s AI models filed a class-action lawsuit against the tech giant for copyright infringement. The case is still being heard in the Northern District of California.

In January 2025, Lemley announced that he had dropped Meta as a client, though he said he still thought the company should win the case. Emil Vazquez, a spokesperson for Meta, says “fair use of copyrighted material is crucial” to developing the company’s AI models. “We disagree with the plaintiffs’ claims, and the full record tells a different story,” he says.

In this latest research, Lemley and his colleagues tested AI memorisation of books by dividing short book excerpts into two parts – a prefix and a suffix – and seeing whether a model prompted with the prefix would respond with the suffix. For example, they split a quote from F. Scott Fitzgerald’s The Great Gatsby into the prefix “They were careless people, Tom and Daisy – they smashed up things and creatures and then retreated” and the suffix “back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made.”

Based on their findings, the researchers estimated the likelihood that each AI model would complete the excerpts verbatim. They then compared these probabilities with the odds of a model doing so by random chance.
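The article does not spell out how these probabilities were calculated. As a rough illustration only, one common way to score a verbatim completion is to multiply the probabilities a model assigns to each suffix token in turn, then compare the result with a uniform random guess over the vocabulary. The token probabilities, the vocabulary size and the function names below are all hypothetical, not values or methods taken from the study.

```python
import math


def verbatim_log_prob(token_probs):
    """Log-probability of emitting every suffix token, in order.

    token_probs[i] is the (assumed) probability the model gives to the
    i-th suffix token, conditioned on the prefix and earlier suffix tokens.
    """
    return sum(math.log(p) for p in token_probs)


def chance_log_prob(num_tokens, vocab_size):
    """Log-probability of the same suffix under uniform random guessing."""
    return -num_tokens * math.log(vocab_size)


# Hypothetical numbers for illustration only.
suffix_token_probs = [0.9, 0.8, 0.95, 0.85]
VOCAB_SIZE = 50_000  # assumed vocabulary size

model_lp = verbatim_log_prob(suffix_token_probs)
chance_lp = chance_log_prob(len(suffix_token_probs), VOCAB_SIZE)

print(f"P(verbatim suffix | prefix) = {math.exp(model_lp):.3f}")
print(f"log-odds over random chance = {model_lp - chance_lp:.1f}")
```

Working in log space avoids numerical underflow for longer suffixes, where the product of many small probabilities would otherwise round to zero.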

The excerpts included pieces of text from 36 copyrighted books, including popular titles like George R. R. Martin’s A Game of Thrones and Sheryl Sandberg’s Lean In. The researchers also tested excerpts from books written by plaintiffs in the Kadrey v Meta Platforms case.

The researchers ran these experiments on 13 open-source AI models, including models developed and released by Meta, Google, DeepSeek, EleutherAI and Microsoft. Most of the companies besides Meta did not respond to a request for comment, and Microsoft declined to comment.

Such testing revealed that Meta’s Llama 3.1 70B model has memorised most of the first book in J. K. Rowling’s Harry Potter series, as well as The Great Gatsby and George Orwell’s dystopian novel 1984. Most of the other models had memorised very little of the books, including the sample books written by the plaintiffs. Meta declined to comment on these results.

The researchers estimate that if an AI model were found to have infringed the copyright of just 3 per cent of the books in the Books3 dataset, it could lead to a statutory damages award of almost $1 billion – and possibly even larger awards based on a developer’s profits from the infringement.
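The article does not give the per-book figure behind this estimate. As a back-of-the-envelope sketch, the arithmetic works out to a little under $1 billion if one assumes the US statutory maximum of $150,000 per work for wilful infringement and rounds Books3 to 200,000 titles. Both of those inputs are assumptions made here for illustration, not numbers confirmed by the researchers.

```python
# Assumed inputs, for illustration only.
BOOKS_IN_DATASET = 200_000          # the article says "nearly 200,000"
INFRINGED_PERCENT = 3               # share of books, per the article
MAX_STATUTORY_PER_WORK = 150_000    # assumed: US cap per work, wilful infringement

# Integer arithmetic keeps the result exact.
infringed_books = BOOKS_IN_DATASET * INFRINGED_PERCENT // 100
total_damages = infringed_books * MAX_STATUTORY_PER_WORK

print(f"{infringed_books:,} books x ${MAX_STATUTORY_PER_WORK:,} = ${total_damages:,}")
# prints: 6,000 books x $150,000 = $900,000,000
```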

This technique could be a “good forensic tool” for identifying the extent of AI memorisation, says Randy McCarthy at the law firm Hall Estill in Oklahoma. But it does not resolve whether companies can legally train their AI models on copyrighted works under the US “fair use” provision, a legal doctrine that allows unlicensed use of copyright-protected works in some circumstances.

McCarthy notes that AI companies generally acknowledge training their models on copyrighted material. “The question is, did they have the right to do it?” he asks.

In the UK, on the other hand, the memorisation finding could be “very meaningful from a copyright perspective,” says Robert Lands at the law firm Howard Kennedy in London. UK copyright law follows the concept of “fair dealing”, which provides a much narrower exception to copyright infringement than the US fair use doctrine. So AI models that have memorised pirated books are unlikely to qualify for this exception, he says.
