
AI Can Recreate Entire Novels from Training Data

Summary

– Top AI models can be prompted to generate near-verbatim copies of copyrighted novels, challenging industry claims that they do not store such works.
– Recent studies show models from major AI companies memorize far more of their training data than was previously understood.
– This memorization undermines a core legal defense in copyright lawsuits, which argues models “learn” from but do not copy protected material.
– The AI industry contends its use of copyrighted material for training constitutes “fair use” by transforming it into something new.
– Specific research demonstrated that strategic prompts could cause models to output large, accurate portions of books like Harry Potter.

Recent research reveals a startling capability of advanced artificial intelligence systems: they can be prompted to reproduce lengthy passages, and even entire novels, from the data used to train them. This finding directly challenges a fundamental claim made by AI developers, who have long asserted that their models do not retain copies of copyrighted material. The discovery has significant implications for the numerous copyright lawsuits currently facing the industry, potentially undermining a key legal defense.

A string of new studies demonstrates that the most sophisticated language models from leading companies memorize a far greater amount of their training data than was once assumed. Experts in both technology and law warn that this “memorization” ability could have serious ramifications for ongoing legal battles. The core argument from AI groups, that models “learn” from copyrighted works in a transformative way without storing copies, appears increasingly difficult to sustain.

Yves-Alexandre de Montjoye, a professor at Imperial College London, notes the mounting evidence. “There’s growing evidence that memorization is a bigger thing than previously believed,” he stated, highlighting a shift in understanding among researchers.

For years, the industry has publicly denied that such verbatim retention occurs. In official communications, companies like Google have insisted that no copy of the training data exists within the model itself. The standard position has been that using copyrighted books for training constitutes “fair use,” as the AI supposedly transforms the original material into something novel and distinct.

However, practical experiments tell a different story. A study conducted by researchers from Stanford and Yale Universities last month systematically tested this claim. By using strategic prompts, they successfully induced models from OpenAI, Google, Anthropic, and xAI to generate thousands of words directly from well-known books. The list of texts included popular titles like A Game of Thrones, The Hunger Games, and The Hobbit.

The results were remarkably specific. When asked to complete sentences from a novel, certain models reproduced the text with high fidelity. For instance, Gemini 2.5 reproduced 76.8 percent of Harry Potter and the Philosopher’s Stone near-verbatim, while Grok 3 managed 70.3 percent of the same book. These experiments provide concrete, measurable evidence that extensive memorization is not only possible but can be reliably triggered, casting doubt on the industry’s public assurances and legal strategies.
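The article does not describe the researchers' exact scoring method, but the kind of figure it cites (the share of a book a model reproduces verbatim) can be sketched as a simple word n-gram overlap score. This is a hypothetical simplification for illustration only, not the actual methodology used in the study:

```python
def memorized_fraction(reference: str, output: str, n: int = 5) -> float:
    """Fraction of the reference's word n-grams that appear verbatim
    in the model's output. A toy proxy for 'percent reproduced';
    real memorization studies use more careful matching."""
    ref_words = reference.split()
    out_words = output.split()
    if len(ref_words) < n:
        return 0.0
    # All n-grams the model actually emitted.
    out_ngrams = {tuple(out_words[i:i + n])
                  for i in range(len(out_words) - n + 1)}
    # Count how many of the reference's n-grams were emitted exactly.
    ref_ngrams = [tuple(ref_words[i:i + n])
                  for i in range(len(ref_words) - n + 1)]
    hits = sum(1 for gram in ref_ngrams if gram in out_ngrams)
    return hits / len(ref_ngrams)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    print(memorized_fraction(ref, ref, n=3))  # identical text scores 1.0
    print(memorized_fraction(ref, "the quick brown fox sleeps all day", n=3))
```

Under a definition like this, "76.8 percent of the book" would mean roughly that fraction of the text's overlapping n-grams appeared word-for-word in the model's completions.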

(Source: Ars Technica)

Topics

AI memorization, copyright infringement, fair use, large language models, training data, legal lawsuits, AI industry, academic research, model prompting, bestselling novels