Britannica Sues OpenAI Over ChatGPT ‘Memorization’

Summary
– Encyclopedia Britannica and Merriam-Webster sued OpenAI on Friday, alleging it used their copyrighted content to train its AI models.
– The lawsuit claims OpenAI’s GPT-4 has “memorized” and can output near-verbatim copies of Britannica’s content without permission.
– Britannica argues OpenAI’s responses directly compete with its own content, “cannibalizing” web traffic instead of directing users to its site.
– This case is part of a growing trend of copyright lawsuits from publishers against AI companies over training data.
– Similar legal actions include The New York Times’ lawsuit against OpenAI and a $1.5 billion settlement by Anthropic with authors.
OpenAI faces a significant new legal challenge: two of the world’s most trusted reference publishers, Encyclopedia Britannica and Merriam-Webster, have filed a lawsuit against the company. The core allegation is that OpenAI used their copyrighted material to train its AI models, producing outputs that are nearly identical to the original protected content. The case highlights the escalating tension between content creators and AI developers over the use of copyrighted data for training.
According to the legal filing, OpenAI engaged in repeated and unauthorized copying of Britannica’s content. The publishers assert that GPT-4 has effectively “memorized” substantial portions of their copyrighted encyclopedia entries and can reproduce them almost verbatim when prompted. The lawsuit argues that this memorized text constitutes unauthorized copies of works that were used directly in training OpenAI’s models. The complaint includes side-by-side comparisons in which responses from OpenAI’s models match Britannica’s text word for word across entire passages.
Beyond the issue of direct copying, the lawsuit presents a broader commercial grievance. Britannica claims that OpenAI’s practices are actively harming its business by “cannibalizing” its web traffic. Instead of functioning like a traditional search engine that directs users to the original source, ChatGPT generates responses that directly substitute for or compete with Britannica’s own content. This, the publishers argue, diverts potential visitors and undermines their revenue model.
This legal action is part of a rapidly expanding wave of copyright litigation targeting AI companies. The New York Times is pursuing a similar case against OpenAI, accusing it of copying vast quantities of its journalism. Last September, Anthropic settled a major class-action lawsuit over its use of copyrighted books for AI training, agreeing to a $1.5 billion settlement fund for the affected authors. Together, these cases underscore a fundamental and unresolved question: how far AI developers may go in using copyrighted data to build generative AI.
(Source: The Verge)





