OpenAI Faces Scrutiny Over Deleted Pirated Book Data

Summary
– OpenAI deleted two datasets, “Books 1” and “Books 2,” which were created from web-scraped data including pirated books from LibGen, before ChatGPT’s 2022 release.
– Authors in a class-action lawsuit allege ChatGPT was illegally trained on their works, and OpenAI’s deletion of these datasets could be a key factor in the case.
– OpenAI initially stated the datasets were deleted due to “non-use” but later retracted that claim and argued all deletion reasons should be protected by attorney-client privilege.
– A US district judge has ordered OpenAI to share internal communications about the deletion and references to LibGen that were previously withheld.
– The authors believe OpenAI’s shifting explanations suggest there is more to the story, and the court’s discovery order may reveal the true reasons for deleting the datasets.
The legal battle between OpenAI and a group of authors is intensifying, with a recent court order potentially forcing the company to reveal confidential communications about its deletion of key training data. This development could significantly affect the outcome of the lawsuit, which alleges that ChatGPT was trained on pirated books. The dispute centers on two datasets, known as “Books 1” and “Books 2,” which were compiled through web scraping and included material from the shadow library Library Genesis (LibGen). OpenAI deleted these datasets before ChatGPT’s public launch, a move now under intense judicial scrutiny.
OpenAI has stated the datasets were simply no longer in use, framing their removal as a routine internal decision. However, the plaintiffs argue the company’s narrative has been inconsistent. They point to OpenAI initially citing “non-use” as a reason for deletion, then retracting that claim, and later asserting that all reasons for deletion, including “non-use,” should be protected by attorney-client privilege. This shifting stance occurred after the court granted the authors permission to examine internal discussions regarding the datasets’ purported “non-use.”
The authors’ legal team contends that OpenAI’s backtracking only heightened suspicions about what its internal communications might reveal. This perspective gained traction with the court. Last week, U.S. District Judge Ona Wang issued a consequential order, directing OpenAI to produce all communications with its in-house lawyers concerning the deletion of the “Books” datasets. The order also compels the company to share “all internal references to LibGen that OpenAI has redacted or withheld on the basis of attorney-client privilege.”
Judge Wang’s decision highlighted a critical contradiction in OpenAI’s legal arguments. She noted the company erred by claiming that “non-use” was not a “reason” for deleting the data, while simultaneously insisting that it should be considered a privileged “reason.” This legal misstep has opened the door for the plaintiffs to access potentially damning evidence. If the uncovered communications suggest the data was deleted to obscure its role in training ChatGPT, it could provide powerful support for the authors’ claims of copyright infringement. The information revealed in these documents may ultimately prove decisive in determining whether OpenAI improperly used pirated content to build its flagship AI model.
(Source: Ars Technica)
