
Microsoft Removes AI Training Guide Using Pirated Harry Potter

Summary

– Microsoft deleted a blog post after criticism on Hacker News that it encouraged pirating Harry Potter books to train AI models.
– The post, by a senior Microsoft product manager, promoted a new Azure feature for easily adding generative AI to applications.
– It suggested using a Harry Potter book dataset as a relatable example to build Q&A systems or generate fan fiction.
– The linked dataset on Kaggle was incorrectly marked as public domain and was deleted after Ars Technica contacted the uploader.
– Commenters noted the dataset had flown under the radar of copyright holder J.K. Rowling, likely due to its relatively low download count.

Microsoft has taken down a blog post that provided a technical guide for training AI models, following criticism that it pointed developers toward a pirated collection of the Harry Potter book series. The post, which was intended to showcase a new Azure feature for integrating generative AI into applications, has been removed after a discussion on Hacker News highlighted the problematic nature of its suggested dataset. This incident underscores the ongoing tension between the rapid advancement of AI development and the critical need to respect intellectual property rights.

The now-deleted article was authored in late 2024 by a senior product manager, Pooja Kamath, who has a long tenure at the company. The post aimed to demonstrate a tool designed to simplify adding AI capabilities using Azure SQL Database and LangChain. To illustrate the feature with what it called “engaging and relatable examples,” the guide proposed using the globally recognized Harry Potter novels as a training dataset. It suggested developers could build question-answering systems or even generate new fan fiction, activities described as being “sure to delight Potterheads.”

To facilitate this, the Microsoft blog directly linked to a dataset on Kaggle containing the text of all seven books. The dataset was labeled as “public domain,” which is incorrect: the books remain under copyright, and J.K. Rowling actively enforces her rights. Ars Technica confirmed the dataset had been available online for several years. While Kaggle’s policies allow for the removal of infringing material upon notice from rights holders, commenters noted that the relatively modest download count, around 10,000, may have kept it from attracting the author’s direct attention. Shortly after Ars contacted the uploader, a data scientist in India with no connection to Microsoft, the dataset was removed from the platform.

The swift deletion of both the Microsoft guide and the Kaggle dataset highlights the legal and ethical pitfalls companies can encounter when promoting AI training methods. Using copyrighted material without proper licensing remains a significant legal risk for AI development, even when the intent is purely educational or demonstrative. This episode serves as a cautionary tale for developers and corporations alike, emphasizing that the pursuit of accessible and compelling AI examples must be carefully balanced with a firm commitment to copyright law.

(Source: Ars Technica)
