RSS Co-Creator Unveils New AI Data Licensing Protocol

Summary
– The AI industry faces up to 40 pending copyright lawsuits over unlicensed training data, including a case against Midjourney for creating Superman images.
– Real Simple Licensing (RSL) has been launched by technologists and web publishers to enable scalable data licensing, backed by major sites like Reddit and Yahoo.
– RSL includes both technical protocols for machine-readable licensing terms in robots.txt files and a legal collective for negotiating royalties and terms.
– The system allows publishers to set custom or Creative Commons terms and provides a collective option for smaller publishers unable to negotiate individual deals.
– A key challenge is tracking when specific data is used in AI training, but RSL creators believe companies can develop adequate reporting systems to facilitate payments.
The artificial intelligence sector faces a mounting legal challenge concerning the data used to train its models. Following a landmark $1.5 billion copyright settlement by Anthropic, the industry is under pressure to address how it sources and compensates for training materials. With dozens of lawsuits pending, including one targeting Midjourney for generating unlicensed Superman imagery, the absence of a clear licensing framework threatens to stifle innovation through protracted legal battles.
A new initiative aims to provide that framework. Real Simple Licensing (RSL), developed by a coalition of technologists and publishers, offers a scalable system for data licensing that could help AI companies and content creators reach mutually beneficial agreements. Already supported by major platforms like Reddit, Quora, and Yahoo, RSL introduces both technical and legal mechanisms to streamline permissions and payments across the web.
Eckart Walther, a co-creator of both RSS and RSL, emphasizes the need for machine-readable licensing agreements online. “That’s really what RSL solves,” he explains. The protocol allows publishers to specify licensing terms within their robots.txt files, clarifying whether AI firms need custom agreements or can rely on existing structures like Creative Commons.
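To make the robots.txt mechanism concrete, here is a minimal Python sketch of how a crawler might look for a licensing pointer before collecting a site's content. The article does not spell out the syntax, so the `License` directive name, the example.com URL, and the `find_license_urls` helper are all illustrative assumptions rather than the protocol's confirmed format.

```python
from typing import List


def find_license_urls(robots_txt: str, directive: str = "License") -> List[str]:
    """Return the value of every `<directive>: <value>` line in a robots.txt body.

    The default directive name is an assumption made for illustration; check
    the published RSL specification for the actual syntax.
    """
    urls: List[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        # (Simplification: this would also truncate a URL containing '#'.)
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split only on the first colon so that "https://..." stays intact.
        key, value = line.split(":", 1)
        if key.strip().lower() == directive.lower():
            value = value.strip()
            if value:
                urls.append(value)
    return urls


# Hypothetical robots.txt content, not taken from any real site.
SAMPLE = """\
User-agent: *
Allow: /

# Hypothetical pointer to machine-readable licensing terms
License: https://example.com/license.xml
"""

if __name__ == "__main__":
    print(find_license_urls(SAMPLE))
```

A crawler following this pattern would fetch the referenced document and read the terms it declares (custom agreement, Creative Commons, and so on) before deciding whether it may use the content for training.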
On the legal front, the newly formed RSL Collective functions as a centralized body for negotiation and royalty distribution, drawing inspiration from collective rights organizations in music and film. This approach simplifies the process for both licensees and rights holders, especially smaller publishers who lack the leverage to negotiate individual deals.
Several prominent publishers have already joined the collective, including Yahoo, Medium, O’Reilly Media, Ziff Davis, Internet Brands, People Inc., and The Daily Beast. Others, such as Fastly and Adweek, endorse the standard without formal membership. Notably, Reddit, which already earns an estimated $60 million annually from Google for data licensing, participates in the system while maintaining its existing agreements.
A significant challenge lies in tracking usage. Unlike music royalties, which are logged per play, AI training data is often absorbed without clear attribution. Some licenses even propose per-inference payments, adding complexity to an already opaque process. Still, RSL co-founder Doug Leeds remains optimistic, noting that some AI firms already possess the tracking capabilities required for compliance. “It doesn’t have to be perfect,” he says. “It just has to be good enough to get people paid.”
The real test will be whether AI companies adopt the system. While firms like Scale AI and Mercor demonstrate a willingness to pay for quality data, many labs still rely on free resources like Common Crawl. Distinguishing between legitimate scraping and machine-enhanced browsing remains difficult, as recent disputes between Cloudflare and Perplexity illustrate.
Yet Leeds points to public statements from AI leaders, including Google’s Sundar Pichai, advocating for standardized licensing. “They have said outwardly to everyone, something like this needs to exist,” he notes. With RSL now operational, the industry may finally have the system it claims to need.
(Source: TechCrunch)