Reddit Blocks Internet Archive to Stop AI Data Scraping

Summary
– Reddit is blocking the Internet Archive (IA) from indexing its content after discovering AI firms bypassed scraping restrictions by using IA’s archived data.
– Previously, IA’s Wayback Machine archived Reddit pages, profiles, and comments, but now only screenshots of the homepage will be saved, limiting its usefulness.
– Reddit has not named the AI firms involved but confirmed they violated platform policies by scraping data from the Wayback Machine.
– Reddit suggests the restrictions could be lifted if IA takes steps to prevent AI scraping, but the blocks begin ramping up now.
– Reddit also cites privacy concerns to justify the new restrictions, noting that the Wayback Machine archives content users have deleted.
Reddit has taken steps to prevent AI companies from accessing its data through the Internet Archive, significantly limiting how much content gets preserved for future reference. The platform now blocks the Wayback Machine from archiving individual threads, profiles, and comments, limiting it to capturing only screenshots of the homepage. The move comes after Reddit discovered that AI firms were bypassing its direct scraping restrictions by extracting data from archived versions of its content.
Previously, the Internet Archive served as a comprehensive backup, documenting everything from deleted posts to niche subreddit discussions. Now, its utility for researchers and users seeking historical context has been drastically reduced. Instead of preserving full threads, the Wayback Machine will store only snapshots of the homepage, which show each day's trending headlines and popular posts and offer little insight into user activity or removed content.
While Reddit hasn’t named specific AI companies involved, spokesperson Tim Rathschmidt confirmed that some firms violated platform policies by scraping data indirectly through archived pages. He hinted that the restrictions could be reconsidered if the Internet Archive implements stronger safeguards against unauthorized data collection.
The decision also addresses broader privacy concerns, particularly around deleted content remaining accessible through third-party archives. With these new measures rolling out, Reddit aims to tighten control over its data while balancing transparency and user privacy. The long-term impact on digital preservation and AI training datasets remains to be seen.
(Source: Ars Technica)