Reddit Blocks Internet Archive to Stop AI Data Scraping

Summary

– Reddit is blocking the Internet Archive (IA) from indexing its content after discovering AI firms bypassed scraping restrictions by using IA’s archived data.
– Previously, IA’s Wayback Machine archived Reddit pages, profiles, and comments, but now only screenshots of the homepage will be saved, limiting its usefulness.
– Reddit has not named the AI firms involved but confirmed they violated platform policies by scraping data from the Wayback Machine.
– Reddit suggests the restrictions could be lifted if IA takes steps to prevent AI scraping, but the new blocks begin rolling out now.
– Reddit also cites privacy concerns to justify the restrictions, noting that the Wayback Machine preserves content users have deleted.

Reddit has taken steps to prevent AI companies from accessing its data through the Internet Archive, significantly limiting how much content gets preserved for future reference. The platform now blocks the Wayback Machine from archiving individual threads, profiles, and comments, restricting it to only capturing screenshots of the homepage. This move comes after Reddit discovered AI firms bypassing direct scraping restrictions by extracting data from archived versions of its content.

Previously, the Internet Archive served as a comprehensive backup, documenting everything from deleted posts to niche subreddit discussions. Now, its utility for researchers and users seeking historical context has been drastically reduced. Instead of preserving full threads, the Wayback Machine will only store daily snapshots of trending headlines and popular posts, offering little insight into user activity or removed content.

While Reddit hasn’t named specific AI companies involved, spokesperson Tim Rathschmidt confirmed that some firms violated platform policies by scraping data indirectly through archived pages. He hinted that the restrictions could be reconsidered if the Internet Archive implements stronger safeguards against unauthorized data collection.

The decision also addresses broader privacy concerns, particularly around deleted content remaining accessible through third-party archives. With these new measures rolling out, Reddit aims to tighten control over its data while balancing transparency and user privacy. The long-term impact on digital preservation and on AI training datasets remains to be seen.

(Source: Ars Technica)
