
Reddit Blocks Internet Archive to Stop AI Data Scraping

Summary

– Reddit is blocking the Internet Archive (IA) from indexing its content after discovering AI firms bypassed scraping restrictions by using IA’s archived data.
– Previously, IA’s Wayback Machine archived Reddit pages, profiles, and comments, but now only screenshots of the homepage will be saved, limiting its usefulness.
– Reddit has not named the AI firms involved but confirmed they violated platform policies by scraping data from the Wayback Machine.
– Reddit suggests the restrictions could be lifted if IA takes steps to prevent AI scraping, but for now the blocks are taking effect and will tighten over time.
– Reddit also cites privacy concerns to justify the new restrictions, noting that the Wayback Machine archives content that users have since deleted.

Reddit has taken steps to prevent AI companies from accessing its data through the Internet Archive, significantly limiting how much content gets preserved for future reference. The platform now blocks the Wayback Machine from archiving individual threads, profiles, and comments, restricting it to capturing only screenshots of the homepage. This move comes after Reddit discovered AI firms bypassing direct scraping restrictions by extracting data from archived versions of its content.

Previously, the Internet Archive served as a comprehensive backup, documenting everything from deleted posts to niche subreddit discussions. Now, its utility for researchers and users seeking historical context has been drastically reduced. Instead of preserving full threads, the Wayback Machine will only store daily snapshots of trending headlines and popular posts, offering little insight into user activity or removed content.

While Reddit hasn’t named specific AI companies involved, spokesperson Tim Rathschmidt confirmed that some firms violated platform policies by scraping data indirectly through archived pages. He hinted that the restrictions could be reconsidered if the Internet Archive implements stronger safeguards against unauthorized data collection.

The decision also addresses broader privacy concerns, particularly around deleted content remaining accessible through third-party archives. With these new measures rolling out, Reddit aims to tighten control over its data while balancing transparency and user privacy. The long-term impact on digital preservation and on AI training datasets remains to be seen.

(Source: Ars Technica)


