Reddit Blocks Internet Archive for AI Data

Reddit Blocks Internet Archive to Curtail Unauthorized AI Data Harvesting

By kellyNii Last updated Aug 12, 2025

Reddit has moved to restrict the Internet Archive’s Wayback Machine from archiving nearly all its content after discovering artificial intelligence companies were scraping Reddit data from the historical preservation service. The decision marks a significant escalation in Reddit’s campaign to monetize and control access to its user-generated content in the age of generative AI.

Under the new restrictions, the Wayback Machine can only archive Reddit’s homepage, preventing access to post details, comments, and user profiles. This drastically limits the archive’s utility, reducing it to a snapshot of daily popular headlines rather than a comprehensive record of discussions, subcultures, or deleted content.

The Scraping Backdoor

According to Reddit spokesperson Tim Rathschmidt, the company observed AI firms extracting data from Internet Archive repositories after being blocked from scraping Reddit directly. “We’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” Rathschmidt stated. He emphasized that until the Internet Archive can “defend their site and comply with platform policies,” including respecting content removal requests for privacy, the restrictions will remain.

The Internet Archive confirmed ongoing discussions with Reddit but hasn’t detailed potential technical countermeasures against AI scraping. “We have a longstanding relationship with Reddit,” said Wayback Machine director Mark Graham, signaling hopes for a resolution.

Data Licensing: Reddit’s New Revenue Frontier

This move aligns with Reddit’s aggressive strategy to convert its vast data trove into revenue. The platform has:

Reddit Now Requires ID for UK Users to See Adult Content

Struck lucrative deals with Google ($60 million) and OpenAI for AI training data licensing.
Blocked non-paying search engines like Bing and DuckDuckGo from crawling its content.
Sued AI startup Anthropic in June for allegedly scraping posts despite claims of compliance.

Reddit anticipates earning over $200 million from data licensing within three years. As CE, O Steve Huffman asserted, “Commercial use requires commercial terms”.

Digital Preservation vs. Corporate Control

The restriction highlights growing tension between open web preservation and commercial data control. The Wayback Machine has archived over 866 billion web pages since 2001, safeguarding content from link rot 38% of 2013’s web pages are already inaccessible. Researchers, journalists, and users frequently rely on it to track misinformation, study community evolution, or recover deleted discussions.

“Reddit’s decision incinerates historical context,” argued tech historian Claire Evans. “Subreddits document everything from medical support communities to grassroots movements. Letting corporations veto preservation sets a dangerous precedent.”

While Reddit frames this as privacy protection, critics note the timing coincides with its AI licensing push. “User privacy is a valid concern, but the primary motive is profit,” said Stanford Law researcher Evan Brown.

The Internet Archive now faces a dilemma: develop costly scraping safeguards or accept limited access. Meanwhile, Reddit’s stance may inspire similar actions by other platforms, potentially fragmenting the internet’s collective memory.

As AI’s hunger for training data intensifies, the balance between innovation, preservation, and ethics remains fiercely contested. Reddit’s blockade signals that in today’s data economy, even digital history has a price tag.

Subscribe to my whatsapp channel