Reddit CEO Says LLMs Depend on Reddit Data

Summary
– Reddit CEO Steve Huffman stated that large language models would not exist as they do without Reddit’s user-generated content, which he called “modern oil” for AI.
– Reddit has data licensing agreements with Google and OpenAI, but has sued Anthropic, Perplexity, and data-scraping firms for unauthorized use of its content.
– Huffman said Reddit’s openness to data sharing changed because AI companies moved away from open research, making it difficult to track how data was used.
– Reddit uses AI for its “Reddit Answers” search feature, which presents verbatim user quotes, and for content moderation, but not as a replacement for community moderation.
– Huffman noted that users posting AI-written content is a challenge, but Reddit will let communities downvote and reject such content rather than creating a specific policy.
Reddit CEO Steve Huffman made a bold declaration at the Fast Company Most Innovative Companies Summit, asserting that large language models (LLMs) “would not exist as we know them” without the platform’s user-generated content. He described Reddit’s data as “modern oil” for the artificial intelligence industry, a resource he claims is irreplaceable in the current AI ecosystem.
Huffman explained that Reddit is a foundational source of training data for major AI systems. “There’s no artificial intelligence without actual intelligence,” he said, noting that models essentially “regurgitate on an absolutely massive scale” what they consume. He pointed to Reddit’s natural, human conversation covering nearly every topic as a unique advantage. He also cited data from Profound, a firm that tracks AI citations, to back up his claim that Reddit is the most cited platform across all models.
The company has taken a dual approach to handling AI firms. In 2024, Reddit signed data licensing agreements with Google and OpenAI. Huffman described these as collaborative partnerships, allowing Reddit to set “guard rails on use and access to our data on behalf of our users” while working together on next-generation products. For companies unwilling to negotiate, Reddit has turned to litigation. It sued Anthropic in California Superior Court for unauthorized use of content and filed a federal lawsuit against Perplexity and three data-scraping firms in the Southern District of New York, alleging DMCA anti-circumvention violations. Huffman summed up the policy simply: “Commercial use of our data requires commercial terms.”
This shift in stance, Huffman said, stems from the AI industry’s move away from open research. Historically, Reddit was open and permissive with its data, but as AI companies became less transparent, Reddit could no longer track how its content was being used. Huffman stressed that Reddit wants to prevent its data from being used to identify users, target them with ads, or replace the platform entirely. While commercial access is now restricted, Reddit still offers free data to researchers and universities for non-commercial purposes.
Huffman acknowledged a “paradox” in Reddit’s own use of AI. The platform powers external AI systems while also deploying AI internally. The most visible example is Reddit Answers, an LLM-powered search feature that reads posts and comments and organizes them into responses using verbatim user quotes. Huffman emphasized that it presents multiple perspectives, preserving the human element. Behind the scenes, AI assists with content moderation and classification, evaluating whether comments cross into bullying. Huffman framed this as reducing exposure to the worst content, not replacing community moderation. “The worst job on the internet used to be looking at the worst content on the internet and deciding whether it could be online or not,” he said. “That job just goes away.”
Another challenge Huffman addressed is users writing posts with AI tools and pasting them into Reddit. He distinguished this from bot activity, noting that a human is still behind the idea. When users rely on AI to compose posts, he admitted, “the writing sucks.” Rather than creating a new policy, Reddit will let its community handle it. Users are already downvoting and calling out AI-written content, and Huffman said Reddit will “empower the users more and the subreddits more to just reject that sort of content altogether.” He compared the situation to calculators in math class, suggesting society needs to learn alongside the technology.
Huffman’s comments reinforce Reddit’s position that its user discussions are a core input for AI systems. The company is exploring new data deals, though Huffman did not announce a third agreement. Lawsuits against Anthropic and Perplexity remain ongoing, with the Anthropic case subject to a federal court remand hearing in March. Reddit’s approach, balancing partnerships with legal action and letting communities police AI-generated content, sets it apart from platforms that rely on automated detection tools.
(Source: Search Engine Journal)




