Major News Outlets Block AI Bots from Training on Content

▼ Summary
– A large majority (79%) of the top 100 US and UK news publishers block at least one AI training bot, and 71% also block at least one retrieval bot, which can keep their content out of AI-generated answers.
– Blocking retrieval bots like OAI-SearchBot or Claude-Web means a site may not be cited by AI tools, even if its content was used to train the underlying model.
– The robots.txt file is a request rather than an enforceable barrier, as demonstrated by Perplexity allegedly using stealth techniques to bypass these blocks.
– There is a significant enforcement gap, making CDN-level blocking or bot fingerprinting more effective for publishers serious about stopping AI crawlers.
– The high rate of retrieval bot blocking is notable because it affects current AI answer visibility, while training blocks only impact future model development.
A significant majority of leading news organizations are now actively preventing artificial intelligence systems from using their content for training purposes. However, this defensive strategy carries a notable side effect: many of these same publishers are also blocking the specialized bots that AI tools use to find and cite sources in real time, potentially making their journalism invisible in AI-generated answers. A recent analysis of the top 100 news websites in the United States and United Kingdom reveals that 79% block at least one AI training bot, while 71% also block at least one retrieval or live search bot. This dual-layer blocking creates a complex landscape in which publishers may be withholding their content from both the development of future AI models and the current tools that could drive traffic through citations.
The study, which examined the robots.txt files of these major sites, categorized the automated bots into three groups: those that gather data for model training, those that retrieve information for live user queries, and those that index web pages for search. The findings show a clear trend of resistance. Among training bots, Common Crawl’s CCBot was blocked by 75% of sites. Other major AI crawlers, including anthropic-ai (72%), ClaudeBot (69%), and OpenAI’s GPTBot (62%), also faced high blockage rates. Interestingly, Google-Extended, the token that governs whether content can be used to train Google’s Gemini models, was the least blocked overall at 46%, with a significant disparity between US publishers (58% blockage) and UK publishers (29%).
The rationale for this widespread blocking is often rooted in business survival. As one SEO director explained, publishers see little value in allowing large language models to train on their content because these AI systems are not designed to send referral traffic. For news organizations that still rely heavily on web visits for revenue, the exchange feels fundamentally one-sided. This economic reality is driving the technical decision to issue “disallow” directives in their robots.txt files.
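To make the mechanics concrete, these directives are plain-text rules keyed to each crawler’s user-agent token. The snippet below is a minimal sketch rather than any publisher’s actual file; it parses an illustrative set of rules with Python’s standard urllib.robotparser to show how a compliant training crawler would check them before fetching a page.

```python
from urllib import robotparser

# Illustrative robots.txt rules refusing the AI training crawlers named in
# the study; the rules are hypothetical, not copied from any publisher.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A compliant crawler checks its own token before fetching; each of these
# prints False, meaning the page is off limits to that bot.
for agent in ("GPTBot", "CCBot", "anthropic-ai", "Google-Extended"):
    print(agent, parser.can_fetch(agent, "https://example.com/article.html"))
```

Non-compliant crawlers, as discussed below, can simply skip this check.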
Perhaps more impactful for immediate visibility is the blocking of retrieval bots. These are the systems that fetch live web content when a user asks an AI assistant such as ChatGPT or Claude a question. The analysis found that Claude-Web was blocked by 66% of sites, while OpenAI’s OAI-SearchBot was blocked by 49%. By blocking retrieval bots as well, publishers are opting out not only of future model training but also of the citation layer that allows AI tools to surface their work as a source. If a site blocks these retrieval bots, its content may not appear in AI-generated answers, even if the underlying model was trained on that site’s earlier content.
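One way to see a given site’s current posture is to read its live robots.txt and test the retrieval-bot tokens named in the analysis. The sketch below again uses the standard-library urllib.robotparser; the site URL is a placeholder, not any specific publisher.

```python
from urllib import robotparser

# Retrieval bots named in the study; the site below is a placeholder.
RETRIEVAL_BOTS = ("OAI-SearchBot", "Claude-Web")
SITE = "https://www.example.com"

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live file over HTTP

for bot in RETRIEVAL_BOTS:
    verdict = "allowed" if rp.can_fetch(bot, f"{SITE}/") else "blocked"
    print(f"{bot}: {verdict}")
```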
A critical limitation of this entire approach is that a robots.txt file is merely a request, not an enforceable barrier. It functions like a “please keep out” sign on an open field; a bot can simply choose to ignore it. Industry experts point out that many crawlers disregard these directives outright. This enforcement gap was highlighted recently when Cloudflare documented that Perplexity AI had used techniques such as rotating IP addresses and spoofing its user agent to bypass robots.txt restrictions, prompting Cloudflare to block the bot; Perplexity disputed the findings. For publishers determined to stop AI crawlers, more robust measures such as blocking at the content delivery network (CDN) level or bot fingerprinting may be necessary.
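As a rough illustration of what enforcement beyond robots.txt can look like, the sketch below refuses requests whose User-Agent header contains a known AI-crawler token. It works at the application layer rather than through any particular CDN’s rule engine, the token list is an assumption for the example, and, as the Perplexity episode shows, a spoofed user agent would slip past this check unless it were paired with IP-range verification or fingerprinting.

```python
from wsgiref.simple_server import make_server

# User-agent substrings treated as AI crawlers in this sketch; a production
# setup would typically express the same rule at the CDN or WAF edge and
# pair it with IP-range and fingerprint checks.
BLOCKED_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "anthropic-ai", "Claude-Web")

def app(environ, start_response):
    user_agent = environ.get("HTTP_USER_AGENT", "").lower()
    if any(token.lower() in user_agent for token in BLOCKED_TOKENS):
        # Refuse the request outright instead of relying on the bot's goodwill.
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"AI crawler access denied\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Article content served to regular readers\n"]

if __name__ == "__main__":
    with make_server("", 8000, app) as server:
        server.serve_forever()
```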
The implications are substantial. Because bot functions are separated, a publisher’s blocking choices require precision: blocking a training bot like GPTBot does not automatically block OpenAI’s separate retrieval bot, OAI-SearchBot. Consequently, a news site’s decision to block retrieval bots directly shapes which sources AI tools can cite, and with it, public access to authoritative information within AI platforms. The differing blockage rates for Google-Extended also suggest that publishers in different regions are making varying strategic calculations about their relationship with a dominant tech giant.
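The distinction is easy to demonstrate: feeding a hypothetical robots.txt that names only GPTBot into urllib.robotparser shows the training crawler refused, while OAI-SearchBot, which has no rule of its own and no wildcard entry to fall back on, remains free to fetch and cite the page.

```python
from urllib import robotparser

# Hypothetical robots.txt that names only OpenAI's training crawler.
RULES = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

page = "https://example.com/story.html"
print("GPTBot:       ", rp.can_fetch("GPTBot", page))         # False: training crawl blocked
print("OAI-SearchBot:", rp.can_fetch("OAI-SearchBot", page))   # True: retrieval still allowed
```

Blocking both therefore requires an explicit rule for each token, or a deliberate wildcard entry.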
Looking forward, the reliance on robots.txt alone appears insufficient for those serious about controlling AI access. The retrieval bot category is particularly crucial to monitor, as these blocks determine whether content appears in AI answers today, whereas training blocks affect the models of tomorrow. As the tension between content creators and AI developers continues, the technical arms race over web crawling is likely to intensify, with publishers seeking more definitive tools to protect and control their valuable digital assets.
(Source: Search Engine Journal)