Should You Block AI Crawlers or Measure Their Value First?

Summary
– AI crawlers fall into three categories: training bots (e.g., GPTBot) that collect data for model training, search indexing bots (e.g., OAI-SearchBot) that surface sites in LLM search results, and user-triggered fetches (e.g., ChatGPT-User) that retrieve pages on demand.
– Blocking AI bots risks losing brand visibility in LLM answers, while allowing them can lead to high crawl costs and unauthorized use of intellectual property.
– To measure value, analyze log files and referral traffic to identify which bots visit, assess engagement metrics and revenue from LLM-referred visitors, and track citations and brand sentiment in AI-generated answers.
– Build a decision matrix for each bot based on criteria such as whether it provides revenue or visibility, accesses sensitive information, is trustworthy, incurs significant costs, or risks competitive disadvantage if blocked.
– Review your bot block list quarterly, treating each crawler as an individual business case to balance current resource protection with future discoverability.
The cost of letting AI bots crawl your website is no longer just a technical footnote. As these crawlers multiply, the expense of granting unrestricted access to your content is becoming a significant line item for many businesses. The real question isn’t whether to block them, but how to measure their actual value before pulling the plug.
Not all bots are created equal. The bots you likely welcome, like those from Google or Bing, are essential for traditional search visibility. Then there are utility bots run by uptime monitors, analytics services, and security scanners. The easy decisions involve blocking malicious scrapers or bots stealing product data. But AI crawlers occupy a gray area, making the choice far more complex.
The Three Faces of AI Crawlers
Understanding the different types of AI bots is the first step. AI training bots, such as OpenAI’s GPTBot, are the most controversial. They consume content to build knowledge bases for large language models, rarely sending traffic back to your site. This makes it difficult to prove their business value.
Search indexing bots, like OpenAI’s OAI-SearchBot, operate more like traditional search engines. They index your content to surface it in AI-generated search results, offering a clearer path to referral traffic and brand awareness.
User-triggered fetchers, including ChatGPT-User, are a different beast. They retrieve your pages on demand when a user asks about your site. This is a strong signal of direct user interest, placing your brand further down the purchase funnel.
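As a minimal sketch of the three categories above, the following maps the user-agent tokens named in this article to a category. The category labels and the `classify_ai_bot` helper are illustrative assumptions, and because user-agent strings can be spoofed, a match is a hint rather than proof of identity.

```python
from typing import Optional

# User-agent tokens named in the article, mapped to its three categories.
# The category labels are this sketch's own naming, not an official taxonomy.
AI_BOT_CATEGORIES = {
    "GPTBot": "training",
    "OAI-SearchBot": "search_indexing",
    "ChatGPT-User": "user_triggered",
}


def classify_ai_bot(user_agent: str) -> Optional[str]:
    """Return the category of a known AI crawler, or None if unrecognised."""
    ua = user_agent.lower()
    for token, category in AI_BOT_CATEGORIES.items():
        if token.lower() in ua:
            return category
    return None
```

Extending the mapping with other vendors' tokens is a matter of adding entries once you have confirmed them in each vendor's documentation.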
The Blocking Dilemma: Robots.txt Isn’t Enough
Traditionally, SEOs have used robots.txt to control bot access. However, OpenAI now states that robots.txt rules may not apply to ChatGPT-User, because its requests are initiated by a user. Perplexity’s user-triggered fetcher behaves similarly. This means you must turn to server-level or WAF-level blocking for bots that do not follow robots.txt. A web application firewall (WAF), such as the one Cloudflare offers, acts as a gatekeeper, inspecting each request. Server rules can also block bots lacking proper headers or showing signs of automation.
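To illustrate what server-level blocking might look like, here is a minimal sketch of a WSGI middleware that returns 403 to requests whose User-Agent contains a blocked token. The `block_ai_bots` name and the block list are assumptions for illustration; a WAF rule would normally do the same job before the request reaches your application.

```python
# Hypothetical block list; which bots belong here is a business decision.
BLOCKED_TOKENS = ("ChatGPT-User", "Perplexity-User")


def block_ai_bots(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the User-Agent matches a token."""
    lowered = tuple(token.lower() for token in blocked_tokens)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything not on the block list passes through unchanged.
        return app(environ, start_response)

    return middleware
```

Matching on the user agent alone will not stop a crawler that disguises itself, which is why the article also points to header checks and automation signals.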
The risk of blocking all AI bots is significant. You may lose citations in LLM answers, putting your brand at a competitive disadvantage. While referral traffic from LLMs is currently low, they are powerful for raising brand awareness. If your competitor gets cited instead, you lose that exposure. And as LLMs evolve into primary discovery channels, a blocked site today could become functionally invisible tomorrow.
The Cost of Allowing Everything
Conversely, allowing all AI crawlers has real downsides. The most pressing is the crawl cost. AI bots can consume server resources at a ferocious rate. Cloudflare data from June 2025 shows that for every visit Anthropic’s Claude referred to a website, its crawler made about 70,900 page requests. Compare that to Google’s ratio of 9.4 requests per referred visit. This excessive crawling can impact real user experience and inflate hosting fees.
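The ratio quoted above is simple arithmetic: requests made by a crawler divided by the visits its platform sends back. A minimal sketch you could apply to your own counts follows; the function name is an assumption, and the inputs would come from your logs and analytics.

```python
def crawl_to_referral_ratio(crawl_requests: int, referred_visits: int) -> float:
    """Crawler requests per referred visit; infinite if nothing is referred."""
    if referred_visits <= 0:
        return float("inf") if crawl_requests > 0 else 0.0
    return crawl_requests / referred_visits
```

A bot with a very high ratio is taking far more than it gives back in traffic, though traffic is only one of the value signals discussed in this article.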
Then there’s the intellectual property risk. Your proprietary content may be used to train models without compensation or attribution. For publishers and artists, this can directly threaten their business model.
How to Identify and Measure Bot Activity
The biggest hurdle is knowing which bots are visiting. Log files are your most complete source. Download a sample from the past 30 days and identify AI crawlers by their user-agent strings. Tools like log file analyzers or AI visibility trackers can automate this process.
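If you prefer to inspect the logs yourself, a rough sketch like the one below counts hits per AI crawler. It assumes the common "combined" access log format, where the user agent is the last quoted field, and the token list is an illustrative starting point rather than a complete inventory.

```python
import re
from collections import Counter
from typing import Iterable

# Illustrative tokens; extend this with the crawlers you care about.
AI_BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User",
                 "ClaudeBot", "PerplexityBot")

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"\s*$'
)


def count_ai_bot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per AI bot token across access-log lines."""
    hits = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match is None:
            continue  # Skip lines that are not in the expected format.
        user_agent = match.group("user_agent").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits
```

In practice you would pass an open file handle, for example `count_ai_bot_hits(open("access.log"))`, where `access.log` is a placeholder for your own log path.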
If you lack log file access, check referral traffic in your analytics. Google Analytics now has an “AI Assistant” channel classification for ChatGPT, Gemini, and Claude, but it doesn’t capture Perplexity. This method only reveals platforms that have sent traffic, missing bots that crawl without a click-through.
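Where your analytics tool does not group a platform for you, you can tag referrals yourself from exported referrer URLs. The sketch below is one way to do it; the hostname-to-platform mapping is an assumption you should verify against the referrers that actually appear in your own data.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed referrer hostnames; confirm them against your own analytics export.
LLM_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "gemini.google.com": "Gemini",
    "claude.ai": "Claude",
    "perplexity.ai": "Perplexity",
}


def llm_source(referrer_url: str) -> Optional[str]:
    """Return the LLM platform behind a referrer URL, or None."""
    host = (urlparse(referrer_url).hostname or "").lower()
    for domain, platform in LLM_REFERRER_HOSTS.items():
        # Match the domain itself and any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return platform
    return None
```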
To measure value, go beyond session numbers. Compare engagement metrics of LLM-referred visitors to those from other channels. Are they converting? Do they fill out lead forms? Track how often your site appears in AI-generated answers for relevant topics. Also, assess the sentiment of those mentions. A bot that misrepresents your brand could be damaging your reputation.
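The engagement comparison can be as simple as putting conversion rates side by side. The following sketch assumes you can export sessions and conversions per channel; the `ChannelStats` type and `relative_lift` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChannelStats:
    sessions: int
    conversions: int

    @property
    def conversion_rate(self) -> float:
        """Conversions per session, or 0.0 when there are no sessions."""
        return self.conversions / self.sessions if self.sessions else 0.0


def relative_lift(channel: ChannelStats, baseline: ChannelStats) -> float:
    """Channel conversion rate expressed as a multiple of the baseline's."""
    if baseline.conversion_rate == 0:
        return float("inf") if channel.conversion_rate > 0 else 0.0
    return channel.conversion_rate / baseline.conversion_rate
```

With made-up numbers, an LLM channel with 10 conversions from 200 sessions converts at twice the rate of a baseline with 25 conversions from 1,000 sessions, which would argue for keeping the bots that feed it even if raw traffic is small.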
Building a Decision Matrix
The final step is creating a decision matrix for each bot. Ask these questions:
- Does this bot provide converting revenue or useful visibility?
- Does it access sensitive information?
- Is it trustworthy?
- Does it incur significant costs?
- Would blocking it put you at a competitive disadvantage?
Based on the answers, classify each bot into one of three categories:
- Keep: Provides measurable value that outweighs costs.
This is not a “set it and forget it” exercise. Review your block list quarterly to adapt to new bots and changing platform value. The key takeaway is to treat each AI crawler as an individual business case. Measure its cost, assess the visibility it provides, and make a deliberate, informed decision. This approach protects both your current resources and your future discoverability.
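As a closing illustration, the decision matrix can be sketched as a small function over the criteria listed in the summary. Only the "keep" label comes from the article; the "block" and "review" labels and the order of the rules are assumptions made for this sketch, so adjust them to your own policy.

```python
from dataclasses import dataclass


@dataclass
class BotAssessment:
    name: str
    provides_revenue_or_visibility: bool
    accesses_sensitive_information: bool
    is_trustworthy: bool
    incurs_significant_cost: bool
    blocking_risks_competitive_disadvantage: bool


def decide(bot: BotAssessment) -> str:
    """Return 'keep', 'block', or 'review' (the last two are assumed labels)."""
    # Assumed rule: sensitive access or lack of trust overrides everything.
    if bot.accesses_sensitive_information or not bot.is_trustworthy:
        return "block"
    # Value without significant cost matches the article's "Keep" category.
    if bot.provides_revenue_or_visibility and not bot.incurs_significant_cost:
        return "keep"
    # Cost with no value and no downside to blocking: assumed to be a block.
    if (not bot.provides_revenue_or_visibility
            and bot.incurs_significant_cost
            and not bot.blocking_risks_competitive_disadvantage):
        return "block"
    # Everything else is a judgment call for the quarterly review.
    return "review"
```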




