Google to Document Most Common Unsupported robots.txt Rules

▼ Summary
– Google plans to document the 10 to 15 most common unsupported robots.txt rules based on real-world data from HTTP Archive, rather than adding only the two tags proposed by a community member.
– The research used a custom JavaScript parser to extract robots.txt rules line by line from HTTP Archive data, after the team discovered that HTTP Archive's default crawl doesn't request robots.txt files.
– Beyond the four supported fields (user-agent, allow, disallow, sitemap), usage of other rules drops drastically into a long tail of less common directives and junk data.
– Google may expand its typo tolerance for common misspellings of the disallow rule, based on analysis of real-world robots.txt data.
– The update will affect Google’s public documentation, making it better reflect the unrecognized tags already surfaced in Search Console, and may prompt webmasters to audit their robots.txt files.
Google is preparing to update its documentation on unsupported robots.txt rules, based on real-world data collected through HTTP Archive. The move could help webmasters better understand which directives Google ignores and why.
On a recent episode of Search Off the Record, Google’s Gary Illyes and Martin Splitt detailed the project. It began when a community member submitted a pull request to Google’s robots.txt repository, suggesting two new tags be added to the list of unsupported directives. Rather than limit the update to just those two, Illyes explained the team decided to take a broader, data-driven approach:
“We tried to not do things arbitrarily, but rather collect data.”
The goal was to identify the top 10 or 15 most-used unsupported robots.txt rules in the wild, creating what Illyes called “a decent starting point, a decent baseline” for documentation.
To gather this data, the team turned to HTTP Archive, which runs monthly crawls across millions of URLs using WebPageTest. The initial plan hit a snag: the default crawl doesn’t request robots.txt files, so the datasets lacked that content. After consulting with Barry Pollard and the HTTP Archive community, the team built a custom JavaScript parser to extract robots.txt rules line by line. The custom metric was integrated before the February crawl, and the data now lives in the custom_metrics dataset in Google BigQuery.
The parser captured every line matching a field-colon-value pattern. Illyes described the distribution as heavily skewed: “After allow and disallow and user agent, the drop is extremely drastic.” Beyond those three fields, usage falls into a long tail of less common directives, mixed with junk data from broken files returning HTML instead of plain text.
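The extraction Illyes describes can be approximated in a few lines. This is a minimal sketch, not HTTP Archive's actual custom metric (which was written in JavaScript): it matches any field-colon-value line, lowercases the field name, and tallies occurrences, which is enough to surface the skewed distribution he mentions while skipping most junk from files that return HTML.

```python
import re
from collections import Counter

# A "field: value" line; restricting field names to letters and hyphens
# also filters out most junk (e.g. stray HTML) from broken files.
FIELD_RE = re.compile(r"^\s*([A-Za-z-]+)\s*:\s*(.*)$")

def count_fields(robots_txt: str) -> Counter:
    """Count how often each field name appears in a robots.txt body."""
    counts = Counter()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        m = FIELD_RE.match(line)
        if m:
            counts[m.group(1).lower()] += 1
    return counts

sample = """\
User-agent: *
Disallow: /private/
Allow: /private/public.html
Crawl-delay: 10
<html><body>not a robots.txt</body></html>
"""
print(count_fields(sample).most_common())
```

Run against millions of files, a tally like this is what produces the long tail the team observed: a huge spike for the core fields, then a steep drop into rarely used directives and noise.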
Currently, Google supports only four fields in robots.txt: user-agent, allow, disallow, and sitemap. The documentation notes that other fields “aren’t supported” but doesn’t list which ones are most common. Google has clarified that unsupported fields are simply ignored. The new project extends that clarity by identifying specific rules for documentation.
Illyes said the top 10 to 15 most-used rules beyond those four will likely be added to Google’s list of unsupported directives. He did not name specific rules, but the analysis also surfaced common misspellings of the disallow rule. Illyes hinted at expanding typo tolerance: “I’m probably going to expand the typos that we accept.” He didn’t commit to a timeline or name specific typos.
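Typo tolerance of the kind Illyes hints at amounts to a normalization step before matching field names. The misspellings below are illustrative guesses, not a list Google has confirmed (Illyes named none):

```python
# Hypothetical misspellings of "disallow" -- illustrative only,
# not the set Google's parser actually accepts.
DISALLOW_TYPOS = {"disalow", "dissallow", "dissalow", "disallaw"}

def normalize_field(field: str) -> str:
    """Map known misspellings of 'disallow' to the canonical field name."""
    f = field.strip().lower()
    return "disallow" if f in DISALLOW_TYPOS else f

print(normalize_field("Dissallow"))  # disallow
```

Expanding tolerance would simply mean growing that set, so files with common typos still block crawling as their authors intended.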
The update matters because Google Search Console already surfaces some unrecognized robots.txt tags. By documenting more unsupported directives, Google can make its public resources better reflect what users already see. For anyone maintaining a robots.txt file with rules beyond user-agent, allow, disallow, and sitemap, now is a good time to audit for directives that have never worked for Google.
The HTTP Archive data is publicly queryable on BigQuery, giving anyone the chance to examine the distribution directly.
(Source: Search Engine Journal)