Cloudflare Outage Disrupts Internet Due to Bot Management Error

▼ Summary
– Cloudflare initially believed a major DDoS attack caused the outage, suspecting involvement from the Aisuru botnet.
– The actual cause was an internal issue where a database permissions change caused a feature file to double in size and propagate across the network.
– This oversized file caused software failures in Cloudflare’s bot management system, affecting core CDN, security, and other services.
– Cloudflare resolved the issue by stopping the file propagation and replacing it with an earlier version, restoring core traffic flow.
– Full recovery took an additional 2.5 hours while Cloudflare managed the increased network load as traffic returned to normal.
A significant internet disruption rippled across countless websites and online services yesterday, stemming from an internal error at Cloudflare rather than the external cyberattack initially suspected. The company’s CEO, Matthew Prince, admitted that his first reaction was to fear a massive distributed denial-of-service (DDoS) assault, even speculating internally that the Aisuru botnet might be involved. The real culprit, however, turned out to be a database configuration change that caused a critical internal file to unexpectedly double in size.
This oversized file was part of the bot management system, a security feature that relies on machine learning to identify and block malicious traffic. When the file grew beyond its expected size, it triggered failures in the software responsible for pushing updates across Cloudflare’s global network. The result was widespread problems affecting core content delivery, security services, and other vital functions. Prince later clarified that a permissions adjustment in a database caused duplicate entries to be written into the feature file, which then propagated across the company’s infrastructure.
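That failure mode is easy to reproduce in miniature. The Python sketch below shows how a query that suddenly returns each row twice makes a generated file double in size even though no new features were added; the file names, feature names, and counts are illustrative assumptions, not details of Cloudflare’s actual systems.

```python
# Hypothetical illustration: how duplicate database rows can double a
# generated feature file. All names and counts here are assumptions for
# the example, not Cloudflare's actual schema or values.

def write_feature_file(rows: list[str], path: str) -> int:
    """Write one feature per line; return the number of entries written."""
    with open(path, "w") as f:
        for name in rows:
            f.write(name + "\n")
    return len(rows)

# Before the permissions change: the query yields one row per feature.
features = [f"feature_{i}" for i in range(60)]
print(write_feature_file(features, "features_before.txt"))  # 60 entries

# After the change: every feature comes back twice, so the generated
# file doubles in size even though nothing new was added.
duplicated = features + features
print(write_feature_file(duplicated, "features_after.txt"))  # 120 entries
```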
The software designed to read this file and keep threat protections current had a built-in size limit. Once the file exceeded that limit, the software simply stopped working. This cascading failure highlights how a single internal misconfiguration can have far-reaching consequences for global internet stability. After identifying the root cause, engineers halted the distribution of the oversized file and reverted to a previous, stable version.
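A minimal sketch of the consuming side, again with assumed names and an arbitrary limit, shows why an oversized file can stop such software outright rather than merely degrade it: if the loader enforces a hard cap and the resulting error is not handled, every machine that reads the distributed file fails in the same way.

```python
# Hypothetical sketch of a loader with a built-in capacity. The constant,
# function name, and error handling are assumptions for illustration only.

MAX_FEATURES = 200  # an assumed hard limit, not Cloudflare's actual value

def load_feature_file(path: str) -> list[str]:
    """Read one feature per line; refuse files beyond the built-in limit."""
    with open(path) as f:
        features = [line.strip() for line in f if line.strip()]
    if len(features) > MAX_FEATURES:
        # If this error propagates unhandled, the whole component stops,
        # which is how one oversized file can halt traffic processing
        # everywhere the file has been distributed.
        raise ValueError(
            f"feature file has {len(features)} entries; limit is {MAX_FEATURES}"
        )
    return features
```

Under this assumption, reverting to the earlier, smaller file lets the check pass again, which mirrors the fix the engineers applied.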
Service restoration began almost immediately after the corrected file was deployed, allowing core traffic to largely return to normal. Still, the company needed an additional two and a half hours to manage the surge of returning traffic and stabilize network load. Prince expressed regret for the disruption, acknowledging the widespread inconvenience caused by the incident.
(Source: Ars Technica)