
Cloudflare Outage: Database Glitch Caused Widespread Disruption

Summary

– Cloudflare experienced its worst outage in 6 years, lasting nearly 6 hours and blocking access to many websites after a database permissions change triggered a cascading failure.
– The outage was not caused by a cyberattack but by a routine database update that generated an oversized configuration file exceeding system limits.
– The oversized file caused the Bot Management system’s Rust code to crash, triggering 5xx errors and disrupting core traffic processing across Cloudflare’s global network.
– Service was restored after engineers identified the root cause and replaced the problematic file, with full recovery achieved within approximately 6 hours.
– This incident affected multiple Cloudflare services including core CDN, security features, and dashboard access, marking their most significant outage since 2019.

A significant disruption impacted countless websites and online services for nearly six hours on Tuesday, marking Cloudflare’s most severe outage in six years. The problem originated from a database permissions adjustment that set off a chain reaction of failures across the company’s global infrastructure. That network, which spans servers in more than 120 countries, provides content delivery, security, and performance services to over 13,000 networks worldwide, including major internet providers and cloud platforms.

Cloudflare CEO Matthew Prince confirmed in a detailed incident report that the widespread service issues were not the result of any cyberattack. He explained that a routine update to database access controls inadvertently caused the database to produce duplicate entries. These duplicates were then written into a configuration file used by the Bot Management system, which identifies and manages automated traffic.

The incident started at 11:28 UTC, when the flawed database query caused the Bot Management system to generate an oversized configuration file. The file contained duplicate column metadata, inflating the number of features listed from about 60 to more than 200. The system enforces a hardcoded limit of 200 features to keep memory use bounded, and once the file exceeded that cap the software crashed, disrupting traffic routing across Cloudflare’s network.
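
To make the failure mode concrete, here is a minimal, hypothetical sketch of how a hardcoded feature cap combined with an unchecked error can turn an oversized configuration into a process-killing crash. The names (MAX_FEATURES, FeatureConfig, load_features) and the duplication factor are illustrative assumptions, not Cloudflare’s actual code.

```rust
// Hypothetical sketch only: names and structure are illustrative, not Cloudflare's code.

const MAX_FEATURES: usize = 200; // hardcoded cap intended to bound memory use

struct FeatureConfig {
    features: Vec<String>,
}

fn load_features(rows: &[&str]) -> Result<FeatureConfig, String> {
    // Duplicate column metadata in the source query inflates `rows`
    // past the cap, turning a routine refresh into a hard error.
    if rows.len() > MAX_FEATURES {
        return Err(format!(
            "feature count {} exceeds limit {}",
            rows.len(),
            MAX_FEATURES
        ));
    }
    Ok(FeatureConfig {
        features: rows.iter().map(|s| s.to_string()).collect(),
    })
}

fn main() {
    // Roughly 60 unique features, duplicated so the total passes the 200 cap.
    let unique: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    let mut rows: Vec<&str> = Vec::new();
    for _ in 0..4 {
        rows.extend(unique.iter().map(|s| s.as_str()));
    }

    // Calling .unwrap() on the error is what turns the oversized file into a
    // process-crashing panic instead of a handled failure; the println below
    // is never reached.
    let config = load_features(&rows).unwrap();
    println!("loaded {} features", config.features.len());
}
```

Handling the error instead of unwrapping it, for example by falling back to the last known-good file, would have contained the bad input to a logged warning rather than a crash.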

Every five minutes, an automated query regenerated the configuration file, producing either a correct or a faulty version depending on which database cluster nodes had already received the permissions update. The result was a fluctuating situation in which parts of the network would work intermittently before failing again. As the oversized file propagated to machines throughout the network, the Bot Management module, written in Rust, triggered a panic that crashed the core proxy system responsible for processing traffic, returning widespread 5xx HTTP errors to users.
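
The report does not detail the proxy’s internals, but the following hypothetical sketch shows one plausible way a panic inside a per-request module surfaces to users as 5xx responses, assuming a panic-catching boundary at the request-handling layer. The function names and the 200-feature threshold echo the description above and are not taken from Cloudflare’s code.

```rust
// Hypothetical sketch: how a panicking module can translate into HTTP 500s.
use std::panic::{self, AssertUnwindSafe};

// Stands in for the bot-scoring step that reads the feature configuration.
fn score_request(feature_count: usize) -> u32 {
    // Panics when the configuration exceeds the limit, as described above.
    assert!(feature_count <= 200, "feature count exceeds limit");
    42 // dummy bot score
}

// Stands in for the proxy's per-request pipeline: a panic in the module is
// caught at this boundary and converted into an HTTP 500 for the client.
fn handle_request(feature_count: usize) -> u16 {
    match panic::catch_unwind(AssertUnwindSafe(|| score_request(feature_count))) {
        Ok(_score) => 200,
        Err(_) => 500, // every request touching the bad config becomes a 5xx
    }
}

fn main() {
    // A healthy configuration (60 features) versus the oversized one (240).
    println!("healthy config -> HTTP {}", handle_request(60));
    println!("oversized config -> HTTP {}", handle_request(240));
}
```

In this framing, machines holding a correct file keep answering with 200s while machines holding the oversized file answer with 500s, which is consistent with the network appearing to recover and fail in waves.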

Engineers managed to restore core traffic flow by 14:30 UTC after identifying the root cause and rolling back the problematic configuration file to a stable earlier version. Full service was reinstated across all systems by 17:06 UTC. The outage affected a broad range of Cloudflare services, including its core content delivery and security offerings, Turnstile, Workers KV, customer dashboard access, email security, and access authentication systems.

Prince expressed regret for the disruption caused to customers and the broader internet community. He emphasized that any interruption of Cloudflare’s services is unacceptable given the company’s crucial role in the internet ecosystem. He noted that this was the most significant outage since 2019, surpassing previous incidents that only affected specific features or the dashboard. In over six years, no other event had halted the majority of core traffic flowing through its network.

This incident follows another substantial outage Cloudflare experienced in June, which caused connectivity problems with Zero Trust WARP and Access authentication across several regions, also impacting Google Cloud infrastructure. In a separate event last October, Amazon addressed a major outage caused by a DNS failure that disrupted millions of websites relying on its Amazon Web Services cloud platform.

(Source: Bleeping Computer)

Topics

Cloudflare outage, database permissions, cascading failure, Bot Management, service disruption, configuration files, system panic, 5xx errors, global network, root cause