Overheating at AWS data centre in northern Virginia disrupts Coinbase

▼ Summary
– A cooling system failure in an AWS data centre in northern Virginia caused service disruptions, forcing traffic rerouting and delaying full recovery.
– Coinbase confirmed its trading platform issues were due to the AWS outage, with markets restored after several hours.
– The northern Virginia region, US-East-1, is AWS’s oldest and busiest, accumulating workloads and customer inertia since 2006.
– This incident follows a larger AWS outage in October 2025 caused by a DNS failure, highlighting recurring single-site bottlenecks.
– AWS advises customers to avoid running everything in one Availability Zone, a recommendation often ignored despite being a standard best practice.
A single overheating data centre in northern Virginia triggered service disruptions for major platforms on Thursday, as Amazon Web Services struggled to restore full operations after its cooling system proved inadequate. The incident affected Coinbase and potentially CME Group, highlighting persistent vulnerabilities in the concentrated cloud infrastructure that underpins much of the modern internet.
The problem began when rising temperatures inside one of AWS’s northern Virginia facilities, caused by a cooling system shortfall, forced the company to throttle and redirect traffic away from the affected Availability Zone. AWS confirmed that engineers were still working late into the night to bring the site fully back online, with most users still offline. While additional cooling capacity came online within a couple of hours and early signs of recovery emerged, a later update was less optimistic: bringing enough extra cooling online to safely restart the remaining systems was taking longer than anticipated, and AWS declined to provide a timeline for full restoration.
Coinbase acknowledged that its trading platform issues stemmed from the AWS event. After several hours of degraded markets, the exchange reported that all markets had been re-enabled and trading had returned to normal. CME Group, the world’s largest derivatives marketplace, also reported problems with its CME Direct platform during the same period, though it attributed the disruption to “essential maintenance” and did not explicitly link it to the AWS failure. Both companies declined further comment outside business hours.
The affected region, known as US-East-1 in AWS terminology, is the company’s oldest, busiest, and most concentrated. Each Availability Zone within it groups one or more physical data centres designed to operate independently of the region’s other zones. AWS’s standard recovery guidance is to fail over to another zone, a strategy that works well for customers who have built for it but less so for those who have not. This pattern is becoming familiar. Last October, a DNS resolution failure in DynamoDB cascaded across more than a hundred services, taking down platforms including Snapchat, Reddit, United Airlines, and Coinbase for roughly fourteen hours in the largest internet-wide disruption since the CrowdStrike software malfunction of 2024. A month later, CME suffered one of its longest trading outages in years, traced back to a cooling failure at a CyrusOne data centre in the Chicago area.
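What separates the customers who ride out a zone failure from those who do not is largely architectural. As a rough illustration, and with the caveat that every name and ID below is a hypothetical placeholder rather than anything drawn from the incident, a fleet built for zone failover might be defined with the boto3 SDK along these lines:

```python
"""Hypothetical sketch: an Auto Scaling group whose subnets sit in three
different Availability Zones, so capacity lost in an impaired zone is
relaunched in the healthy ones. Names and subnet IDs are placeholders."""

import boto3

# Assumes AWS credentials are configured and a launch template already exists.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier",                  # hypothetical group name
    LaunchTemplate={
        "LaunchTemplateName": "web-tier-template",    # hypothetical template
        "Version": "$Latest",
    },
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=3,
    # One subnet per Availability Zone; the group spreads instances across
    # the zones and replaces capacity elsewhere if one zone becomes impaired.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```

A group defined this way treats the loss of one zone as missing capacity to be replaced in the others; a workload pinned to a single zone has no such escape hatch.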
That repetition of single-site failures is significant. Cooling failures, configuration errors, and DNS misfires are different technical events, but they share a common outcome: a single physical or logical site becomes the bottleneck for an outsized share of public-facing traffic. The northern Virginia region carries that load more by historical accident than by design. AWS launched the region in 2006, and US-East-1 has accumulated workloads, regulatory dependencies, and customer inertia ever since. While hyperscalers are spending tens of billions to expand other regions, customer concentration in US-East-1 is unlikely to shift quickly.
Coinbase’s exposure to the cloud sits inside a longer arc of vulnerability. The Cloudflare-driven outage that took down Coinbase and other exchanges in 2019 was a different failure mode but delivered the same lesson, and it has driven crypto exchanges to spend years architecting for multi-region failover. Thursday’s incident demonstrates that even with that work, the shutdown of a single overheated server room still ripples into a market that is supposed to be open around the clock.
CME’s situation is more delicate. Derivatives markets sit on top of complex margin and clearing pipelines that do not degrade gracefully. An outage at peak Asia hours, as Thursday’s was, hits clearing-cycle deadlines that move money the next morning. Whether the CME issue was directly tied to the AWS event will determine how the trading-resilience conversation lands with regulators.
AWS has not estimated the number of affected workloads, and Amazon has not yet explained why the cooling system fell behind: whether the issue was equipment failure, ambient conditions, or a combination of the two. The northern Virginia region has spent the past year absorbing a wave of new AI training and inference capacity, which runs hotter and denser than traditional cloud workloads. Whether that is incidentally relevant to Thursday’s failure or substantively part of the cause is the question the post-incident report will need to address.
For most customers, the fix remains the one AWS recommended in its first update: stop running everything in a single Availability Zone in a single region. That advice has been on AWS’s own architecture-best-practice page for years. Each failure of this kind raises the cost of having ignored it.
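A quick way to see whether that advice has actually been followed is to count where the instances are running. The sketch below is illustrative only: it assumes the boto3 SDK is installed and credentials are configured, and it simply reports how running EC2 instances in one region are spread across Availability Zones, warning when everything sits in a single zone.

```python
"""Illustrative only: count running EC2 instances per Availability Zone in
one region and warn when everything sits in a single zone."""

from collections import Counter

import boto3


def az_distribution(region: str = "us-east-1") -> Counter:
    """Return a count of running EC2 instances per Availability Zone."""
    ec2 = boto3.client("ec2", region_name=region)
    counts: Counter = Counter()
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance["Placement"]["AvailabilityZone"]] += 1
    return counts


if __name__ == "__main__":
    dist = az_distribution()
    if not dist:
        print("No running instances found.")
    else:
        total = sum(dist.values())
        for az, n in dist.most_common():
            print(f"{az}: {n} running instances ({n / total:.0%})")
        if len(dist) < 2:
            print("Warning: every running instance is in one Availability Zone.")
```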
(Source: The Next Web)