Azure Outage Ends, But Issues Remain: What Went Wrong

Summary
– Microsoft Azure experienced a global outage on October 29, 2025, affecting all regions and multiple customer-facing services.
– The outage was caused by an inadvertent configuration change in Azure Front Door that bypassed safety checks due to a software defect.
– Recovery involved deploying a “last known good” configuration and gradually rebalancing traffic, with full restoration achieved by 8:05 p.m. ET.
– Services impacted included Microsoft 365, Xbox, Teams, Azure Portal, and various enterprise systems from airlines and other organizations.
– Microsoft has implemented additional validation and rollback controls to prevent similar incidents in the future.
Microsoft Azure experienced a significant global service disruption on October 29, affecting numerous customer-facing platforms and applications. The outage began around noon Eastern Time and persisted for approximately eight hours, with recovery efforts continuing even after Microsoft declared services largely restored. This incident marks the second major cloud service failure within a week, following Amazon Web Services’ recent downtime.
Network monitoring firm ThousandEyes detected HTTP timeouts, server error codes, and elevated packet loss at the edge of Microsoft’s network during the outage. As a result, many connections to affected services failed outright, timed out, or returned server-side errors.
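For readers unfamiliar with this kind of edge monitoring, the sketch below shows in rough terms how an external probe can distinguish a request that times out from one that comes back with a server error code. It is an illustration only, not ThousandEyes’ tooling, and the endpoint URLs are placeholders.

```python
# Minimal external health probe: flags timeouts and server-side error codes.
# Illustrative only -- the endpoint URLs are placeholders, not real probe targets.
import socket
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://example-frontdoor-endpoint-1.example.net/health",
    "https://example-frontdoor-endpoint-2.example.net/health",
]

def probe(url: str, timeout: float = 5.0) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK ({resp.status})"
    except urllib.error.HTTPError as err:
        # The server responded, but with an error code (e.g. 502/503 at the edge).
        return f"SERVER ERROR ({err.code})"
    except (urllib.error.URLError, socket.timeout) as err:
        # No usable response at all: timeout, connection reset, DNS failure, etc.
        return f"UNREACHABLE ({err})"

for url in ENDPOINTS:
    print(url, "->", probe(url))
```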
Microsoft’s status update at 5:30 p.m. ET indicated the company had begun deploying its “last known good” configuration; that rollout subsequently completed successfully. However, the technology giant cautioned that recovery would be gradual to ensure system stability. The company anticipated full restoration by 7:30 p.m. ET, though actual recovery extended until approximately 8:05 p.m.
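Microsoft has not published the details of its rebalancing procedure. As a loose illustration of what a staged recovery can look like, the sketch below ramps recovered nodes back up to their normal traffic share in steps rather than all at once; the node names and percentages are invented.

```python
# Illustration of a staged traffic ramp after restoring a "last known good"
# configuration: recovered nodes take a growing share of traffic in steps
# rather than receiving full load immediately. Names and fractions are invented.

RECOVERED_NODES = ["edge-eu-1", "edge-us-2", "edge-apac-1"]
RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]   # fraction of normal weight at each step

def ramp_schedule(nodes, steps):
    """Yield (step, node, weight) tuples describing the gradual ramp."""
    for i, fraction in enumerate(steps, start=1):
        for node in nodes:
            yield i, node, fraction

for step, node, weight in ramp_schedule(RECOVERED_NODES, RAMP_STEPS):
    print(f"step {step}: {node} serving {weight:.0%} of its normal traffic share")
```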
Even after declaring services operational, Microsoft warned that customer configuration changes to Azure Front Door remained temporarily blocked. The company also acknowledged that a small number of customers might continue experiencing issues, describing this as a “long tail” problem requiring ongoing mitigation efforts.
Unlike the recent AWS outage that affected a single region, the Azure disruption impacted all global regions simultaneously. Microsoft’s initial investigation pointed to an inadvertent configuration change within Azure Front Door (AFD) as the trigger event. The change produced an invalid configuration state that caused numerous AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors across downstream services.
The faulty configuration deployment bypassed safety validations due to a software defect in Microsoft’s protection mechanisms. As unhealthy nodes dropped from the global pool, traffic distribution became imbalanced, amplifying the impact and causing intermittent availability issues even in partially healthy regions.
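The amplification effect is straightforward arithmetic: as nodes leave the pool, each surviving node inherits a larger slice of the same traffic. The figures in the sketch below are invented and are not Azure’s actual fleet size or request volume.

```python
# Back-of-the-envelope illustration of load concentration: when unhealthy
# nodes drop out of a pool, each surviving node's share of traffic grows.
# Fleet size and request rate are invented for illustration.

TOTAL_RPS = 1_000_000        # aggregate requests per second (invented)
POOL_SIZE = 200              # healthy nodes before the incident (invented)

def per_node_load(total_rps: float, healthy_nodes: int) -> float:
    return total_rps / healthy_nodes

baseline = per_node_load(TOTAL_RPS, POOL_SIZE)
for failed in (0, 50, 100, 150):
    remaining = POOL_SIZE - failed
    load = per_node_load(TOTAL_RPS, remaining)
    print(f"{failed:3d} nodes down -> {remaining:3d} left, "
          f"{load:,.0f} req/s per node ({load / baseline:.1f}x baseline)")
```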
Popular consumer and business services experienced significant disruptions, including Microsoft 365, Microsoft Teams, Xbox Live, and Minecraft. Enterprise customers also felt the impact, with Alaska Airlines reporting interruptions to critical internal systems, including its website and operational infrastructure. Vodafone in the UK and Heathrow Airport were among other organizations affected by the outage.
Behind the scenes, numerous Azure services experienced problems, including App Service, Azure Active Directory B2C, Azure SQL Database, and Microsoft Sentinel. Telecommunications analyst Luke Kehoe noted this marked the second such event in a month, highlighting systemic risks associated with concentration and single points of logical failure in cloud infrastructure.
The timing proved particularly unfortunate for Microsoft, which reported strong quarterly earnings the same day, including roughly 40% growth in Azure revenue. The company’s stock nonetheless declined in after-market trading following the outage disclosure and Microsoft’s acknowledgment that it is struggling to keep pace with AI and cloud demand.
Microsoft has implemented additional validation and rollback controls to prevent similar incidents, though the event underscores the broader industry challenge of maintaining reliability in increasingly complex cloud environments.
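Microsoft has not described those controls in detail. The sketch below only illustrates the general pattern it refers to, assuming hypothetical configuration fields: reject a configuration that fails validation, and fall back to the retained last-known-good version rather than shipping the bad one.

```python
# Hypothetical sketch of a deploy-time guard: a configuration must pass
# validation before rollout, and the previous good version is retained so the
# system can fall back automatically. Not Microsoft's implementation; the
# config fields ("routes", "backend") are invented for illustration.

class ConfigValidationError(Exception):
    pass

def validate(config: dict) -> None:
    """Reject obviously invalid states before they reach production."""
    if not config.get("routes"):
        raise ConfigValidationError("configuration defines no routes")
    if any(route.get("backend") is None for route in config["routes"]):
        raise ConfigValidationError("route without a backend")

def deploy(new_config: dict, last_known_good: dict) -> dict:
    """Return the configuration that should be live after this attempt."""
    try:
        validate(new_config)          # the gate that must never be bypassed
    except ConfigValidationError as err:
        print(f"rejected: {err}; keeping last known good")
        return last_known_good
    return new_config

lkg = {"routes": [{"path": "/*", "backend": "origin-a"}]}
bad = {"routes": [{"path": "/*", "backend": None}]}
live = deploy(bad, lkg)
print("live config:", live)
```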
(Source: NewsAPI AI & Machine Learning)





