
Cybersecurity Insights

Preparing for the Next Major Cloud Outage

Posted in Blog, Cloud Pen Test, News

Cloud Outage Introduction

In the modern business climate, companies of all sizes rely on cloud platforms to keep their applications and services online around the clock. However, nothing underscores the critical importance of disaster recovery and business continuity quite like a major cloud outage. Such was the case on October 20, 2025, when AWS experienced a significant disruption in its US-EAST-1 region. This incident sent ripples throughout the tech world, affecting everything from popular social media and messaging platforms to cutting-edge fintech services. For companies that depend on cloud services, it was a stark lesson in how quickly even the largest providers can fail, leaving businesses scrambling to keep systems online and customers satisfied.

As a professional services firm working with clients across industries, Tanner offers guidance on strengthening cloud-based disaster recovery (DR) and business continuity (BC). This blog post will outline what happened during AWS’s outage, why such incidents matter to every business, and how you can protect your company in future disruptions.

Understanding the AWS Cloud Outage

The AWS outage began as an isolated problem but quickly evolved into a widespread incident. AWS confirmed that the root cause lay in its flagship US-EAST-1 region, where the issue led to elevated error rates and a surge in latency across key services. Identity services like IAM and STS were hit hard, making it difficult for applications to authenticate users. Network-related tools, such as PrivateLink and VPC Lattice, saw slower performance. Event services, including Kinesis and EventBridge, struggled to process workloads. These foundational services act like the spinal cord in cloud deployments; once they’re affected, the entire body of applications depending on them can suffer.

Major consumer applications went down or experienced severe slowdowns. Popular messaging and communication apps suddenly became inaccessible to millions. Fintech and trading platforms saw a spike in transaction failures. While AWS swiftly issued status updates, many businesses and end-users felt the impact for hours. This event served as a wake-up call: If a single region goes down, your cloud environment could grind to a halt unless you’ve built proper resiliency measures.

Why Cloud Outages Matter to Every Business

It’s easy to assume that an outage in one part of the country or on a single cloud provider won’t affect your operations if your own workloads seem unaffected. In reality, the cloud economy is highly interconnected. If a Software-as-a-Service (SaaS) vendor in your supply chain relies on AWS, your business operations could still be disrupted. Think of it like a series of dominoes; when one critical service provider topples, it can topple many businesses built on its platform.

Moreover, there’s often a disconnect between expectation and reality. Many companies assume high-level “out-of-the-box” cloud features guarantee resilience, believing the provider will handle failover end-to-end. The truth is more complex. While AWS offers powerful tools for redundancy and recovery, they typically require correct configuration, diligent testing, and proactive planning. If you only rely on default setups or minimal SLAs, you could be in for a rude awakening when a major cloud outage hits.

Key Lessons from the AWS Cloud Outage Failure

First and foremost, depending on a single region is a clear point of vulnerability. Cloud providers generally design their regions as isolated fault domains, meaning a glitch in one area shouldn’t automatically impact another. But if all your data and services live in one region, you’re inherently at risk should it fail. Another critical lesson is that platform-level services, such as identity management or event-driven pipelines, are the glue that holds modern applications together. If those core services experience disruptions, the entire ecosystem can suffer widespread fallout, no matter how well-architected the applications built on top of them are.

Perhaps the most important takeaway is that every business must assume the possibility of significant cloud outages or disruptions. Whether you choose a multi-region or multi-cloud approach, planning for the scenario where your primary hosting environment becomes temporarily unavailable is essential. This is about more than just ticking boxes on a DR checklist. It involves actively rehearsing failovers, verifying that your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are realistic, and ensuring your staff can quickly execute the necessary response steps.
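One way to make the RTO and RPO check concrete is to compare what a failover rehearsal actually measured against the stated objectives. The Python sketch below is illustrative only: the `DrillResult` record, its field names, the `checkout-api` service, and all of the numbers are assumptions made up for the example, not figures from any real drill.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillResult:
    """Measurements from one failover rehearsal (hypothetical record)."""
    service: str
    recovery_time: timedelta      # time until the service was usable again
    data_loss_window: timedelta   # how much recent data could not be recovered


def meets_objectives(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured recovery against the stated RTO and RPO."""
    return {
        "service": result.service,
        "rto_met": result.recovery_time <= rto,
        "rpo_met": result.data_loss_window <= rpo,
    }


# Illustrative numbers only: a 42-minute recovery against a 30-minute RTO.
drill = DrillResult("checkout-api", timedelta(minutes=42), timedelta(minutes=3))
report = meets_objectives(drill, rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
print(report)
```

In this made-up case the RPO is met but the RTO is not, which is the kind of gap a rehearsal is meant to surface before a real outage does.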

“The AWS outage underscored that relying on built-in redundancy won’t cut it. Businesses need a proactive strategy that anticipates and prepares for regional cloud failures,” says John Pohlman, Director of Cybersecurity Consulting Services at Tanner Security.

Creating a Resilient Cloud Architecture

Building true resilience into your cloud environment requires intentional architecture design. Configuring multi-region failover is often the first step. This means hosting workloads in at least two geographically distinct AWS regions, each capable of handling production traffic if the other goes offline. For some companies, multi-cloud strategies spread the risk even further by adding a second provider. If one cloud provider goes down, another can pick up the slack. While multi-region or multi-cloud configurations may seem more complex, these strategies become crucial investments if uninterrupted uptime is a top priority.
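The decision logic behind multi-region failover can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `pick_region` and the simulated health table are hypothetical, and in a real deployment the health check would probe an actual endpoint and the routing decision would often sit in DNS or a load balancer rather than in application code.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in priority order, that passes its health check."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as an unhealthy region.
            continue
    return None  # no region is currently able to serve traffic


# Simulated health state (assumption): the primary region is down.
simulated_status = {"us-east-1": False, "us-west-2": True}
active = pick_region(["us-east-1", "us-west-2"], lambda r: simulated_status[r])
print(active)  # us-west-2
```

The `None` return is deliberate: a failover plan also needs a defined behavior for the case where every candidate region is unhealthy.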

Dependency mapping is another underrated but essential element of resilience. Most businesses run a patchwork quilt of cloud services: storage buckets, queue managers, identity tools, event pipelines, logging systems, and more. Any break in this chain can cause partial or complete downtime. A thorough inventory reveals those critical dependencies so you can craft targeted failover solutions. Equally crucial is scheduling regular failover tests. It’s not enough to enable AWS’s cross-region replication features; you must simulate failovers, intentionally “pulling the plug” in a staging environment, to see whether your processes and systems can truly handle it.
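A dependency inventory becomes more useful once it can answer the question "if this service fails, what else goes down with it?" The Python sketch below shows one way to do that with a breadth-first walk over the inventory. The component names and the `DEPENDS_ON` map are hypothetical; a real inventory would come from your own architecture records.

```python
from collections import deque

# Hypothetical inventory: each component maps to what it depends on directly.
DEPENDS_ON = {
    "web-app": ["auth-service", "orders-queue"],
    "auth-service": ["identity-provider"],
    "orders-queue": ["event-bus"],
    "reporting": ["object-storage"],
}


def impacted_by(failed: str, depends_on: dict) -> set:
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


print(sorted(impacted_by("identity-provider", DEPENDS_ON)))
# ['auth-service', 'web-app']
```

Here a failure in the identity provider reaches the web application through the authentication service, even though the web application never calls the identity provider directly.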

Immediate Steps to Strengthen Disaster Recovery

After a high-profile outage, we often see a rush to implement quick fixes, such as enabling cross-region replication for storage or installing additional monitoring tools. While these steps can help, they should be part of a larger, methodical approach. Start by conducting a post-incident impact review: Which services failed? How long did they remain offline? Did the downtime translate to lost sales, unhappy customers, or reduced employee productivity? Understanding these factors will guide your future priorities and investments.
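The first two review questions (which services failed, and for how long) can be answered from an incident log with a short script. The Python sketch below is a minimal example under stated assumptions: the `OUTAGES` records, service names, and timestamps are invented for illustration, and the function assumes that outage windows for the same service do not overlap.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical incident log: (service, outage start, outage end)
OUTAGES = [
    ("login", datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 10, 30)),
    ("payments", datetime(2025, 1, 1, 9, 15), datetime(2025, 1, 1, 9, 45)),
    ("login", datetime(2025, 1, 1, 13, 0), datetime(2025, 1, 1, 13, 20)),
]


def downtime_by_service(outages):
    """Sum outage durations per service, longest total first."""
    totals = defaultdict(timedelta)
    for service, start, end in outages:
        totals[service] += end - start
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for service, total in downtime_by_service(OUTAGES):
    print(f"{service}: {total}")
# login: 1:50:00
# payments: 0:30:00
```

Ranking services by total downtime gives a starting point for the business questions that follow, such as which outages cost the most in sales or productivity.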

Next, validate your existing DR protocols. Many businesses boast of having an official DR document, yet that document is often outdated or untested. Ensure your failover runbooks match reality. Check your RTO and RPO: Are they still aligned with business goals? Are the right individuals trained to execute emergency procedures? Run a live failover drill to confirm that the environment in your secondary region can come online quickly. Finally, review your contractual Service Level Agreements (SLAs) with cloud providers to clarify your options if expanded service or other resilience measures are warranted. Communicating the lessons learned and the next steps to stakeholders, both internal teams and external customers, is a powerful way to boost confidence and accountability.

Long-Term Resilience Strategies and Best Practices

Preparing for future outages isn’t a one-and-done exercise. It’s a continuous discipline that evolves with technology and business demands. When designing cloud environments, consider region-independent and provider-independent architectures wherever possible. Keep in mind that to fail over effectively to another provider, your workloads must not be overly reliant on proprietary features specific to a single cloud platform.

Ongoing DR and BC exercises can highlight vulnerabilities before they create real-world disasters. Routine tabletop drills or brownout simulations can reveal hidden weak points, like a misconfigured network setting or an overlooked SaaS dependency. Diversifying your cloud vendors is also viable if the risk reduction outweighs the added management complexity and cost. Lastly, resilience isn’t a function you can delegate solely to the IT department. Boards and executive teams should monitor resilience metrics, allocate adequate budgets, and establish outage preparedness as a strategic priority.

Cloud Outages Conclusion

The October 2025 AWS US-EAST-1 outage is a reminder that cloud failures can and do happen, even with top-tier providers. Rather than viewing outages as rare, unpredictable events, companies should treat them as an operational certainty that mandates carefully crafted DR and BC measures. You can dramatically reduce your vulnerability during major disruptions by designing multi-region or multi-cloud strategies, diligently mapping dependencies, and routinely testing failovers. A good plan is not just about technology; it’s about operational readiness, clear communication, and decisive execution when the unexpected occurs.

At Tanner Security, we believe in partnering with clients to craft tailored resilience solutions that meet both their technical requirements and business realities. If you want to strengthen your disaster recovery framework, we can help you navigate the complex world of cloud architecture and develop a strategy designed to weather any storm.
