Building Resilient IT Systems: Lessons from the CrowdStrike IT Outage

July 19, 2024

The developing global IT outage, triggered by a faulty update from CrowdStrike and compounded by issues with Microsoft’s Azure services, has revealed significant vulnerabilities in IT infrastructure. This incident affected multiple sectors, including airlines, hospitals, and retailers, offering vital lessons for CIOs on improving IT resilience and update management.

The crisis began in Australia, where banks, airlines, and TV broadcasters reported Blue Screens of Death (BSOD) on Windows devices. As the day progressed, the problem spread to Europe and the US, impacting businesses and critical services. UK broadcaster Sky News and airlines like Ryanair faced significant operational disruptions, while US airlines required FAA assistance. Hospitals in Germany had to cancel surgeries, and 911 emergency call centers in Alaska experienced outages. In the UK, NHS England encountered issues with GP appointment systems and patient records.

The root cause of the outage was a faulty update from CrowdStrike’s Falcon Sensor software, which caused Windows machines to crash and enter a recovery boot loop. CrowdStrike identified the issue as a software malfunction rather than a cyberattack, illustrating that poor update management and monitoring can cause outages just as damaging as those that stem from inadequate cybersecurity measures.

Brody Nisbet, the director of overwatch at CrowdStrike, posted on X that a workaround has been effective in some cases: boot the affected Windows machine into safe mode, find and delete a specific system file (C-00000291*.sys), and then reboot normally. This highlights the complexity and manual intervention sometimes required to address such widespread issues.
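As a rough illustration of the "find and delete" step, the sketch below matches files against the pattern quoted above. The function names and the dry-run default are assumptions made for this example, and the directory is left as a parameter because the workaround as quoted here does not name it. This is not an official remediation tool.

```python
from pathlib import Path

# File name pattern quoted in the workaround above.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_faulty_files(directory: Path) -> list[Path]:
    """Return files directly inside `directory` whose names match the pattern."""
    return sorted(p for p in directory.glob(CHANNEL_FILE_PATTERN) if p.is_file())


def remove_faulty_files(directory: Path, dry_run: bool = True) -> list[Path]:
    """Delete matching files. With dry_run=True, only report what would be removed."""
    matches = find_faulty_files(directory)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Defaulting to a dry run lets an operator confirm exactly which files match before anything is deleted.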

Key Strategies for CIOs to Prevent Future IT Outages

To prevent similar disruptions in the future, CIOs must adopt a more strategic and comprehensive approach to IT management. Here are key imperatives that can enhance resilience and ensure operational continuity:

A comprehensive update management approach is critical: CIOs must implement rigorous pre-deployment testing across various environments and configurations to detect potential issues early. Using staging environments that replicate production setups allows for thorough testing of updates. This process should include automated testing, manual testing, and regression testing to ensure that new updates do not interfere with existing functionality.
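Testing "across various environments and configurations" can be pictured as a test matrix. The following is a minimal sketch, assuming a hypothetical `update_passes` check supplied by the caller (for example, one that installs the update in a staging machine and runs a smoke test). The OS and configuration labels in the usage example are illustrative only.

```python
from itertools import product
from typing import Callable


def run_test_matrix(
    os_versions: list[str],
    configurations: list[str],
    update_passes: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Run the check on every OS/configuration pair and return the failing pairs."""
    return [
        (os_version, config)
        for os_version, config in product(os_versions, configurations)
        if not update_passes(os_version, config)
    ]


# Usage: a stand-in check that fails for one combination only.
failing = run_test_matrix(
    ["win10", "win11"],
    ["default", "hardened"],
    lambda os_version, config: (os_version, config) != ("win11", "hardened"),
)
```

An empty result means every combination passed. Any non-empty result would block the release in this sketch.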

Phased deployment can mitigate risks: By rolling out updates to a small group first and then expanding in phases, organizations can monitor for and address issues before a full-scale deployment. Robust rollback procedures that quickly revert to a stable version if problems arise are also crucial. Automated rollback capabilities can further enhance this strategy, allowing for faster recovery without significant manual intervention.
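The phased rollout with automated rollback described above can be sketched as follows. The ring structure and the `deploy`, `is_healthy`, and `rollback` callables are hypothetical placeholders for whatever deployment tooling an organization actually uses.

```python
from typing import Callable, Sequence


def phased_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy one ring at a time. If any host in a ring is unhealthy after that
    ring is updated, roll back every host updated so far and stop."""
    updated: list[str] = []
    for ring in rings:
        for host in ring:
            deploy(host)
            updated.append(host)
        if not all(is_healthy(host) for host in ring):
            for host in reversed(updated):
                rollback(host)
            return False
    return True
```

Because later rings are only touched after earlier rings pass their health check, a bad update in this sketch never reaches hosts beyond the ring where the problem first appears.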

Enhanced monitoring and incident response are essential: Utilizing advanced monitoring tools to detect anomalies immediately post-deployment enables rapid intervention. Real-time monitoring and alerting systems should be in place to catch issues as they occur. Developing detailed incident response plans with clear protocols for quick identification, isolation, and resolution of issues is vital. These plans should include root cause analysis and post-incident reviews to continuously improve response strategies.
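One simple way to detect an anomaly immediately after a deployment is to compare a post-deployment metric against its pre-deployment baseline. The sketch below assumes a crash-rate metric and a three-sigma threshold. Both are illustrative choices, not a recommendation of any particular monitoring product or alerting policy.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], observed: float, sigmas: float = 3.0) -> bool:
    """Flag `observed` if it exceeds the baseline mean by more than
    `sigmas` sample standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    threshold = mean(baseline) + sigmas * stdev(baseline)
    return observed > threshold


# Usage: hourly crash rates before the update versus the first reading after it.
baseline_crash_rates = [0.010, 0.012, 0.009, 0.011, 0.010]
alert = is_anomalous(baseline_crash_rates, observed=0.25)
```

A real alerting pipeline would usually evaluate several metrics and time windows. The point here is only that the comparison runs automatically, right after the update ships.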

Avoid single points of failure: Diversifying solutions enhances overall resilience. Implementing redundancy and failover mechanisms ensures that critical systems remain operational even if one component fails. Adopting a hybrid or multi-cloud infrastructure can significantly reduce the risk of a single point of failure by distributing workloads across multiple environments, enhancing redundancy, flexibility, and disaster recovery capabilities. Load balancing and geographic distribution of resources can further mitigate risks associated with localized failures.
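At the application level, failover across redundant environments can be as simple as trying each one in order. This is a minimal sketch in which the backends are hypothetical zero-argument callables standing in for calls to a primary and a secondary environment.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in order and return the first successful result."""
    errors: list[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed: {errors}")


# Usage: the primary is down, so the secondary serves the request.
def primary() -> str:
    raise ConnectionError("primary region unavailable")


def secondary() -> str:
    return "served by secondary"


result = call_with_failover([primary, secondary])
```

Production systems typically handle this in load balancers or DNS rather than in application code, but the ordering logic is the same idea.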

Continuously assess infrastructure resilience and disaster recovery plans: This proactive approach ensures that systems are prepared to handle future disruptions effectively. Regularly testing disaster recovery plans through simulated drills can identify weaknesses and areas for improvement. Partnering with reliable providers can further enhance preparedness and response capabilities by leveraging their expertise and resources.

These strategies are nothing new, but outages like this always serve as a reminder of their importance. By following these best practices, CIOs can build a more resilient infrastructure capable of withstanding unforeseen challenges.

Ensuring Future Resilience

At LightEdge, we believe in empowering organizations with the resilience needed to navigate and overcome disruptions like the recent global IT outage. As a leading provider of hybrid and multi-cloud services, we offer solutions designed to support a more resilient infrastructure. Our technology-agnostic approach ensures that organizations achieve the flexibility and redundancy necessary to maintain critical application availability during outages.

Our managed services include rigorous update management, patching, and 24/7/365 monitoring to reduce the risk of disruptions. Additionally, LightEdge’s expertise in disaster recovery and business continuity ensures that organizations can recover quickly from any interruptions, maintaining operational stability.

LightEdge’s global network of data centers is designed for complete resilience and offers load balancing, geographic distribution, and automated failover mechanisms. By helping customers adopt hybrid and multi-cloud strategies, we provide flexible and adaptable infrastructure solutions that reduce the risk of single points of failure.

This outage underscores the critical need for improved update management and resilient IT infrastructure. By adopting strategic best practices and partnering with experts like LightEdge, CIOs can strengthen their organization’s resilience, ensuring continuous operations despite unforeseen disruptions.

If you’re looking for a trusted partner to help you adopt a secure and resilient hybrid or multi-cloud architecture, connect with one of our specialists today.
