Everyone knows now how a flawed update crashed 8.5 million computers running the Windows version of CrowdStrike’s Falcon cybersecurity software — but what does the failure of one company’s software testing regime mean for the IT industry as a whole? Experts and analysts say that the idiosyncrasies of the technology sector mean it could easily happen again.

Quality vs speed

CrowdStrike has given its version of events leading up to the July 19 crash.

But for independent IT expert Fernando Maldonado, one of the causes of CrowdStrike’s failure lies in the way the cybersecurity industry competes with cybercriminals. “There is a race to always cover the latest threats. So, to close the window between when a threat is discovered and when you cover it, you have to pick up a certain speed,” which can mean less attention is paid to the quality of a given update, he says.

Cybersecurity vendor CrowdStrike triggered a cascade of computer system outages across the world on Friday, July 19, disrupting nearly every industry and sowing chaos at airports, financial institutions, and healthcare systems, among others.

At issue was a flawed update to CrowdStrike Falcon, the company’s popular endpoint detection and response (EDR) platform, which crashed Windows machines and sent them into an endless reboot cycle, taking down servers and rendering ‘blue screens of death’ on displays across the world.

How did the CrowdStrike outage unfold?

Australian businesses were among the first to report problems on Friday morning, with some continuing to experience difficulties throughout the day. Travelers at Sydney Airport faced delays and cancellations. At 6pm Australian Eastern Standard Time (08:00 UTC), Bank Australia posted an announcement to its home page saying that its contact center services were still experiencing problems.

Businesses across the globe followed suit as their days began. Travelers at airports in Hong Kong, India, Berlin, and Amsterdam encountered delays and cancellations. According to the New York Times, the Federal Aviation Administration reported that US airlines grounded all flights for a period of time.

What has been the impact of the CrowdStrike outage?

As one of the largest cybersecurity companies, CrowdStrike’s software is very popular among businesses across the globe. For example, over half of Fortune 500 companies use security products from CrowdStrike, which CSO ranks No. 6 on its list of most powerful cybersecurity companies.

Because of this, fallout from the flawed update has been widespread and substantial, with some calling it the “largest IT outage in history.”

To put the scale in perspective, more than 3,000 flights within, into, or out of the US were canceled on July 19, with more than 11,000 delayed. Disruptions continued in the days that followed, with nearly 2,500 flights within, into, or out of the US canceled and more than 38,000 delayed three days after the outage occurred.

The outage also significantly impacted the healthcare industry, with some healthcare systems and hospitals postponing all or most procedures, and clinicians resorting to pen and paper, unable to access electronic health records (EHRs).

Given the nature of the fix for many enterprises, and the popularity of CrowdStrike’s software, IT organizations have been working around the clock to restore their systems, with many still mired in doing so days after the initial faulty update was served up by CrowdStrike.

On July 20, Microsoft reported that an estimated 8.5 million Windows devices had been impacted by the outage. On July 27, Microsoft clarified that its estimates are based on crash reports, which are “sampled and collected only from customers who choose to upload their crashes to Microsoft.”

What caused the CrowdStrike outage?

In a blog post on July 19, CrowdStrike CEO George Kurtz apologized to the company’s customers and partners for crashing their Windows systems. Separately, the company provided initial details about what caused the disaster.

According to CrowdStrike, a defective content update to its Falcon EDR platform was pushed to Windows machines at 04:09 UTC (12:09 a.m. ET) on Friday, July 19. CrowdStrike typically pushes updates to configuration files (called “Channel Files”) for Falcon endpoint sensors several times a day.

The defect that triggered the outage was in Channel File 291, which is stored in “C:\Windows\System32\drivers\CrowdStrike” with a filename beginning “C-00000291-” and ending “.sys”. Channel File 291 passes information to the Falcon sensor about how to evaluate “named pipe” execution, which Windows systems use for intersystem or interprocess communication. Named pipes are not inherently malicious but can be misused.

“The update that occurred at 04:09 UTC was designed to target newly observed, malicious named pipes being used by common C2 [command and control] frameworks in cyberattacks,” the technical post explained.

However, according to CrowdStrike, “The configuration update triggered a logic error that resulted in an operating system crash.”

Upon automatic reboot, the Windows systems with the defective Channel File 291 installed would crash again, causing an endless reboot cycle.

In a follow-up post on July 24, CrowdStrike provided further details on the logic error: “When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).”
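CrowdStrike has not published the Content Interpreter’s source, but the failure mode it describes, content supplying a value that the interpreter reads past the end of a buffer without checking, can be pictured with a small user-space sketch. Everything below (the param_index field, the parameter table, the file structure) is hypothetical and purely illustrative:

```python
# Toy illustration only; hypothetical field names, not CrowdStrike code.
# The failure mode: content supplies an index the interpreter never
# bounds-checks, and the resulting fault goes unhandled.

SENSOR_PARAMS = ["pipe_name", "process_path", "command_line"]  # 3 slots (0-2)

def interpret(template_instance: dict) -> str:
    idx = template_instance["param_index"]   # taken straight from the content
    return SENSOR_PARAMS[idx]                # no bounds check before the read

good = {"param_index": 1}
bad = {"param_index": 20}                    # analogous to "problematic content data"

print(interpret(good))   # "process_path"
print(interpret(bad))    # raises IndexError: in user space that merely crashes
                         # this script; in a kernel-mode driver, an equivalent
                         # out-of-bounds read is an unhandled exception that
                         # brings down the entire OS (the BSOD)
```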

The defective update, which included new exploit signatures, was part of CrowdStrike’s Rapid Response Content program, which the company says goes through less rigorous testing than do updates to Falcon’s software agents. Whereas customers have the option of operating with the latest version of Falcon’s Sensor Content, or with either of the two previous versions if they prefer reliability over coverage of the most recent attacks, Rapid Response Content is deployed automatically to compatible sensor versions.

The flawed update only impacted machines running Windows. Linux and macOS machines using CrowdStrike were unaffected, according to the company.

How has CrowdStrike responded?

According to the company, CrowdStrike pushed out a fix removing the defective content in Channel File 291 just 79 minutes after the initial flawed update was sent. Machines that had not yet updated to the faulty Channel File 291 update would not be impacted by the flaw. But those machines that had already downloaded the defective content weren’t so lucky.

To remediate those systems caught up in endless reboot, CrowdStrike published another blog post with a far longer set of actions to perform. Included were suggestions for remotely detecting and automatically recovering affected systems, with detailed sets of instructions for temporary workarounds for affected physical machines or virtual servers, including manual reboots.
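The heart of the widely circulated manual workaround was to boot an affected machine into safe mode and delete the defective channel file from the driver directory described above. A minimal sketch of that deletion step follows; it assumes the documented path and file-name pattern, should be run with administrator rights from safe mode, and is illustrative rather than an official CrowdStrike remediation script:

```python
# Illustrative sketch of the manual workaround's deletion step.
# Run from Windows safe mode with administrator rights; not an official tool.
from pathlib import Path

DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_defective_channel_file() -> None:
    # The defective content shipped as C-00000291*.sys; other channel files
    # use different numbers and are left untouched.
    for channel_file in DRIVER_DIR.glob("C-00000291*.sys"):
        print(f"Removing {channel_file}")
        channel_file.unlink()

if __name__ == "__main__":
    remove_defective_channel_file()
    print("Done. Reboot normally; the sensor can then pull the corrected content.")
```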

On July 24, CrowdStrike reported on the testing process lapses that led to the flawed update being pushed out to customer systems. In its post-mortem, the company blamed a hole in its testing software that caused its Content Validator tool to miss a flaw in the defective Channel File 291 content update. The company has pledged to improve its testing processes by ensuring updates are tested locally before being sent to clients, adding additional stability and content interface testing, improving error handling procedures, and introducing a staggered deployment strategy for Rapid Response Content.
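One of those pledged changes, staggered deployment, is a standard release-engineering pattern often called a canary or ring rollout: new content reaches a small slice of the fleet first and is promoted outward only if telemetry stays healthy. The sketch below is a generic illustration of the idea, not CrowdStrike’s implementation; the ring sizes, threshold, and crash_rate_after_deploy stub are all invented:

```python
# Generic sketch of a staggered ("ring") rollout; not CrowdStrike's system.
import random

RINGS = {"canary": 0.01, "early": 0.10, "broad": 0.50, "full": 1.00}

def crash_rate_after_deploy(fraction: float) -> float:
    # Stand-in for real telemetry (sensor heartbeats, crash reports, etc.).
    return random.uniform(0.0, 0.002)

def staged_rollout(max_crash_rate: float = 0.001) -> None:
    for ring, fraction in RINGS.items():
        print(f"Deploying to {ring} ring ({fraction:.0%} of fleet)")
        rate = crash_rate_after_deploy(fraction)
        if rate > max_crash_rate:
            print(f"Crash rate {rate:.3%} exceeds threshold; halting and rolling back")
            return
        print(f"Crash rate {rate:.3%} within threshold; promoting to next ring")
    print("Rollout complete")

staged_rollout()
```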

CrowdStrike has also sent $10 in Uber Eats credits to IT staff for the “additional work” they put in helping CrowdStrike clients recover, TechCrunch reported. The email, sent by CrowdStrike Chief Business Officer Daniel Bernard, said in part, “To express our gratitude, your next cup of coffee or late night snack is on us!” A CrowdStrike representative confirmed to TechCrunch that the Uber Eats coupons were flagged as fraud by Uber due to high usage rates.

On July 25, CrowdStrike CEO Kurtz took to LinkedIn to assure customers that the company “will not rest until we achieve full recovery.”

“Our recovery efforts have been enhanced thanks to the development of automatic recovery techniques and by mobilizing all our resources to support our customers,” he wrote.

What went wrong with CrowdStrike testing?

CrowdStrike’s review of its testing shortcomings noted that, whereas rigorous testing processes are applied to new versions of its Sensor Content, Rapid Response Content, which is delivered as a configuration update to Falcon sensors, goes through less-rigorous validation.

In developing Rapid Response Content, CrowdStrike uses its Content Configuration System to create Template Instances that describe the hallmarks of malicious activity to be detected, storing them in Channel Files that it then tests with a tool called the Content Validator.

According to the company, disaster struck when two Template Instances were deployed on July 19. “Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data,” CrowdStrike said in its review.
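CrowdStrike has not described the Content Validator’s internals, but the gap it reports, a Template Instance that passes validation despite carrying bad data, is easy to picture with a toy check. The fields below (pattern, param_index) are hypothetical and chosen to mirror the out-of-bounds scenario sketched earlier; the point is simply that a validator must exercise every field under the same constraints the interpreter will enforce:

```python
# Toy validator sketch with hypothetical fields; not CrowdStrike's Content Validator.
MAX_PARAM_INDEX = 2   # the imagined interpreter has 3 parameter slots (0-2)

def validate_template_instance(instance: dict):
    errors = []
    if not instance.get("pattern"):
        errors.append("missing detection pattern")
    idx = instance.get("param_index")
    if not isinstance(idx, int) or not 0 <= idx <= MAX_PARAM_INDEX:
        errors.append(f"param_index {idx!r} outside 0-{MAX_PARAM_INDEX}")
    return errors

ok = {"pattern": r"\\.\pipe\suspicious_c2", "param_index": 1}
bad = {"pattern": r"\\.\pipe\suspicious_c2", "param_index": 20}

for instance in (ok, bad):
    problems = validate_template_instance(instance)
    print("PASS" if not problems else f"REJECT: {problems}")
```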

Industry experts and analysts have since come out to say that the practice of rushing through patches and pushing them directly to global environments has become mainstream, making it likely that another vendor could fall prey to this issue in the future.

How has recovery from the outage fared?

For many organizations, recovering from the outage is an ongoing issue. One suggested solution for remedying the defective content is to manually reboot each machine into safe mode, delete the defective file, and restart the computer; doing so at scale remains a challenge.

Some organizations with hardware refresh plans already in place are reportedly considering accelerating those plans, replacing affected machines rather than committing the resources necessary to apply the manual fix across their fleets.

On July 25, CrowdStrike CEO Kurtz posted to LinkedIn that “over 97% of Windows sensors are back online as of July 25.”

What is CrowdStrike Falcon?

CrowdStrike Falcon is endpoint detection and response (EDR) software that monitors end-user hardware devices across a network for suspicious activities and behavior, reacting automatically to block perceived threats and saving forensics data for further investigation.

Like all EDR platforms, CrowdStrike has deep visibility into everything happening on an endpoint device — processes, changes to registry settings, file and network activity — which it combines with data aggregation and analytics capabilities to recognize and counter threats by either automated processes or human intervention. 
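For a rough sense of the raw material an EDR agent works from, the user-space sketch below uses the psutil library to enumerate running processes and active network connections. It is a toy: Falcon operates at the kernel level and sees far more (registry changes, file I/O, named-pipe activity), and nothing here reflects CrowdStrike’s actual collection pipeline:

```python
# Toy user-space telemetry sketch using psutil (pip install psutil).
# Real EDR agents hook the kernel and capture far richer event streams.
import psutil

def snapshot():
    events = []
    for proc in psutil.process_iter(["pid", "name", "exe"]):
        events.append(("process", proc.info["pid"], proc.info["name"], proc.info["exe"]))
    # Enumerating connections may require elevated privileges on some systems.
    for conn in psutil.net_connections(kind="inet"):
        if conn.raddr:  # only connections with a remote endpoint
            events.append(("network", conn.pid, f"{conn.raddr.ip}:{conn.raddr.port}"))
    return events

for event in snapshot()[:10]:
    print(event)
```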

Because of this, Falcon is privileged software with deep administrative access to the systems it monitors. It is tightly integrated with the core operating system and can shut down activity it deems malicious. That tight integration proved to be a weakness for IT organizations in this instance, as the flawed Falcon update rendered Windows machines inoperable.

CrowdStrike has also introduced AI-powered automation capabilities into Falcon for IT, which the company says help bridge the gap between IT and security operations.

What has been the fallout of CrowdStrike’s failure?

In addition to dealing with fixing their Windows machines, IT leaders and their teams are evaluating lessons that can be gleaned from the incident, with many looking at ways to avoid single points of failure, re-evaluating their cloud strategies, and reassessing response and recovery plans. Industry thought leaders are also questioning the viability of administrative software with privileged access, like CrowdStrike’s. And as recovery nears completion, CISOs have cause to reflect and rethink key strategies.

As for CrowdStrike, US Congress has called on CEO Kurtz to testify at a hearing about the tech outage. According to the New York Times, Kurtz was sent a letter by Representative Mark Green (R-Tenn.), chairman of the Homeland Security Committee, and Representative Andrew Garbarino (R-NY).

Americans “deserve to know in detail how this incident happened and the mitigation steps CrowdStrike is taking,” they wrote in their letter to Kurtz, who faced a similar situation when, during his tenure as CTO of McAfee, the company pushed out a faulty antivirus update that impacted thousands of customers, triggering BSODs and creating the effect of a denial-of-service attack.

Financial impacts of the outage have yet to be fully tallied, but Derek Kilmer, a professional liability broker at Burns & Wilcox, said he expects insured losses to reach $1 billion, or potentially “much higher,” according to the Financial Times. Insurer Parametrix pegs total losses at $5.4 billion for US Fortune 500 companies alone, excluding Microsoft, Reuters reported.

Based on Microsoft’s initial estimate of 8.5 million Windows devices impacted, research firm J. Gold Associates has projected the IT remediation costs at $701 million, based on 12.75 million resource-hours needed from internal technical support teams to repair the machines. Couple that with Parametrix’s finding that “loss covered under cyber insurance policies is likely to be no more than 10% to 20%, due to many companies’ large risk retentions,” and the financial hit from the CrowdStrike outage is likely to be enormous.
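Working backward from those published figures gives a sense of the per-device burden the estimate assumes; the implied hourly rate below is derived from the totals, not stated by J. Gold Associates:

```python
# Back-of-the-envelope check on the published remediation estimate.
devices = 8_500_000          # Microsoft's estimate of impacted Windows devices
total_hours = 12_750_000     # J. Gold Associates' projected internal support hours
total_cost = 701_000_000     # projected IT remediation cost, USD

hours_per_device = total_hours / devices        # 1.5 hours per machine
implied_hourly_rate = total_cost / total_hours  # roughly $55 per support hour

print(f"{hours_per_device:.1f} hours per device, about ${implied_hourly_rate:.0f} per hour implied")
```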

In response to concerns around privileged access, Microsoft announced it is now prioritizing the reduction of kernel-level access for software applications, a move designed to enhance the overall security and resilience of the Windows operating system.

Questions have also been raised about suppliers’ responsibilities to provide quality assurance for their products, including warranties.

Delta Air Lines, which canceled nearly 7,000 flights, resulting in more than 175,000 refund requests, has hired lawyer David Boies to pursue damages from CrowdStrike and Microsoft, according to CNBC. The news outlet reports Delta’s estimated cost as a result of the outage at $500 million. Boies led the US government’s antitrust case against Microsoft in 2001. Delta CEO Ed Bastian told CNBC that the airline had to manually reset 40,000 servers and will “rethink Microsoft” for Delta’s future.

Meanwhile, CrowdStrike shareholders filed a class-action lawsuit against the company, arguing that CrowdStrike defrauded them by not revealing that its software validation process was faulty, resulting in the outage and a subsequent 32% decline in market value, totaling $25 billion.

Ongoing coverage of the CrowdStrike failure

News

July 19: Blue screen of death strikes crowd of CrowdStrike servers 

July 20: CrowdStrike CEO apologizes for crashing IT systems around the world, details fix 

July 22: CrowdStrike incident has CIOs rethinking their cloud strategies 

July 22: Microsoft pins Windows outage on EU-enforced ‘interoperability’ deal 

July 24: CrowdStrike blames testing shortcomings for Windows meltdown

July 26: 97 per cent of CrowdStrike Windows sensors back online

July 26: Counting the cost of CrowdStrike: the bug that bit billions

July 29: CrowdStrike was not the only security vendor vulnerable to hasty testing

July 29: Microsoft shifts focus to kernel-level security after CrowdStrike incident

Aug. 1: Delta Airlines to ‘rethink Microsoft’ in wake of CrowdStrike outage

Analysis

July 20: Put not your trust in Windows — or CrowdStrike 

July 22: Early IT takeaways from the CrowdStrike outage 

July 24: CrowdStrike meltdown highlights IT’s weakest link: Too much administration

July 25: CIOs must reassess cloud concentration risk post-CrowdStrike

July 29: CrowdStrike debacle underscores importance of having a plan

July 30: CrowdStrike crisis gives CISOs opportunity to rethink key strategies

Originally published on July 23, 2024, this article has been updated to reflect evolving developments.

The last thing any CIO wants is to experience catastrophic operational issues during a peak season, but that’s exactly what executives at Southwest Airlines faced last week. While weather may have been the root cause, the 16,000 flights canceled between Dec. 19 and 28 far exceeded the operational impact felt by any other airline.

Experts point to Southwest’s point-to-point operating model as problematic in recovering from major weather issues compared to the hub-and-spoke model used by many major airlines. But Southwest’s technology was also cited by experts and the company’s leadership as contributing to the calamity. Casey A. Murray, president of the Southwest Airlines Pilots Association, described “IT and infrastructure from the 1990s,” while Helane Becker, an aviation analyst with Cowen, said, “Southwest has always been a laggard when it comes to technology.”

Even before the blizzard hit, Southwest Airlines CEO Bob Jordan acknowledged on Nov. 30, “We’re behind. As we’ve grown, we’ve outrun our tools. If you’re in an airport, there’s a lot of paper, just turning an aircraft.”

Surely many more details about this failure will surface over the next several months. CIOs know that technology is often the first thing blamed when businesses experience operational disasters, but we also know that culture and process issues can be primary, and often untold, contributors, both well within the CIO’s purview.

So, I’ll use this opportunity to point out what questions CIOs should be asking about their enterprises based on what we can already discern from last week’s Southwest Airlines IT disaster.  

1. Are you investing enough in digital transformation?

Southwest Airlines recently announced a quarterly dividend, paying out to shareholders starting Jan. 31, that amounts to $428 million a year. The company also received $7 billion in pandemic aid and performed $5.6 billion in stock buybacks between 2017 and 2019.

And how much are they investing in their digital transformation? In 2017, Fast Company wrote that Southwest Airlines’ digital transformation “takes off” with an $800 million technology overhaul, but only $300 million was dedicated to new technology for operations.

The investment seems minuscule given that Southwest Airlines was a $33 billion to $38 billion market capitalization airline in 2017. Its market cap has dropped significantly since then, but considering what’s being spent on buybacks and dividends, shouldn’t the airline have invested more to accelerate its transformation?

And that’s my question for CIOs: Are you investing enough in digital transformation? Do you have strong relationships with the other top executives and the board to raise the bar if your enterprise lags behind competitors or if legacy systems and technical debt pose a significant operational risk?

While CIOs must recession-proof their digital transformation priorities, underinvesting and slowing down can negatively affect customers, employees, and financial results. And if that doesn’t sway the executive committee, perhaps Southwest’s near 16% drop in stock price over December and the fear of having to respond to a federal investigation will get their attention.

2. What tools and protocols aid communications during a crisis?

According to CEO Jordan, Southwest does not have a quick, automated way to contact crew members who get reassigned. “Someone needs to call them or chase them down in the airport and tell them,” he said.

I’m having a hard time believing that Southwest, let alone any major enterprise, doesn’t have technologies and automated procedures to reach employees to inform them of operational changes. And during a crisis, organizations should have procedures outlined by human resources and supported by multiple technologies to reach employees, ensure their safety, and provide protocols to support operations.

Another key question is whether call centers are staffed and have scalable technologies to support a massive influx of calls and communications that often happen during a crisis. 

While we should all sympathize with customers impacted by a crisis, organization leaders must also consider employees and their well-being. Murray reported that pilots and crew waited hours to speak to staff about reassignments, and hundreds of pilots and crew members slept in airports next to passengers.

3. How quickly can you realign operations during a crisis?

Looking beyond operations, do leaders and managers have collaboration tools, real-time reporting dashboards, and forecasting machine learning models to aid in decision-making? How often do teams schedule tabletop exercises to play out what-if scenarios? Has IT invested in or piloted a digital twin to help model operational changes and support decision-making during a crisis?

Southwest, like other airlines, relies on scheduling software to route pilots, crew, planes, and other equipment. But when things go wrong at a significant scale, relying on manual operations is highly problematic. “It requires a lot more human intervention and human eyesight or brainpower and can only handle so much,” said Brian Brown, president of Transport Workers Union Local 550, which represents Southwest dispatchers and meteorologists.

4. Is your organization learning from past failures?

This isn’t the first time Southwest Airlines has canceled flights and named weather as one of the causes. The airline canceled over 1,800 flights over a weekend in 2021, cancellations that Southwest’s pilots’ union attributed to management’s “poor planning.”

All too often, you see organizations recover from a crisis, fix a few low-hanging issues, and go back to business as usual. The question for CIOs is whether they can use a crisis to demonstrate a strong enough business case around more holistic improvements.  

5. Does your organization have the culture to support software development?

Developing and maintaining proprietary software and customizations entails an ongoing commitment to talent development, product management disciplines, and DevOps practices. It requires prudent decision-making on what capabilities to invest in and when platforms have reached their end-of-life and require app modernizations.

SkySolver, the software Southwest uses for crew assignment, is off-the-shelf software developed decades ago that the airline has heavily customized. The software is at the root of Southwest’s delays in restoring operations, and I suspect the company’s IT leaders will now have the support to replace it.

Of course, no one wants to wait for a disaster to drive legacy modernizations, especially around complex operational systems. Too much urgency and stress can drive teams to select suboptimal partners, make costly architectural mistakes, or underinvest in scalability, quality, or security.

So the key question for CIOs is how they use this crisis to educate boards and executive committees on the fundamentals of agile software development and cloud operations. Many executives still believe that software development is a one-time investment, that maintenance budgets are discretionary, and that just moving to the cloud will solve IT infrastructure bottlenecks.  

CIOs know never to waste a good crisis to drive mindset changes. Using today’s headlines to ask the tough questions can be a catalyst for gaining new supporters and investment in digital transformation.
