CrowdStrike Outage: A Faulty Update Causes Worldwide Problems
CIPR News caught up with Dr. Gregory Bird, Professor of Cybersecurity for Liberty and Southern New Hampshire Universities, to discuss the world’s biggest IT outage in recent years.
Critical infrastructure and other organisations worldwide suffered large-scale disruption caused by a defective content update.
As a leading cybersecurity firm, CrowdStrike supplies software to countless organizations across the globe, from small businesses to multinational corporations. On July 19, 2024, a faulty update to CrowdStrike’s Falcon Sensor security software caused a widespread IT outage affecting millions of Microsoft Windows computers worldwide, an incident that initially sparked fears of a cyber attack.
The faulty update affected Windows systems, the platform running the majority of computers worldwide, and that widespread reliance amplified the impact of the outage. The incident was unprecedented in scale and caused significant disruption to businesses and individuals globally.
So, how did we get here, and are we in a more secure or less secure position as a result of so much IT infrastructure being under the control of so few companies?
Modern businesses rely heavily on digital infrastructure, and many of their systems are interconnected and interdependent. A disruption at one point can create a ripple effect throughout various industries.
Many critical infrastructure sectors, such as healthcare, finance, and transportation, depend on robust IT systems. The outage disrupted essential services across these sectors, in some cases for many days or even weeks.
CIPR News discussed the situation with Dr Bird, who began by explaining what went wrong.
“The global outage was caused by one of CrowdStrike Falcon’s regular updates, much like your antivirus type of updates. And because it’s updated more frequently, it doesn’t go through the same type of review process that the actual sensor content updates do; it actually bypasses all of those staging controls within CrowdStrike Falcon.
“The reason for this is it’s really designed to get out to the users as fast as possible, to help ensure that whatever virus or malware is out there is blocked as fast as possible, and to help reduce the likelihood that zero-day types of attack are able to actually get through. Unfortunately, many of the system administrators out there did not understand how the staging protocols actually worked in CrowdStrike Falcon. Many of them assumed that all updates would follow that staging, not realizing that there are two different parts to the updates, and only one followed that staging protocol,” Dr Gregory Bird explained.
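To make the distinction Dr Bird describes concrete, the short Python sketch below models it with purely hypothetical names and data structures; it is not CrowdStrike’s actual configuration or API. Sensor updates are gated by an administrator’s staging policy, while rapid content updates bypass it and reach every host at once.

```python
# Illustrative sketch only: hypothetical names, not CrowdStrike's actual
# configuration or API. It models the distinction described above: sensor
# updates honour an admin-configured staging policy, while rapid content
# updates are pushed to every host immediately.
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    staging_policy: str  # "latest", "n-1", or "n-2"


def gated_sensor_version(policy: str, released_versions: list[str]) -> str:
    """Pick the sensor version a host should run; released_versions is newest first."""
    offsets = {"latest": 0, "n-1": 1, "n-2": 2}
    return released_versions[offsets.get(policy, 0)]


def push_updates(hosts: list[Host], released_versions: list[str], content_id: str) -> None:
    for host in hosts:
        # Sensor version updates respect the staging policy the administrator set.
        target = gated_sensor_version(host.staging_policy, released_versions)
        print(f"{host.name}: sensor update gated to {target} (policy {host.staging_policy})")
        # Rapid content updates bypass that policy and reach every host at once,
        # which is why a single defective content file can affect an entire fleet.
        print(f"{host.name}: rapid content update {content_id} applied immediately")


push_updates(
    hosts=[Host("dc-01", "n-1"), Host("pos-17", "latest")],
    released_versions=["7.16", "7.15", "7.14"],
    content_id="example-content-001",
)
```

Running the sketch shows why a defective rapid content file could reach every machine regardless of how carefully the sensor staging policy had been configured.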
At this stage the incident is perceived as a genuine human error in testing and deployment, rather than an insider threat or a malicious cyber attack.
So why was the update pushed out so quickly, without proper testing?
This is not an unusual phenomenon; it has caused issues with updates from other IT companies in the past, just never as noticeably as in this recent global outage.
The update did go through the company’s testing; it was the automated testing that failed. The reason these updates are pushed out faster, without the typical certification and signing protocols used for things like drivers, is that the longer deployment takes, the more susceptible systems are to potential cyber attack. Much like typical antivirus and malware signature updates, they are very time sensitive, and getting them out quickly helps ensure that systems are properly protected.
CrowdStrike has since released a number of measures it plans to implement: first, to give these updates more thorough testing and better validation, to help ensure this particular error doesn’t happen again, and second, to reduce the future potential for things like insider threats.
The company also plans to bring these updates under its staging protocol, and to stagger future releases so that not all users receive them at exactly the same time. If something like this were to happen again, that would allow a little more time to respond before the problem propagates out to the entire world.
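A staggered release of that kind is often built around deterministic rollout “rings”. The sketch below is a minimal illustration with assumed ring sizes and hypothetical host names, not a description of CrowdStrike’s planned mechanism: each host is hashed into a ring, so a fault caught in an early ring never reaches the later ones.

```python
# Minimal sketch of a staggered, ring-based rollout, assuming hypothetical host
# names and ring sizes; it is not any vendor's actual implementation. Hosts are
# bucketed deterministically, so a fault that surfaces in an early ring can halt
# the release before it reaches the rest of the fleet.
import hashlib


def rollout_ring(host_id: str, ring_weights: list[int]) -> int:
    """Map a host to a ring: weights [1, 9, 40, 50] put ~1% of hosts in ring 0,
    the next ~9% in ring 1, and so on (weights must sum to 100)."""
    bucket = int(hashlib.sha256(host_id.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for ring, weight in enumerate(ring_weights):
        cumulative += weight
        if bucket < cumulative:
            return ring
    return len(ring_weights) - 1


def hosts_reached(hosts: list[str], current_ring: int, ring_weights: list[int]) -> list[str]:
    """Return only the hosts whose ring has already been promoted."""
    return [h for h in hosts if rollout_ring(h, ring_weights) <= current_ring]


fleet = [f"host-{i:04d}" for i in range(1000)]
weights = [1, 9, 40, 50]  # canary, early adopters, broad, everyone else
for ring in range(len(weights)):
    targeted = hosts_reached(fleet, ring, weights)
    print(f"ring {ring}: update now on {len(targeted)} of {len(fleet)} hosts")
    # In practice the release would pause here, watch crash and health telemetry,
    # and stop promoting to the next ring if errors start to appear.
```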
So how can this be avoided in future? Better training of staff in terms of process?
“Yes. Well, I would say better training of staff. They’re supposed to be looking at some of their automated processes to help ensure that when errors do occur, they’re properly getting kicked out and not pushed out. And the other thing is really getting their staff to pay more attention to detail so that they aren’t allowing these types of errors to proceed forward,” added Dr Gregory Bird.
Given that it affected so much of the global IT infrastructure, and what the consequences might have been had this been a cyber attack, how have we got to the position that so much of that vulnerable infrastructure is in the hands of so few?
What could be done with these organisations without compromising security?
How can we build better protection into a market dominated by such a small number of companies?
“So, this incident is a glaring reminder that relying on a single entity for a service, regardless of the vendor’s reputation, creates a very serious and dangerous single point of failure. Companies really need to take this into account regularly when they’re assessing their incident response and disaster recovery plans.”
“While implementing multiple layers with multiple vendors can benefit business continuity and protect your critical operations, the downside is that it can also dramatically increase the complexity of your architecture and can introduce compatibility issues. So this type of event will happen again, and it can happen with any vendor and any product.”
“The big thing is that organizations need to not only look at building more redundancy into their architectures, but also focus on resiliency. This type of event isn’t new. We’ve seen it before,” continued Dr Bird, referring to the similar impacts of the SolarWinds incident and of McAfee’s faulty update back in 2010.
There are always threats like ransomware that have the potential to cause a similarly widespread impact.
Organizations need to have a long-term memory when they’re assessing their risk. They need to ensure they’re properly conducting a business impact analysis and creating their disaster recovery and incident response plans.
But not only create them; they need to ensure that they’re actually testing and validating those plans. Along with that, conduct training to ensure employees know not only how to enact the various protocols, but also their individual roles. Going through actual training and live exercises of those plans really helps to identify and highlight any gaps they may have.
“Tabletop exercises are an excellent way that organisations can actually test all of these in a safe type of setting without potentially impacting their environment. And while there are some great companies out there creating tabletop exercises that companies can procure, the Cybersecurity & Infrastructure Security Agency (CISA) also offers tabletop exercise packages, or CTEPs, which are basically exercises in a box. They’re a comprehensive set of resources designed to really assist stakeholders in conducting their own exercises,” commented Dr Bird.
So how does a company build resilience and a business continuity plan if there are so few alternatives?
Well, in this case, fortunately, there are other alternatives out there in the market. In fact, almost as soon as this incident happened, advertisements for many of those companies started appearing.
Indeed, the stock price of one of CrowdStrike’s top competitors jumped about 6% within the first week, so there is clearly some interest in those alternatives.
There are things that companies really need to look at when they’re selecting their vendors, though. While many may tend to gravitate towards whoever has the biggest name, incidents like this one are among the factors that need to be evaluated. The other piece is, as you work to build more resiliency into your environment, ensure that you actually have backups for your systems, that the backups work, and that they remain current.
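As a small illustration of that last point, the sketch below (with hypothetical file paths and thresholds, not any specific product’s tooling) checks that a backup exists, is recent, and still matches a recorded checksum; periodic test restores would complete the picture.

```python
# Minimal sketch, with hypothetical paths and thresholds, of the kind of routine
# check described above: confirm a backup exists, is recent enough, and still
# matches the checksum recorded for it. A real programme would go further and
# run periodic test restores as well.
import hashlib
import json
import time
from pathlib import Path

MAX_AGE_SECONDS = 24 * 60 * 60  # assumes a daily backup schedule


def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup: Path, manifest: Path) -> list[str]:
    """Return a list of problems found; an empty list means the backup passed."""
    if not backup.exists():
        return [f"backup missing: {backup}"]
    problems = []
    age = time.time() - backup.stat().st_mtime
    if age > MAX_AGE_SECONDS:
        problems.append(f"backup is stale ({age / 3600:.1f} hours old)")
    recorded = json.loads(manifest.read_text()).get(backup.name)
    if recorded and recorded != sha256_of(backup):
        problems.append("checksum mismatch: backup differs from the manifest")
    return problems


issues = verify_backup(Path("/backups/finance-db.dump"), Path("/backups/manifest.json"))
print("backup OK" if not issues else "\n".join(issues))
```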
Dr Gregory Bird noted, “Companies need to be more proactive. They need to work with their third-party suppliers, question them about their processes and what they do to ensure events like this don’t happen. They need to also assess their existing change management and test validation procedures.”
“Do they actually have systems that allow you to stage and stagger your deployments, or does everything just go out at once? Do they have actual staging environments to test this? Do they have canary-type users that they can do limited deployments on? If their tools don’t have the ability to stage those, look at possibly going to a vendor that does, or advocate with the vendor to add those types of capabilities.”
“Many companies I end up talking to don’t realize the power that they have to help influence future features with these companies. Many times, going out and talking to your service advisors or account representatives can go a long way towards actually getting these types of capabilities,” added Dr Bird.
One of the key sectors impacted was air travel, where affected systems are used to transmit Advance Passenger Information (API) and Passenger Name Record (PNR) data around the world. So what does an airport do, for example, in that particular instance?
“I can’t necessarily speak for everybody throughout the entire world, but definitely here in the U.S., work with your sector risk management agency, as well as the various governing bodies you have, to help influence many of those decisions. When tools do get forced on you, test them. Ensure you have personnel who are very fluent in those individual systems or tools, to help ensure that you can potentially avoid events and incidents like this.”
“When you don’t have a choice of what tools you use, definitely play with them extensively and work that into your various disaster recovery planning and continuity of operations. In cases like this, the companies that didn’t have all Windows machines, that had a mix with Linux and macOS, were still able to continue doing work. So for the areas where you can actually effect change and determine what type of products you’re using, look to expand that diversity,” commented Dr Bird.
Could this CrowdStrike-caused global IT outage lead to the introduction of legislation or other measures to increase competition and break up the biggest players, so that we are not so reliant on them?
Well, it definitely has the potential to increase overall competition within that particular cybersecurity niche. We shall have to wait and see whether policymakers take the initiative to reduce reliance on a small number of large players, or whether the market itself makes the move.
Any significant issue like this tends to open up the market to other companies, as well as to the creation of new companies that now see gaps they can step in and fill.
“It’s very unfortunate when incidents like this happen, but it definitely does open up the market more. And I think it helps spark some interest among users to assess what type of products they’re using. The companies that don’t have that long-term memory issue, that actually remember when these events happen, will many times go and assess: okay, how are we actually deploying this software? What software packages are we using? What can we actually do to build redundancy and resiliency into our overall architecture and our plans?”
“I will say that, while it is quite tragic and unfortunate that this happened, I am impressed with the response CrowdStrike has had so far, as far as their action plan to address this and ensure it doesn’t happen again in the future. I would love to advocate that other cybersecurity companies out there take note of this,” concluded Dr Bird.
In essence, the combination of CrowdStrike’s widespread use, the reliance on Windows, and the interconnected nature of global systems created a perfect storm: a massive outage with far-reaching consequences. It is something we need to guard against in the future, as the consequences next time could be even more severe.