Human factors of security and cybersecurity professor at University of Waterloo reflects on CrowdStrike global tech outage | Cybersecurity and Privacy Institute

Written by Dr. Kami Vaniea and Regina Ashna Singh

The Blue Screen of Death – a looming nightmare for many – became reality last week during a global tech outage set off by one of the largest cybersecurity companies in the world.

According to Global News, “the reason for the outage is a single software update originating from cybersecurity firm CrowdStrike. The faulty update has caused some computers running Windows to experience the Blue Screen of Death. In other words, instead of booting up as normal, affected computers are crashing. The update did not impact computers running Mac or Linux.”

Thankfully, the University of Waterloo was unaffected. Kami Vaniea, professor in the Department of Electrical and Computer Engineering, says the shutdown “hit the airlines, it hit the railways…it hit large companies such as hospitals and other big groups that care about security and therefore invested in their safety by hiring a security company, namely CrowdStrike.”

Microsoft estimates that “CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines.”

What led to the outage?

Vaniea, a member of the Cybersecurity and Privacy Institute (CPI), believes that, at its core, the CrowdStrike outage happened because of two related issues that were not handled well: testing and update management.

Normally, when popular software needs to be changed, the vendor creates an update, tests it, and releases it for users to install. In large organizations, like airlines, a system administrator will first check new updates to make sure they work as expected. Often, they will also perform a "staged deployment," which means they update the least critical computers first, ensure everything is stable, and then update the most critical devices.
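The staged deployment described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual tooling: the function `staged_rollout`, the idea of grouping machines into "rings," and the host names are all assumptions made for the example.

```python
from typing import Callable


def staged_rollout(
    rings: list[list[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Update rings of hosts in order, least critical ring first.

    After each ring is updated, every host in it is health-checked.
    If any host is unhealthy, the rollout stops, so later (more
    critical) rings keep running the old version.
    Returns the hosts that received the update.
    """
    updated: list[str] = []
    for ring in rings:
        for host in ring:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(host) for host in ring):
            break  # halt before touching more critical machines
    return updated


# Hypothetical rings, ordered from least to most critical.
rings = [["kiosk-1", "kiosk-2"], ["office-1"], ["booking-server"]]

# Simulate a faulty update: every updated host fails its health check.
touched = staged_rollout(rings, lambda host: None, lambda host: False)
print(touched)  # ['kiosk-1', 'kiosk-2']
```

In this sketch a faulty update only reaches the first ring; the booking server is never touched. An update that skips this process reaches every ring at once.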

Security tools, such as anti-virus, often bypass these tests when updating lists of “signatures” which are basically lists of ‘bad-thing’ patterns. Tests of such files are done by the security company, but typically not by client organizations because signatures are updated multiple times a day and are just lists, not computer code. It is also vital to block identified ‘bad things’ quickly; waiting for tests can be dangerous.

Prior to this particular outage, CrowdStrike issued an update that had an error no one knew about. Their internal testing should have caught the error, but it did not, and Professor Vaniea says, “We don't know why yet…” The error only occurs under very special circumstances that, in theory, should never arise. But on Friday, July 19, 2024, CrowdStrike published a configuration file that created those special circumstances. The file was automatically downloaded by all computers running their software, bypassing all testing normally done by system administrators. The combination of the old error no one had noticed and the new configuration file caused a problem in how Windows boots up.
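To see how an unnoticed error in code can sit harmlessly until a new data file triggers it, consider the toy Python sketch below. It is purely hypothetical and is not CrowdStrike's code or file format: the rule layout, the `"pattern"` key, and both function names are assumptions chosen only to illustrate the idea.

```python
def evaluate_rule(rule: dict, event: dict) -> bool:
    """Return True if the event matches the rule.

    Latent error: this assumes every rule contains a "pattern" key.
    Every rule shipped so far did, so testing never hit the failure.
    """
    return rule["pattern"] in event["command"]


def safe_evaluate_rule(rule: dict, event: dict) -> bool:
    """Defensive variant: a malformed rule is skipped instead of crashing."""
    pattern = rule.get("pattern")
    if pattern is None:
        return False
    return pattern in event["command"]


event = {"command": "download evil.exe"}

old_config = [{"pattern": "evil.exe"}]         # well-formed rule
new_config = [{"pattren": "other-malware"}]    # malformed rule (hypothetical typo)

print(evaluate_rule(old_config[0], event))     # old code, old config: works

try:
    evaluate_rule(new_config[0], event)        # old code, new config: fails
except KeyError as error:
    print("crash triggered by new config:", error)

print(safe_evaluate_rule(new_config[0], event))  # defensive code survives
```

In an ordinary program a failure like this ends one process. In software that loads while the operating system starts, a comparable failure can stop the whole machine from booting.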

How do people and companies impacted move forward?

Typically, issues caused by errors are fixed quickly and the public experiences only a few hours of downtime. To illustrate, Facebook experienced a mass outage in October 2021, but its team was able to rectify the problem in less than seven hours, as reported by The Verge. The CrowdStrike issue, however, is taking days to fix because system administrators must manually remove the problematic configuration file from every single affected device. The good news is that most employees with a computer science education are able to carry out this fix, and the majority of the companies impacted by the outage will have a dedicated team and sufficient resources to roll out the repair plan. Even so, visiting and fixing every computer takes time.

What are the implications of this global tech outage?

The CrowdStrike crisis brings several implications to the surface:

Human-Computer Interaction (HCI)

Professor Vaniea’s current research focuses on why both “normal” people and system administrators are reluctant to install updates. Below are just a couple of the numerous examples:

Risk: for system administrators, the potential for downtime or bugs makes them hesitant to install an update.

Prevention of loss: for “normal” people, the possibility of losing a specific program that their livelihood depends on can affect the decision to install an update. For example, imagine you were about to give a presentation and an “Update PowerPoint” dialog popped up. You might rationally decide to delay updating until after your presentation.

Lawsuits can also cause apps to be forcibly updated. For example, in 2012, the app Speak For Yourself, which helps kids who struggle to speak due to conditions like autism and cerebral palsy, was pulled from the Apple App Store due to a lawsuit. To prevent the app from vanishing from their devices, some parents disabled all communication with the Apple App Store.

Figure: flowchart giving an overview of the stages of updating and the issues respondents experience at each stage

Source: Tales of Software Updates: The process of updating software

Automatic updates are key

Professor Vaniea says “automatically updating is much safer” and advises organizations and users to adopt this strategy if they have not already. After an update is released, attackers look at the updated code, learn what it fixes, and then write attack code to target that specific weakness. Those who update quickly will be protected, but those who update slowly risk attack. For example, in 2017 Equifax lost data belonging to 143 million US customers. An update that could have fixed the problem had been available for two months before the breach. Vaniea says, “If Equifax had installed it, they would not have lost the data.”

Lack of diversity, power grids, and cyberattacks

The CrowdStrike tech outage is partially a testament to CrowdStrike’s market reach. “All these organizations are either CrowdStrike clients or someone who depends on a CrowdStrike client,” states Vaniea. She goes on to say, “While great for CrowdStrike, having so many organizations depend on one company creates a lack of diversity. If a vulnerability or flaw is found in that one dependency, then everyone depending on it is impacted.”

If one looks beyond large tech companies to infrastructure like power and water grids, it is natural to worry that something similar might happen. While it is possible, Vaniea points out that most of these systems are fairly old, predating the modern tendency to network everything. That means they all run different software, and while that software is likely not 100% secure, an attack that works on one part of the grid is unlikely to work on other parts. In other words, these grids have a diversity of systems, so one attack is unlikely to impact the whole grid.

Vaniea says, “The modern approach of running the same software on many systems makes it easier to keep them updated, fix identified vulnerabilities, and provide maintenance. But it also means that if an attacker finds a vulnerability, then every computer of that type is vulnerable.”

Further research and understanding of human behaviour with computers, enforcing automatic updates, and increasing diversity will help prevent a CrowdStrike-like disaster in the future.