On July 19, 2024, CrowdStrike rolled out a content update containing malware signatures to users of its Falcon endpoint protection product, and the update caused a major outage of Windows machines. Affected systems crashed with a Blue Screen of Death (BSOD) error, rendering them unusable. The event impacted 8.5 million Windows devices and numerous critical infrastructure services.
At the time of writing, most organizations have effectively contained the issue and recovered critical services. In the spirit of never letting a crisis go to waste, let’s look at some lessons learned from this outage and changes to consider.
Third-Party Outages as a Primary Incident Response Scenario
With increasing dependencies on third-party code, distributed services, and automated updates, it is essential to treat third-party outages or compromises as a primary incident scenario. Incident Response Plans should include procedures for taking affected services offline, implementing alternate access methods, managing crisis communications, notifying leadership and the board, and other actions typical of crisis scenarios.
Sound Product Incident Response
Concentration of Risk
There has been extensive commentary about vendor oligopolies and risk concentration. In my opinion, this concentration is a necessity in modern architecture and operations. I expect that most impacted CrowdStrike customers will not change their xDR solution but will instead invest in improving their business resilience plans.
Business Resilience Needs Further Investment
In today’s increasingly distributed business and technology environments, enhancing resilience requires alternative strategies to maintain access to critical services.
To mitigate operational risk in scenarios like the CrowdStrike outage, organizations should invest in solutions such as just-in-time VDI (Virtual Desktop Infrastructure) or secure browsers over BYOD (Bring Your Own Device) to ensure business continuity.
Mitigating Agents’ Risk
Organizations should inventory third-party agents running on their endpoints, servers, and cloud infrastructure. High-risk agents, especially those interacting with the kernel and system, should be covered by the Business Continuity Plan, and their vendors should provide adequate assurances. Mitigations should include the vendor’s monitoring of each update’s impact on performance, support for fail-safe mechanisms, agent security, and controlled deployment processes.
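As a rough illustration of the inventory step, the sketch below flags high-risk agents that the Business Continuity Plan does not yet cover. The `Agent` fields, the two-factor risk rule, and the sample agent names are all assumptions made for this example; a real inventory would come from an asset management or configuration management system and would likely use richer criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    # All fields are hypothetical attributes chosen for illustration.
    name: str
    runs_in_kernel: bool   # loads a kernel driver or similar privileged component
    auto_updates: bool     # vendor can push updates without customer approval
    covered_by_bcp: bool   # has an entry in the Business Continuity Plan


def risk_tier(agent: Agent) -> str:
    """Classify an agent with a simple, illustrative two-factor rule."""
    if agent.runs_in_kernel and agent.auto_updates:
        return "high"
    if agent.runs_in_kernel or agent.auto_updates:
        return "medium"
    return "low"


def bcp_gaps(agents: list[Agent]) -> list[str]:
    """Return the names of high-risk agents the BCP does not cover."""
    return [a.name for a in agents
            if risk_tier(a) == "high" and not a.covered_by_bcp]


# Made-up sample inventory.
inventory = [
    Agent("edr-sensor", runs_in_kernel=True, auto_updates=True, covered_by_bcp=False),
    Agent("log-shipper", runs_in_kernel=False, auto_updates=True, covered_by_bcp=False),
    Agent("backup-client", runs_in_kernel=False, auto_updates=False, covered_by_bcp=True),
]

print(bcp_gaps(inventory))  # ['edr-sensor']
```

The output of a check like this is a worklist: each listed agent needs a continuity plan entry and a conversation with its vendor about the assurances described above.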
Controlled Deployments
Vendors need to provide mechanisms for controlled deployments, encompassing both agent binaries and channel updates (the CrowdStrike outage was caused by a faulty channel update). These mechanisms should enable customers to have granular control over the desired speed and quality assurance levels for updates to production systems. To achieve this, both automation and a flexible policy engine are essential for scaling agent deployments. Flexible policies should support canary deployments, environment-specific deployments (Dev, QA, Test), and maintaining an n-1 or n-2 state. Note that opting for an n-1 or n-2 strategy for xDR’s channel updates may increase the risk of compromise.
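To make the idea of a flexible policy engine more concrete, here is a minimal sketch of how deployment rings might select an update version. It is not a description of any vendor's actual mechanism: `RingPolicy`, `versions_behind`, `soak_hours`, and the sample version numbers are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RingPolicy:
    # Hypothetical policy fields for illustration only.
    name: str
    versions_behind: int  # 0 = latest (n), 1 = n-1, 2 = n-2
    soak_hours: int       # minimum age of a release before this ring accepts it


def version_for_ring(releases: list[tuple[str, int]],
                     policy: RingPolicy) -> Optional[str]:
    """Pick the release a ring should run.

    `releases` is ordered newest first; each item is
    (version, hours_since_release). The ring skips the newest
    `versions_behind` releases, then takes the newest remaining
    release that has soaked for at least `soak_hours`.
    """
    candidates = releases[policy.versions_behind:]
    for version, age_hours in candidates:
        if age_hours >= policy.soak_hours:
            return version
    return None  # nothing eligible yet; the ring stays on its current version


# Made-up release history, newest first.
releases = [("5.3", 2), ("5.2", 30), ("5.1", 200)]

policies = [
    RingPolicy("canary", versions_behind=0, soak_hours=0),
    RingPolicy("qa", versions_behind=0, soak_hours=4),
    RingPolicy("production", versions_behind=1, soak_hours=48),
]

for p in policies:
    print(p.name, version_for_ring(releases, p))
# canary 5.3
# qa 5.2
# production 5.1
```

In this sketch the canary ring takes every release immediately, while production stays at least one version behind and waits for a soak period. The trade-off noted above still applies: the further a ring lags on detection content, the longer it runs without the newest protections.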
Essential Asset Inventory and Management
In response to the CrowdStrike incident, many IT departments had to manually update endpoints that were not centrally managed. Even more problematic were situations where recovery attempts failed (e.g., lost OS recovery keys). Robust asset inventory and centralized configuration management solutions across endpoints, servers, and cloud environments are crucial for maintaining business continuity and end-user productivity. These solutions must be tested for script and configuration change rollouts in complex scenarios like the CrowdStrike incident.
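A simple way to surface these gaps before an incident is to compare what is on the network with what is actually managed and recoverable. The sketch below assumes three hypothetical data sources (a discovery scan, the configuration management system, and a store of saved recovery keys), each reduced to a set of hostnames; the function and variable names are illustrative.

```python
def recovery_gaps(discovered: set[str],
                  managed: set[str],
                  keys_on_file: set[str]) -> dict[str, list[str]]:
    """Report endpoints that would need manual work during a recovery.

    All three inputs are sets of hostnames from hypothetical sources.
    """
    return {
        # Seen on the network but absent from central management.
        "unmanaged": sorted(discovered - managed),
        # No OS recovery key stored for the endpoint.
        "no_recovery_key": sorted(discovered - keys_on_file),
    }


# Made-up sample data.
discovered = {"lap-001", "lap-002", "srv-010", "kiosk-07"}
managed = {"lap-001", "lap-002", "srv-010"}
keys_on_file = {"lap-001", "srv-010"}

gaps = recovery_gaps(discovered, managed, keys_on_file)
print(gaps["unmanaged"])        # ['kiosk-07']
print(gaps["no_recovery_key"])  # ['kiosk-07', 'lap-002']
```

Running a comparison like this on a schedule turns the inventory into a recovery readiness check rather than a static list.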
Contractual Agreements
Vendor/customer agreements can be double-edged swords. As a customer of technology companies, it’s essential to include clauses that hold vendors accountable in case of major outages.
As a technology vendor, it’s crucial to protect your financial interests by avoiding unlimited indemnity in contracts. Instead, cap indemnification at a specific number of months of customer spend or at the total fees paid. If your organization has existing customer agreements with unlimited liability, have your legal team renegotiate these terms at contract renewal.
CrowdStrike’s standard terms and conditions outline the customer indemnification parameters. Impacted organizations can’t expect to get more than a refund of the fees paid to the company.
Cyber Insurance
Given the significant material, operational, and financial impacts on companies affected by the CrowdStrike incident, it’s crucial to ensure your cyber insurance policy provides the appropriate coverage.
While cyber insurance policies often cover “system failures,” which may include events like this outage, some policies contain exclusions, such as for third-party software, that can hinder business interruption claims. Review your policy and consult with your insurance provider/broker to make sure you are protected.
The CrowdStrike incident will go down as one of the largest IT outages in history. It is crucial to identify the factors that led to such a widespread impact and take appropriate actions to mitigate similar risks in the future.