Technology leaders woke up this morning to find that a software update by cybersecurity vendor CrowdStrike had gone wrong, disrupting major systems at numerous organizations. The impact has spread globally, with airports, governments, financial institutions, hospitals, ports, transportation hubs, and media outlets facing significant operational disruptions.
The outage has severe economic consequences and a widespread impact on the health and well-being of those affected. Emergency response services in some cities have been disrupted, and hospitals across the globe have had to cancel scheduled surgeries. Airlines, meanwhile, are urging people not to come to the airport (with American Airlines, Delta, and United halting operations for a time).
Earlier on Friday morning, CrowdStrike issued what seemed to be a routine software update to its Falcon sensor (endpoint protection, XDR, and CWP) software. The update caused Windows hosts running CrowdStrike Falcon (with its kernel-based threat protection) to fail to boot, getting stuck on a Blue Screen of Death. CrowdStrike CEO George Kurtz confirmed in an update on X this morning that “Mac and Linux hosts are not impacted.”
Because of the way the update was deployed, recovery options for affected machines are manual and thus limited: Administrators must attach a physical keyboard to each affected system, boot into safe mode, remove the faulty CrowdStrike update, and then reboot (see the official CrowdStrike knowledge-base article here). Some administrators have also stated that they’ve been unable to gain access to BitLocker hard-drive encryption keys to perform remediation steps. Administrators should follow CrowdStrike guidance via official channels to work around this issue if impacted.
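The widely circulated workaround, which should be confirmed against CrowdStrike’s official knowledge-base article before use, involves deleting the faulty channel file from the Falcon sensor’s driver directory once the host is in safe mode. As an illustration only, a minimal sketch of scripting that cleanup might look like the following; the directory and the C-00000291*.sys file pattern reflect public reporting and are assumptions here, not official guidance.

```python
# Illustrative sketch only -- confirm the exact directory and file pattern against
# CrowdStrike's official knowledge-base guidance before running anything.
# Assumes the host has been booted into safe mode and you have administrative rights.
import glob
import os

# Directory where the Falcon sensor keeps its channel files (per public reporting).
CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
# File pattern associated with the faulty update (assumption based on public reporting).
FAULTY_PATTERN = "C-00000291*.sys"

def remove_faulty_channel_files() -> None:
    matches = glob.glob(os.path.join(CROWDSTRIKE_DIR, FAULTY_PATTERN))
    if not matches:
        print("No matching channel files found; nothing to do.")
        return
    for path in matches:
        print(f"Removing {path}")
        os.remove(path)
    print("Done. Reboot the host normally and verify the Falcon sensor reconnects.")

if __name__ == "__main__":
    remove_faulty_channel_files()
```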
Forrester recommends that tech leaders do the following immediately:
- Empower authorized system administrators to fix the problems quickly and effectively. This includes backing up hard-disk encryption keys (BitLocker or a third-party equivalent), as these may be critical for recovery in instances such as this (see the sketch after this list), as well as using privileged identity management solutions for break-glass emergency situations.
- Communicate effectively and clearly. Be clear, both internally and externally, about the impacts, status, and progress of your remediation efforts. Enlist marketing and PR to craft that messaging. Stay grounded on the realistic impacts (not the theoretical worst-case scenario), and keep an even tone.
- Watch your back. Crisis events require an “all hands on deck” response, but be sure to reserve a few analysts to continue monitoring other systems. Threat actors may use this time to attack while you’re distracted.
- Pay attention to the vendor’s communications, and follow official advice. Use official channels for instructions on addressing issues; advice from social media may be inconsistent, conflicting, or outright incorrect and damaging.
- Look after your people. This disruption hit on Friday evening in some geographies, right as people were headed home for the weekend, but tech incidents like this need support from many employees, and your teams will be working 24/7 over the weekend to recover. Ensure that they have adequate coverage and rest breaks to avoid burnout and mistakes, and clearly communicate roles, responsibilities, and expectations.
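On the first bullet above, a minimal sketch of exporting BitLocker recovery-key details out of band could look like the following. It assumes Windows hosts, administrative rights, and the built-in manage-bde utility; the volume list and output share are hypothetical placeholders, and many organizations escrow these keys in Active Directory or Entra ID instead.

```python
# Minimal sketch: export BitLocker recovery-key (protector) details for a set of
# volumes so they are available out of band during a recovery. Assumes Windows,
# administrative rights, and the built-in manage-bde utility; the volume letters
# and output location are hypothetical placeholders.
import subprocess
from pathlib import Path

VOLUMES = ["C:"]                                      # adjust to the volumes you protect
OUTPUT_DIR = Path(r"\\backup-share\bitlocker-keys")   # hypothetical secure location

def export_recovery_info(volume: str) -> None:
    # 'manage-bde -protectors -get <volume>' lists the key protectors,
    # including the numerical recovery password, for that volume.
    result = subprocess.run(
        ["manage-bde", "-protectors", "-get", volume],
        capture_output=True, text=True, check=True,
    )
    out_file = OUTPUT_DIR / f"{volume.strip(':')}_protectors.txt"
    out_file.write_text(result.stdout)
    print(f"Saved protector details for {volume} to {out_file}")

if __name__ == "__main__":
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for vol in VOLUMES:
        export_recovery_info(vol)
```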
What To Do After The Crisis Subsides
Tech leaders should take the following steps once the immediate issue is fixed:
- Implement infrastructure automation. Infrastructure automation is a must-have for controlled and managed software rollouts. While automated recovery is not possible in this specific instance, tech leaders should use infrastructure automation wherever possible to avoid manual recovery procedures, and should develop rollback and regression capabilities and test them often to ensure that they can recover to a prior state (see the sketch after this list).
- Refresh and rehearse your IT outage response plan. Regular practice of major outage response plans is vital, as is acting on what you learn from each rehearsal. Tech leaders should develop the IT outage response plan and build contingencies and communications protocols for all major systems, services, and applications, along with the associated procedures for working with and restoring them. Create and practice a “back-out” procedure specifically for updates that don’t go as planned, so that you can return to a known, good state.
- Get unified, written warranties from security vendors on their quality assurance processes, as well as threat detection effectiveness. CrowdStrike offers a warranty if you suffer a breach while using its Falcon Complete platform, but this is specific to security breaches. Customers need to ask for business-interruption indemnification clauses covering a software update gone awry, such as the current CrowdStrike incident. For software that runs in trusted spaces with automatic updates, especially software that loads kernel modules or may otherwise affect operating system stability, this could be seen as a necessary step toward rebuilding trust.
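To make the infrastructure-automation point above concrete, here is a minimal, vendor-neutral sketch of a ring-based rollout that backs out automatically when health checks fail. The ring names, health probe, and deploy/rollback hooks are hypothetical placeholders for whatever configuration-management or deployment tooling you already use.

```python
# Minimal, vendor-neutral sketch of a staged (ring-based) rollout with an automatic
# back-out to the last known-good version when health checks fail. The ring names,
# health probe, and deploy/rollback hooks are hypothetical placeholders.
import time

RINGS = [
    {"name": "canary", "hosts": ["test-host-01"]},
    {"name": "early",  "hosts": ["app-host-01", "app-host-02"]},
    {"name": "broad",  "hosts": ["prod-host-%02d" % i for i in range(1, 11)]},
]

def deploy(host: str, version: str) -> None:
    print(f"Deploying {version} to {host}")      # call your config-management tooling here

def rollback(host: str, version: str) -> None:
    print(f"Rolling {host} back to {version}")   # restore the last known-good state

def healthy(host: str) -> bool:
    return True                                  # replace with a real boot/agent health probe

def staged_rollout(new_version: str, known_good: str) -> bool:
    for ring in RINGS:
        for host in ring["hosts"]:
            deploy(host, new_version)
        time.sleep(1)                            # soak time; tune per ring in practice
        failed = [h for h in ring["hosts"] if not healthy(h)]
        if failed:
            print(f"Health checks failed in ring '{ring['name']}': {failed}")
            for host in ring["hosts"]:
                rollback(host, known_good)
            return False
        print(f"Ring '{ring['name']}' healthy; proceeding")
    return True

if __name__ == "__main__":
    staged_rollout(new_version="new-build", known_good="last-known-good")
```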
What Tech Leaders Should Do In The Longer Term
Tech leaders should take the following longer-term steps:
- Reevaluate third-party risk strategy and approach. If a third-party risk management program is overly focused on compliance, you’ll likely miss significant events like this one that impact even compliant vendors. Tech leaders can’t afford to overlook assessing the vendor against multiple risk domains such as business continuity and operational resilience, not just cybersecurity. Tech leaders also need to map their third-party ecosystem to identify significant concentration risk among vendors, especially those that support critical systems or processes.
- Use the contract as a risk mitigation tool. Tech leaders along with procurement and legal teams should update language to include new security and risk clauses that assign accountability during disruptive events and clearly outline timeframes for vendors to patch and remediate. Consider using such incidents and their impacts as a basis for implementing measures in contracts or service-level agreements. If vendors push back, you’ll need to consider whether the price you negotiated still makes sense and, possibly, whether to do business with them at all.
This post was written by VP, Principal Analyst Andras Cser and it originally appeared here.