The recent CrowdStrike outage underscores a critical need in today’s digital landscape: The role of a chief resilience officer (CRO). As businesses increasingly rely on complex digital infrastructures, the risks of failures and their potential impact have escalated. The CrowdStrike incident is a stark reminder of why it is time for organizations to appoint a CRO.
A CRO’s primary responsibility is to ensure that resilience is embedded in every aspect of the organization. This involves overseeing comprehensive testing, robust change management and incident response protocols. The goal is to ensure that the company can withstand and quickly recover from disruptions, safeguarding both its reputation and bottom line.
Preventing a Single Point of Failure
The CrowdStrike outage was caused by a routine software update that went wrong, highlighting a critical gap in the company’s testing and validation processes. A CRO would ensure that such updates are thoroughly tested in a variety of environments before deployment, reducing the likelihood of similar incidents in the future. This role involves establishing rigorous testing protocols and maintaining a culture of continuous improvement. A CRO would also ensure that dependencies are understood and managed effectively, preventing a single point of failure from taking down critical services at organizations such as airlines, hospitals and shipping companies.
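One common way to test an update across environments before full deployment is a staged (or "ring") rollout, in which each group of machines must stay healthy before the next group receives the change. The sketch below is only an illustration of that idea, not a description of CrowdStrike's or any other vendor's actual process. The ring names, the `apply_update` callback and the `max_failure_rate` threshold are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    rings: Sequence[Tuple[str, List[str]]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> List[str]:
    """Apply an update ring by ring, halting at the first unhealthy ring.

    `rings` is an ordered sequence of (ring_name, hosts) pairs, smallest and
    least critical first. `apply_update` is a hypothetical callback that
    returns True if the host is healthy after the update.
    Returns the names of the rings that completed within the failure limit.
    """
    completed: List[str] = []
    for ring_name, hosts in rings:
        if not hosts:
            continue  # nothing to validate in an empty ring
        failures = sum(1 for host in hosts if not apply_update(host))
        if failures / len(hosts) > max_failure_rate:
            break  # stop here so later rings never receive the update
        completed.append(ring_name)
    return completed
```

In this sketch, a faulty update that breaks the second ring never reaches the third, which limits the blast radius to a small, pre-selected group of machines.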
The CRO would also oversee the implementation of standards and advanced monitoring and alert systems. These tools provide early warnings of potential issues, allowing for rapid response and mitigation. By integrating these tools into a comprehensive incident management framework, the CRO ensures that the organization is always prepared to address problems swiftly and effectively.
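To make the idea of an early-warning alert concrete, here is a minimal sketch of a sliding-window error-rate check. It is an assumption-laden illustration rather than how any particular monitoring product works. The class name `ErrorRateAlert`, the window size and the threshold are hypothetical choices.

```python
from collections import deque


class ErrorRateAlert:
    """Signal an alert when the failure rate over recent checks is too high."""

    def __init__(self, window_size: int = 20, threshold: float = 0.2) -> None:
        # Keep only the most recent `window_size` check results.
        self.results = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> bool:
        """Record one check result; return True if an alert should fire."""
        self.results.append(success)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge
        failures = self.results.count(False)
        return failures / len(self.results) >= self.threshold
```

Each call to `record` would typically follow a synthetic probe or health check. Waiting for a full window before alerting trades a little detection speed for fewer false alarms from a single failed check.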
Key Components of Resilience
Transparency and communication are key components of resilience. The CRO would ensure that the organization maintains open lines of communication with clients, stakeholders and the public, especially during crises. This builds trust and demonstrates the company’s commitment to accountability and resolution.
Additionally, the CRO would be responsible for fostering a culture of resilience within the organization. This includes regular training and drills for employees, ensuring they are prepared for crises and understand best practices in change management and incident response. A resilient organization is one where everyone is ready to act effectively when problems arise.
The role of the CRO is comparable to that of the chief information security officer (CISO), which became crucial in the late 1990s and early 2000s due to escalating cyberthreats. Just as the CISO is essential for managing security risks, the CRO is vital for managing the broader spectrum of resilience risks in today’s digital world.
IT Outages
The recent CrowdStrike incident is a stark reminder that our entire world is powered by digital experiences and that the internet is neither magically infallible nor inherently resilient. This outage affected critical sectors like healthcare, banking and travel. An estimated 3,400 flights were canceled, making it the worst day of the year for flight cancellations. Emergency 911 systems were impacted, transit was disrupted and people may have died in the process. This isn’t just about shopping carts; it’s about life and death. The whole world suddenly cares about IT outages, an often overlooked and underappreciated area in which even the simplest of issues can shut down the world in seconds.
According to a Forrester study, even small internet disruptions can lead to massive financial losses, with 39% of companies losing between $500,000 and $999,999 in a single month. The CrowdStrike outage was felt on a far larger scale, rippling across global economies and almost every industry. The full financial loss is hard to even begin to calculate. Here at Catchpoint, we have seen firsthand how comprehensive monitoring and a proactive approach to resilience can mitigate the impact of outages.
Mitigating Risks
Internet resilience can be defined as the capacity to ensure the availability, performance, reachability and reliability of the internet stack despite adverse conditions. It does not mean the internet will always be perfect or that downtime is a thing of the past. There will always be interruptions of service, but internet resilience means you can quickly bounce back and minimize impact. The internet was not built for the scale, volume and complexity at which it operates today. Famously, it started as an academic experiment, and its early inventors could not have envisaged a world in which global business, and much else besides, was reliant on it. Those origins, and their ramifications, continue to define some of the greatest challenges in internet resilience today.
The CrowdStrike incident highlights the urgent need for a chief resilience officer. By appointing a CRO, organizations can ensure they are prepared to navigate the complexities of the digital age, mitigating risks and maintaining stability in the face of disruptions. The time for a CRO is now, and the benefits are clear: Improved preparedness, swift recovery and a stronger, more resilient organization. For CrowdStrike itself, the true challenge is whether it can recover from this devastating blow. The outage is a wake-up call, and the lost trust and the damage to the company’s reputation may take years to rebuild.