Resilience is the ability of systems to continue to function adequately despite changing or challenging conditions. Testing systems under simulated versions of these conditions is essential to improving resilience and tracking improvements over time. This blog post provides a framework for adding cyber resilience to existing systems.
What Is Resilience?
Knowledge is the key to resilience, particularly the kind of knowledge that comes from understanding failure. Cybersecurity failures aren’t a mystery in 2024: many details about them are publicly available. However, we can’t rely on external intelligence alone to become resilient. Every organization has its own structure, constraints, and designs, which combine into a uniquely complex system that will fail in unique ways. Some of this knowledge must come from internal testing, lessons learned, and postmortems.
Resilience is sometimes tested in isolation (e.g., a targeted DDoS attack) or along with peers (e.g., the COVID-19 pandemic or the CrowdStrike outage). The failures that test it can stem from what Kelly Shortridge, author of Security Chaos Engineering, calls “acute stressors” (again, a targeted DDoS attack) or “chronic stressors” (tech debt, employee turnover, tool management overhead).
Finally, resilience simply describes a property of a system. A system can be fast, simple, inexpensive, or resilient (though probably not all of these at once). These properties are the results of its design and constraints. A resilient system is one that is designed to continue functioning despite a long list of potential stressors.
Adding Cyber Resilience to Existing Systems
We can create a cyber resilience process using a slightly modified version of the NIST Cybersecurity Framework (CSF) and Sounil Yu’s Cyber Defense Matrix (CDM):
- Identify: The process of discovering necessary security controls.
- Protect: Ideally, security controls will simply block or prevent negative outcomes.
- Detect: We assume some controls will fail, so we attempt to detect negative outcomes.
- Respond: If we do detect negative outcomes, we attempt to contain and eradicate them.
- Recover: We clean up the mess, returning systems to normal service.
- Improve: Similar to the lessons-learned stage of incident response, in this stage we attempt to identify every missing or broken control, find detection engineering opportunities, and improve the speed of response and recovery.
The improvement stage is where we can start building resilience into our systems. The idea is to build a continuous feedback loop that results in constant improvement.
While the NIST CSF might imply an attack must initiate this process, that doesn’t need to be the case. Another type of feedback loop we can use, shown in Figure 1, is Kennedy Torkura’s execute, monitor, plan, analyze, and knowledge (EMPAK) loop. The goal of the EMPAK loop is to force failures to occur so they can be understood and converted into resilience.
Inspiration for building tests in the execute stage of the EMPAK loop can come from a variety of sources:
- Past internal failures (i.e., internal intel)
- Other organizations’ failures (i.e., studies of breaches and cyber incidents, such as The DFIR Report)
- Threat intelligence on adversary tactics, techniques, and procedures, or TTPs (i.e., external intel)
- Meta studies of ransomware crew TTPs (e.g., this Tidal Cyber/Cyentia Institute report)
- Analysis of failures in other industries (because cybersecurity is less unique than we think)
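To make the idea of a forced-failure feedback loop more concrete, here is a minimal Python sketch. It is not Torkura's implementation: the `Experiment` and `KnowledgeBase` classes, the `inject` callback, and the way each step maps onto the EMPAK stages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Experiment:
    """A deliberately forced failure. All fields are hypothetical."""
    name: str
    source: str                 # where the test idea came from, e.g. "internal intel"
    inject: Callable[[], bool]  # runs the failure; returns True if the service stayed healthy


@dataclass
class KnowledgeBase:
    """Accumulated findings that feed the next round of experiments."""
    findings: list = field(default_factory=list)


def run_loop(experiments, knowledge):
    """Run each experiment once and record what was learned."""
    for exp in experiments:
        # Execute: force the failure instead of waiting for an attacker to cause it.
        stayed_healthy = exp.inject()

        # Monitor: capture what was observed during the experiment.
        finding = {
            "experiment": exp.name,
            "source": exp.source,
            "stayed_healthy": stayed_healthy,
        }

        # Analyze and plan: decide what to do with the result (assumed policy).
        if stayed_healthy:
            finding["follow_up"] = "keep as regression test"
        else:
            finding["follow_up"] = "fix control, then retest"

        # Knowledge: store the finding so later rounds can build on it.
        knowledge.findings.append(finding)
    return knowledge
```

In practice, `inject` would wrap whatever tooling actually simulates the failure; the point of the sketch is only that every run, pass or fail, ends up as recorded knowledge.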
Other Resilience-Focused Frameworks to Consider
Other frameworks of note include:
- NIST 800-160, which lists 14 techniques that can be used to improve resiliency. These are outlined in Figure 2, along with their designated NIST CSF function where applicable. An asterisk marks techniques that fall in the realm of IT (vs. security) design, architecture, or engineering.
- Sounil Yu’s DIE triad, where the acronym stands for distributed, immutable, and ephemeral—qualities associated with modern cloud engineering that also represent cyber-resilience opportunities.
- The cyber resilience review (CRR), which, like a business impact analysis (BIA), can help gather important information for a cyber resilience process. However, this framework does little to address cyber resilience directly; CRR assessments resemble traditional risk assessments and BIAs.
- The EU’s Digital Operational Resilience Act (DORA), which, despite its name, doesn’t seem overly concerned with resilience and focuses instead on traditional risk management and security program principles.
BIAs, CRRs, and regulations like DORA can be helpful when it comes to defining safety boundaries. If the security team isn’t clear on what constitutes a service failure, resilience testing, scoring, and improvements won’t deliver useful results.
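As a rough illustration of why safety boundaries matter for scoring, the Python sketch below treats a boundary as a pair of tolerances per service and scores a set of resilience drills against them. The class names, fields, and thresholds are hypothetical; real values would come from a BIA or a similar exercise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyBoundary:
    """What counts as a service failure (hypothetical fields)."""
    service: str
    max_downtime_minutes: float
    max_data_loss_minutes: float


@dataclass(frozen=True)
class DrillResult:
    """What was measured during one resilience test."""
    service: str
    downtime_minutes: float
    data_loss_minutes: float


def within_boundary(boundary, result):
    """True if the drill stayed inside both tolerances."""
    return (
        result.downtime_minutes <= boundary.max_downtime_minutes
        and result.data_loss_minutes <= boundary.max_data_loss_minutes
    )


def resilience_score(boundaries, results):
    """Fraction of drills that stayed within their service's boundary.

    Drills for services without a defined boundary cannot be scored,
    so they are reported separately instead of being counted.
    """
    scored = [r for r in results if r.service in boundaries]
    unscored = [r.service for r in results if r.service not in boundaries]
    if not scored:
        raise ValueError("no drill maps to a defined safety boundary")
    passed = sum(within_boundary(boundaries[r.service], r) for r in scored)
    return passed / len(scored), unscored
```

A drill against a service with no defined boundary ends up in the unscored list, which is the code-level version of the problem described above: without an agreed definition of failure, the result says nothing.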
Figure 2: NIST 800-160 Techniques to Improve Resiliency
Technique | Purpose | CDM/NIST CSF |
Adaptive Response | Optimize the ability to respond in a timely and appropriate manner. | Respond |
Analytic Monitoring | Monitor and detect adverse actions and conditions in a timely and actionable manner. | Detect |
Coordinated Protection | Implement a defense-in-depth strategy so adversaries have to overcome multiple obstacles. | Protect |
Deception | Mislead or confuse the adversary, hide critical assets from them, or expose covertly tainted assets to them. | Detect |
Diversity | Use heterogeneity to minimize common mode failures, particularly attacks exploiting common vulnerabilities. | *IT Design |
Dynamic Positioning | Increase the ability to rapidly recover from a non-adversarial incident (e.g., acts of nature) by distributing and diversifying the network. | *IT Design (DIE) |
Dynamic Representation | Keep representation of the network current. Enhance understanding of dependencies among cyber and non-cyber resources. Reveal patterns or trends in adversary behavior. | Identify/Detect |
Non-Persistence | Generate and retain resources as needed for a limited time. Reduce exposure to corruption, modifications, or compromise. | *IT Ops (DIE) |
Privilege Restriction | Restrict privileges based on attributes of users and system elements, as well as on environmental factors. | Protect (Zero Trust) |
Realignment | Minimize the connections between mission-critical and non-critical services, reducing the likelihood a failure of non-critical services will impact mission-critical services. | Protect (Zero Trust) |
Redundancy | Provide multiple instances of critical resources. | *IT Design (DIE) |
Segmentation | Define and separate system elements based on criticality and trustworthiness. | *IT Design |
Substantiated Integrity | Ascertain whether critical system elements have been corrupted. | Detect |
Unpredictability | Make changes randomly and unexpectedly. Increase an adversary's uncertainty regarding the protections they may encounter, making it more difficult for them to ascertain the appropriate course of action. | Protect |
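One possible use of the table above is a simple gap check: record which techniques an organization already applies and list the remainder by owning area. The Python sketch below assumes this workflow; the mapping is transcribed from Figure 2 with the asterisks and the DIE and Zero Trust annotations dropped, and the `coverage_gaps` function is hypothetical.

```python
from collections import defaultdict

# Technique -> owning area, transcribed from Figure 2 (annotations dropped).
TECHNIQUES = {
    "Adaptive Response": "Respond",
    "Analytic Monitoring": "Detect",
    "Coordinated Protection": "Protect",
    "Deception": "Detect",
    "Diversity": "IT Design",
    "Dynamic Positioning": "IT Design",
    "Dynamic Representation": "Identify/Detect",
    "Non-Persistence": "IT Ops",
    "Privilege Restriction": "Protect",
    "Realignment": "Protect",
    "Redundancy": "IT Design",
    "Segmentation": "IT Design",
    "Substantiated Integrity": "Detect",
    "Unpredictability": "Protect",
}


def coverage_gaps(implemented):
    """Group the techniques that are not yet implemented by owning area."""
    unknown = set(implemented) - set(TECHNIQUES)
    if unknown:
        raise ValueError(f"not a technique from Figure 2: {sorted(unknown)}")
    gaps = defaultdict(list)
    for technique, area in TECHNIQUES.items():
        if technique not in implemented:
            gaps[area].append(technique)
    return dict(gaps)
```

Gaps that land under the IT Design or IT Ops areas are the ones the security team is unlikely to close on its own.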
- Maintain a good rapport with IT: Sometimes system design and architecture are key to resilience, and they are typically outside the purview of the cybersecurity team. A good partnership with IT may be necessary to improve resilience.
- Be flexible and open to change: This recommendation comes from the National Academy of Sciences journal article Features of Resilience.
- Adopt feedback loops and a learning culture: This recommendation also comes from the Features of Resilience article.