Tips for Measuring and Improving Cyber Resilience

Posted on March 26, 2025 by Adrian Sanabria

Resilience is the ability of systems to continue to function adequately despite changing or challenging conditions. Testing while simulating these conditions is essential to improving resilience and tracking improvements over time. This blog provides a framework for adding cyber resilience to existing systems.

What Is Resilience?

Knowledge is the key to resilience—particularly, the kind of knowledge that comes from understanding failure. Cybersecurity failures aren’t a mystery in 2024. Many details about cybersecurity failures are publicly available. We can’t simply rely on external intelligence to become resilient, however. Every organization has its own unique structure, constraints, and designs that result in a uniquely complex system that will fail in unique ways. Some of this knowledge must also come from internal testing, lessons learned, and postmortems.

Resilience is sometimes tested in isolation (e.g., a targeted DDoS attack) or along with peers (e.g., the COVID-19 pandemic or the CrowdStrike outage). These failures can also occur due to what Kelly Shortridge, author of Security Chaos Engineering, calls “acute stressors” (again, a targeted DDoS attack) or “chronic stressors” (tech debt, employee turnover, tool management overhead).

Finally, resilience simply describes a property of a system. A system can be fast, simple, inexpensive, or resilient (though probably not all of these at once). These properties are the results of its design and constraints. A resilient system is one that is designed to continue functioning despite a long list of potential stressors.

Adding Cyber Resilience to Existing Systems

We can create a cyber resilience process using a slightly modified version of the NIST Cybersecurity Framework (CSF) and Sounil Yu’s Cyber Defense Matrix (CDM):

Identify: The process of discovering necessary security controls.
Protect: Ideally, security controls will simply block or prevent negative outcomes.
Detect: We assume some controls will fail, so we attempt to detect negative outcomes.
Respond: If we do detect negative outcomes, we attempt to contain and eradicate them.
Recover: We clean up the mess, returning systems to normal service.
Improvement: Similar to the lessons learned stage of incident response, in this stage, we attempt to identify every missing or broken control, find detection engineering opportunities, and improve the speed of response and recovery.

The improvement stage is where we can start building resilience into our systems. The idea is to build a continuous feedback loop that results in constant improvement.

While the NIST CSF might imply an attack must initiate this process, that doesn’t need to be the case. Another type of feedback loop we can use, shown in Figure 1, is Kennedy Torkura’s execute, monitor, plan, analyze, and knowledge (EMPAK) loop. The goal of the EMPAK loop is to force failures to occur so they can be understood and converted into resilience.

Figure 1: The EMPAK Loop (image not reproduced)

Source: IANS

Inspiration for building tests in the execute stage of the EMPAK loop can come from a variety of sources:

Past internal failures (i.e., internal intel)
Other organizations’ failures (i.e., studies of breaches and cyber incidents, such as The DFIR Report)
Threat intelligence on adversary tactics, techniques, and procedures, aka TTPs (i.e., external intel)
Meta studies of ransomware crew TTPs (e.g., the Tidal Cyber/Cyentia Institute report)
Analysis of failures in other industries (because cybersecurity is less unique than we think)
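As a rough illustration of how such tests could flow through the loop, here is a minimal Python sketch. It is not Torkura's implementation: every class, function, and field name is hypothetical, the "system" is simulated, and running analyze before plan is an assumption about how the stages connect.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Experiment:
    """One failure-injection test. All names here are hypothetical."""
    name: str
    source: str                  # where the idea came from, e.g. "past internal failure"
    inject: Callable[[], None]   # Execute: force the failure to occur
    observe: Callable[[], dict]  # Monitor: report which expected signals appeared


@dataclass
class KnowledgeBase:
    """Knowledge: accumulated findings that inform the next round of tests."""
    findings: list = field(default_factory=list)


def analyze(observation: dict) -> list:
    """Analyze: list every expected signal that did not appear."""
    return [signal for signal, seen in observation.items() if not seen]


def plan(gaps: list) -> list:
    """Plan: turn each gap into a remediation item."""
    return [f"add or repair control for: {gap}" for gap in gaps]


def run_empak(experiments: list, kb: KnowledgeBase) -> KnowledgeBase:
    """Run each experiment once and record what was learned."""
    for exp in experiments:
        exp.inject()
        observation = exp.observe()
        gaps = analyze(observation)
        kb.findings.append({
            "experiment": exp.name,
            "source": exp.source,
            "gaps": gaps,
            "actions": plan(gaps),
        })
    return kb


# Toy run against a simulated system (assumption: no real tooling is called).
state = {"agent_running": True}


def stop_agent() -> None:
    state["agent_running"] = False


def observe_agent_outage() -> dict:
    # Simulated outcome: no alert fired, but the service stayed up.
    return {"alert_raised": False, "service_available": True}


experiments = [
    Experiment("stop endpoint agent", "past internal failure",
               stop_agent, observe_agent_outage),
]
kb = run_empak(experiments, KnowledgeBase())
print(kb.findings[0]["actions"])
```

In this toy run the missing alert becomes a recorded gap and a remediation item, which is the kind of knowledge the loop is meant to produce.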

Other Resilience-Focused Frameworks to Consider

Other frameworks of note include:

NIST 800-160, which lists 14 techniques that can be used to improve resiliency. These are outlined in Figure 2, along with their designated NIST CSF function where applicable. Items with an asterisk denote where the technique falls in the realm of IT (vs. security) design, architecture, or engineering.
Sounil Yu’s DIE triad, where the acronym stands for distributed, immutable, and ephemeral—qualities associated with modern cloud engineering that also represent cyber-resilience opportunities.
Like a business impact analysis (BIA), a cyber resilience review (CRR) can help gather important information for a cyber resilience process. However, this framework does little to address cyber resilience directly. CRR assessments resemble traditional risk assessments and BIAs.
Despite its name, the EU’s Digital Operational Resilience Act (DORA) doesn’t seem overly concerned with resilience, instead focusing more on traditional risk management and security program principles.

BIAs, CRRs, and regulations like DORA can be helpful when it comes to defining safety boundaries. If the security team isn’t clear on what constitutes a service failure, resilience testing, scoring, and improvements won’t deliver useful results.

Figure 2: NIST 800-160 Techniques to Improve Resiliency

Technique	Purpose	CDM/NIST CSF
Adaptive Response	Optimize the ability to respond in a timely and appropriate manner.	Respond
Analytic Monitoring	Monitor and detect adverse actions and conditions in a timely and actionable manner.	Detect
Coordinated Protection	Implement a defense-in-depth strategy so adversaries have to overcome multiple obstacles.	Protect
Deception	Mislead, confuse, and hide critical assets from, or expose covertly tainted assets to, the adversary.	Detect
Diversity	Use heterogeneity to minimize common mode failures, particularly attacks exploiting common vulnerabilities.	*IT Design
Dynamic Positioning	Increase the ability to rapidly recover from a non-adversarial incident (e.g., acts of nature) by distributing and diversifying the network.	*IT Design (DIE)
Dynamic Representation	Keep representation of the network current. Enhance understanding of dependencies among cyber and non-cyber resources. Reveal patterns or trends in adversary behavior.	Identify/Detect
Non-Persistence	Generate and retain resources as needed for a limited time. Reduce exposure to corruption, modifications, or compromise.	*IT Ops (DIE)
Privilege Restriction	Restrict privileges based on attributes of users and system elements, as well as on environmental factors.	Protect (Zero Trust)
Realignment	Minimize the connections between mission-critical and non-critical services, reducing the likelihood that a failure of non-critical services will impact mission-critical services.	Protect (Zero Trust)
Redundancy	Provide multiple instances of critical resources.	*IT Design (DIE)
Segmentation	Define and separate system elements based on criticality and trustworthiness.	*IT Design
Substantiated Integrity	Ascertain whether critical system elements have been corrupted.	Detect
Unpredictability	Make changes randomly and unexpectedly. Increase an adversary's uncertainty regarding the protections they may encounter, making it more difficult for them to ascertain the appropriate course of action.	Protect

Source: IANS, 2025

Build Resilience Metrics

Testing provides a means of measuring improvements over time. A good testing regimen includes identifying tests that demonstrate resilience (even if only in nonproduction environments), running these tests regularly, and noting the outcomes.

Success or failure won’t necessarily be binary. A system struggling, but still functioning, during a DDoS attack might be considered a success for one organization (e.g., online banking), but a failure for another with strict performance requirements (e.g., online gaming).
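One way to capture this in a test harness is to grade the same measurements against each organization's own safety boundary. The sketch below is only an illustration: the SafetyBoundary type, the threshold values, the measured numbers, and the 0.5 cutoff for a clean pass are all assumptions, not figures from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyBoundary:
    """Hypothetical per-organization definition of 'functioning adequately'."""
    max_p95_latency_ms: float
    max_error_rate: float  # fraction of failed requests, 0.0 to 1.0


def grade(p95_latency_ms: float, error_rate: float,
          boundary: SafetyBoundary) -> str:
    """Return 'pass', 'degraded', or 'fail' instead of a yes/no answer."""
    latency_ratio = p95_latency_ms / boundary.max_p95_latency_ms
    error_ratio = error_rate / boundary.max_error_rate
    worst = max(latency_ratio, error_ratio)
    if worst <= 0.5:   # assumed cutoff: well inside the boundary
        return "pass"
    if worst <= 1.0:   # struggling, but still within the boundary
        return "degraded"
    return "fail"


# Illustrative boundaries only; real values would come from a BIA or SLA.
banking = SafetyBoundary(max_p95_latency_ms=3000, max_error_rate=0.02)
gaming = SafetyBoundary(max_p95_latency_ms=100, max_error_rate=0.01)

# The same measurements, taken during a simulated DDoS test.
measured = {"p95_latency_ms": 2400, "error_rate": 0.01}

banking_result = grade(**measured, boundary=banking)
gaming_result = grade(**measured, boundary=gaming)
print(banking_result, gaming_result)  # degraded fail
```

The identical test run lands inside the banking boundary (degraded, but functioning) and outside the gaming boundary, which is why the boundary needs to be defined before scores are reported.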

Consider the effort necessary to bypass the control. For example, if bypassing antivirus software is trivial for an adversary, stopping commodity malware in one of these tests might not demonstrate much resilience. The inverse is absolutely a failure in resilience: Commodity malware should not be able to adversely impact production systems.
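A simple way to reflect bypass effort in a score is to weight each test by how much work the simulated adversary had to do, and to flag any low-effort test that got through. The following is a sketch under stated assumptions: the three-level effort scale, the ControlTestResult type, and the sample results are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlTestResult:
    name: str
    effort: int    # assumed scale: 1 = commodity, 2 = modified tooling, 3 = custom
    blocked: bool  # True if the control stopped the test


def summarize(results: list) -> dict:
    """Weight blocked tests by adversary effort and flag commodity misses."""
    earned = sum(r.effort for r in results if r.blocked)
    possible = sum(r.effort for r in results)
    critical = [r.name for r in results if r.effort == 1 and not r.blocked]
    score = earned / possible if possible else 0.0
    return {"score": round(score, 2), "critical_failures": critical}


# Made-up sample results, not real test data.
results = [
    ControlTestResult("commodity ransomware sample", effort=1, blocked=True),
    ControlTestResult("repacked ransomware sample", effort=2, blocked=True),
    ControlTestResult("custom loader", effort=3, blocked=False),
    ControlTestResult("commodity macro dropper", effort=1, blocked=False),
]

summary = summarize(results)
print(summary)
```

With this weighting, blocking a commodity sample earns little credit, while any commodity sample that gets through is listed separately as a critical failure regardless of the overall score.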

Include a roadmap and plans to remediate failures in your report to leadership (in addition to scores).

Keys to Success

To successfully build resilience, organizations must:

Maintain a good rapport with IT: Sometimes system design and architecture are key to resilience, and they are typically outside the purview of the cybersecurity team. A good partnership with IT may be necessary to improve resilience.
Be flexible and open to change: From the National Academy of Sciences journal article, Features of Resilience.
Adopt feedback loops and a learning culture: Also from the Features of Resilience article cited above.

Contributors

Adrian Sanabria

Principal Researcher, The Defenders Initiative


Blogs posted to the RSAConference.com website are intended for educational purposes only and do not replace independent professional judgment. Statements of fact and opinions expressed are those of the blog author individually and, unless expressly stated to the contrary, are not the opinion or position of RSAC™ Conference, or any other co-sponsors. RSAC Conference does not endorse or approve, and assumes no responsibility for, the content, accuracy or completeness of the information presented in this blog.

