
System Resilience at Scale


Posted by Tatyana Sanchez

Cyberattacks are increasingly targeting critical sectors like water, energy, utilities, healthcare, and transportation. So, what can organizations do to prevent or recover from them? To better understand how to build system resilience at scale, I sat down to interview Ravi Teja Thutari, Senior Software Engineer at Hopper. Here are some key takeaways from our discussion.

Q. What is system resilience?

A. Cyber resilience is the ability to anticipate, withstand, recover from, and adapt to adverse events, and the same concept applies to system resilience: the ability to keep systems running even when something breaks or goes wrong, such as when an Application Programming Interface (API) stops responding or traffic surges.

Organizations should understand their threat landscape and create and implement a recovery and backup plan to mitigate such potential threats, thereby embracing system resilience.

Q. What is a common cause of large-scale outages?

A. In my experience working with systems that served millions of users, the most common cause of large-scale outages was an organization's over-reliance on external or third-party systems, such as external APIs. Without a backup plan, if those APIs go down, so do the organization's internal systems.

Except for causes like natural disasters, organizations have the ability to quickly recover from cyberattacks by being proactive. This involves designing systems with the expectation that they will eventually fail, even if they are currently working perfectly.

Q. What security tools contribute to system resilience?

A. Traditional security measures and strategies, such as firewalls and antivirus software, play a crucial role in enhancing system resilience against cyberattacks. Caching and retry strategies also contribute significantly to building system resilience:

  • Caching Strategy involves storing frequently accessed data in a temporary location. For example, an organization with automated product information data can keep it in a cache and retrieve it quickly. This is especially helpful when the main system is slow or down, as users will still be able to see useful data.
  • Retry Strategy is used when a system or operation fails: the organization attempts the failed operation again. Organizations should be careful not to retry too often or too quickly, as this can overload the system and cause further issues. A delay between retries gives the failed operation or system time to recover.
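The two strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular production system: the names `FallbackCache` and `retry_with_backoff`, their parameters, and the default values are all hypothetical. The doubling delay and the random jitter are common refinements that go beyond the plain "delay between retries" described in the interview.

```python
import random
import time


class FallbackCache:
    """Tiny in-memory cache that serves stale data if a refresh fails."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get(self, key, fetch):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: skip the main system entirely
        try:
            value = fetch()  # hypothetical call to the main system
        except Exception:
            if entry is not None:
                return entry[1]  # main system is down: serve the stale copy
            raise  # nothing cached yet, so the failure has to surface
        self._entries[key] = (now, value)
        return value


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call operation(); after each failure wait longer before trying again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the error
            # Double the wait on every attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A caller might combine the two, for example `cache.get("product-42", lambda: retry_with_backoff(load_product))`, where `load_product` stands in for whatever function reaches the main system. Users then see cached product data even if every retry fails.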

Q. What are indications of insufficient system resilience?

A. Many older applications weren’t built with adaptive security or resilience in mind. Some organizations still rely on outdated legacy systems, which causes issues in the long run. Systems slowing down, operations taking longer than usual, or errors popping up are early signs of insufficient system resilience.

Q. At times, a single failure can cause problems with other parts of a system. But when this happens in a production environment, how can organizations prioritize what needs to be fixed?

A. When a system fails or is down, organizations should immediately assess how this impacts users directly. Any issue that interferes with the user experience should be prioritized and resolved first. For example, if a product checkout page is down, an organization should prioritize and fix the page, conduct a quick post-mortem to understand what happened, and review any fallback strategies. Depending on the situation, they should also look for ways to prevent the issue from happening again, especially if it’s impacting user experience.

Q. What’s a major mistake teams make when building for scale?

A. One major mistake teams often make when building for scale is attempting to scale too early. When scaling too soon, teams might add multiple databases and queues before they're actually needed. This slows down development and makes things harder to debug.

Organizations should start simple when scaling. First, they should get the basic things working well and then scale step-by-step based on real usage. By doing this, real problems will guide the organization on where to grow instead of guessing and overbuilding from the start.

Q. How can teams make their systems resilient?

A. Aside from common security measures, one simple step all organizations can take right away is to set timeouts, especially for external APIs.

Without timeouts, if a service is slow or stuck, the system might sit there waiting, and that could block everything. A short timeout keeps the system "warm" and working. Adding timeouts is good practice and makes systems more resilient when things go wrong.
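As a rough illustration of that advice, here is a minimal Python sketch that puts a timeout on a call to an external API using only the standard library. The function name `fetch_with_timeout`, the two-second default, and the `fallback` parameter are assumptions made for this example. The right timeout value depends on the service being called.

```python
import urllib.request


def fetch_with_timeout(url, timeout_seconds=2.0, fallback=None):
    """Return the response body, or `fallback` if the call fails or stalls."""
    try:
        # `timeout` bounds each blocking socket operation (connect, read),
        # so a stuck service cannot hold this caller indefinitely.
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except OSError:
        # URLError, refused connections, and socket timeouts are all
        # OSError subclasses, so one handler covers them.
        return fallback
```

Note that the timeout here applies to each socket operation rather than to the request as a whole, so a service that trickles data slowly could still take longer than `timeout_seconds` in total. Returning a fallback value, such as cached data, lets the rest of the system keep working while the external API is unavailable.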

Learn more about building effective cyber resilience by visiting the RSAC Library. Have thoughts on the topic that you’d like to share? Keep the conversation going using #RSAC or join a group discussion in the RSAC Community.

Contributors
Tatyana Sanchez

Senior Coordinator, Content & Programming, RSAC

Blogs posted to the RSAConference.com website are intended for educational purposes only and do not replace independent professional judgment. Statements of fact and opinions expressed are those of the blog author individually and, unless expressly stated to the contrary, are not the opinion or position of RSAC™ Conference, or any other co-sponsors. RSAC Conference does not endorse or approve, and assumes no responsibility for, the content, accuracy or completeness of the information presented in this blog.

