Chaos Engineering to Improve System Resiliency

Organisations strive to scale their digital capabilities to increase revenue, drive business growth, and achieve operational excellence. However, in today’s technology-driven world, random system failures have become harder to forecast and prohibitively expensive to absorb. Unexpected failures hurt a business’s bottom line, which makes downtime a critical performance measure for engineers. These glitches may take the form of a network failure in one of the data centres, a misconfigured server, an unanticipated node failure, or any other fault that propagates across systems. Such interruptions often have devastating consequences for an organisation’s financial and reputational health.


A single hour of downtime may cost a business millions of dollars. According to Gartner, the average cost of IT downtime is $5,600 per minute, which works out to roughly $336,000 per hour. Because every organisation operates differently, the cost of downtime can range from $140,000 to $540,000 per hour. Enterprises cannot afford to wait for an outage to occur, so they should take proactive measures such as detecting system vulnerabilities and using chaos engineering methods to limit risk.


Chaos Engineering is the discipline of experimenting on large-scale systems to learn how they respond to random, real-world events. It is a methodical strategy for identifying issues before they escalate into outages: by testing how a system performs under stress, engineers can rapidly detect and rectify weaknesses. Although the frequency of production releases has grown significantly, it remains critical to ensure application reliability by meeting SLAs for application availability and customer satisfaction. The ultimate goal of chaos engineering is to tame the chaos induced by random events by methodically examining ways to improve a system’s robustness; traditional reliability engineering approaches, such as incident management and service recovery processes, may not be sufficient to mitigate the effect of failures. In practice, deliberate tests are run against systems to ascertain how they behave when such circumstances occur. Gartner predicts that by 2024, more than half of large companies will apply chaos engineering practices to their digital operations in pursuit of 99.999 per cent availability.


Netflix originally justified its adoption of chaos engineering by the need to tolerate unpredictable host failures while transitioning to AWS (Amazon Web Services). This culminated in Netflix’s 2010 release of Chaos Monkey. In 2011, Netflix expanded Chaos Monkey into the Simian Army, which added further failure injections so that more failure modes could be tested and resistance to them developed. In 2014, Netflix also created a new role: chaos engineer. It later launched the Failure Injection Testing (FIT) platform, which builds on the Simian Army’s ideas and is designed to strengthen systems against unpredictable events. With many firms migrating to cloud and microservice architectures in recent years, the need for chaos engineering has skyrocketed. Prominent technology companies, including Amazon, Netflix, LinkedIn, Facebook, Microsoft, and Google, actively use Chaos Engineering to increase the stability of their systems.


Chaos Engineering is based on the notion of conducting meaningful experiments on a system to elicit information about how it reacts to failures. The technique works much like a flu vaccination: just as the vaccine prompts your body’s immune system to produce antibodies that help fight the flu virus, chaos engineering injects small doses of failure so that a system builds resistance to the real thing. The process involves three stages:


Step 1


To begin, an application team comprising architects, developers, testers, and support engineers is formed with a few specific goals in mind. The first task is to identify a problem to inject and hypothesise its anticipated effect in terms of IT or business KPIs. To arrive at potential scenarios, the team can ask questions such as “What might go wrong?”, “What if X component fails?”, and “What would happen if my server runs out of memory?”; adopting a pessimistic attitude increases scenario coverage. Next, a hypothesis backlog should be created containing specifics about how the application could fail, the expected effect, measurement criteria, and restoration methods, among other things. Techniques such as brainstorming and analysis of incident records can be used to populate the backlog. The items can then be prioritised by their probability of occurrence and the consequences of failure, since investing effort and money to prevent every form of failure is virtually unattainable.
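
To make this concrete, the sketch below shows one way such a backlog could be modelled and prioritised in Python. The field names and the likelihood-times-severity scoring are illustrative assumptions, not part of any particular chaos tool.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One entry in the chaos hypothesis backlog."""
    description: str      # e.g. "What if the server runs out of memory?"
    expected_impact: str  # anticipated effect on IT or business KPIs
    metrics: list         # how the impact will be measured
    restoration: str      # how the system is expected to recover
    likelihood: int       # 1 (rare) .. 5 (frequent)
    severity: int         # 1 (minor) .. 5 (full outage)

    @property
    def risk_score(self) -> int:
        # Simple risk model: probability of occurrence x consequence of failure
        return self.likelihood * self.severity

backlog = [
    Hypothesis("Server runs out of memory", "Checkout latency doubles",
               ["p99 latency", "error rate"], "Orchestrator restarts the process", 4, 3),
    Hypothesis("Data centre loses network", "Traffic fails over to second region",
               ["availability", "failover time"], "DNS failover", 2, 5),
]

# Work the highest-risk hypotheses first
for h in sorted(backlog, key=lambda h: h.risk_score, reverse=True):
    print(f"{h.risk_score:>2}  {h.description}")
```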


Step 2


This step entails experimenting to determine the characteristics affecting a system’s availability and resilience, such as service levels and mean time to repair. The experiments are designed to induce failures, for example by raising CPU usage or producing a DNS outage.
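
As an illustration, a CPU-exhaustion experiment can be as simple as pinning every core with busy loops for a bounded period. The sketch below is a minimal, self-contained example of that idea, not how any particular chaos tool implements it.

```python
import multiprocessing
import time

def burn_cpu(seconds: float) -> None:
    """Busy-loop on one core until the deadline passes."""
    deadline = time.time() + seconds
    while time.time() < deadline:
        pass  # consume CPU cycles

if __name__ == "__main__":
    duration = 60  # keep the experiment short and bounded
    workers = [multiprocessing.Process(target=burn_cpu, args=(duration,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # the load stops by itself once the deadline passes
```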


During the earliest phases of chaos engineering adoption, the tests are conducted in a sandbox or pre-production environment. It is also critical to limit the blast radius, i.e. the share of the system an experiment can affect, to reduce its impact on the application. As confidence grows, the blast radius can be widened, eventually allowing the experiment to run in production.
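
One common way to keep the blast radius in check is to make it an explicit, capped parameter of every experiment. The helper below is a hypothetical sketch of that idea; the percentage and cap values are placeholders.

```python
import random

def pick_targets(hosts: list, blast_radius_pct: float, cap: int = 1) -> list:
    """Select a bounded random subset of hosts for fault injection.

    blast_radius_pct is widened gradually (e.g. 1% -> 5% -> 25%)
    as confidence in the system's resilience grows.
    """
    count = max(1, int(len(hosts) * blast_radius_pct / 100))
    return random.sample(hosts, min(count, cap, len(hosts)))

# Early experiments: at most one host out of a fleet of forty
print(pick_targets([f"host-{i}" for i in range(40)], blast_radius_pct=5, cap=1))
```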


For each experiment, one may need to record an experimental strategy (a minimal template is sketched after this list), comprising the following:


  1. Measurement of the steady-state
  2. The actions you will take to precipitate a failure
  3. The procedures that will be followed to monitor the application
  4. Metrics for assessing the failure’s effect
  5. Efforts to restore the system to a stable state
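
Such a plan can be captured as structured data so it is versioned alongside the application. The keys below simply mirror the five points above; the layout is an assumed convention, not a standard schema.

```python
experiment_plan = {
    "steady_state": {"metric": "checkout success rate", "threshold": 0.999},
    "failure_actions": ["raise CPU usage to 95% on one web host for 10 minutes"],
    "monitoring": ["watch p99 latency dashboard", "keep on-call alerting armed"],
    "impact_metrics": ["error rate", "mean time to repair"],
    "restoration": ["stop the CPU load", "verify steady state is re-established"],
}
```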


Step 3


This last phase determines the experiment’s success. If the metrics show a negative impact, the tests are halted and the failures are evaluated; a chaos experiment is regarded as successful only if it surfaces a failure. The necessary application fixes are then added to the product backlog. If the system proves durable, the experiments are repeated with an increased blast radius.
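
The abort-or-expand decision can itself be automated: keep checking the steady-state metric while the fault is active, halt the moment it degrades, and otherwise widen the blast radius on the next run. The loop below is a hedged sketch of that control flow; get_success_rate() stands in for whatever monitoring query a real team would use.

```python
import time

BASELINE = 0.999   # steady-state success rate measured before the experiment
TOLERANCE = 0.002  # degradation allowed before the experiment is aborted

def get_success_rate() -> float:
    """Placeholder: query your monitoring system here."""
    raise NotImplementedError

def run_experiment(inject, rollback, duration_s: int = 600) -> bool:
    """Return True if the fault degraded steady state (a 'successful' chaos experiment)."""
    inject()  # start the failure
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if get_success_rate() < BASELINE - TOLERANCE:
                return True   # a failure surfaced: stop and evaluate it
            time.sleep(10)
        return False          # system stayed durable: widen the blast radius next run
    finally:
        rollback()            # always restore the system to a stable state
```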


After the experiment is complete, the insights gained provide an understanding of the system’s real-world behaviour during random failures, helping engineering teams resolve weaknesses or define roll-back strategies. Chaos Engineering brings both business and technical advantages to a company. On the business side, it prevents substantial revenue losses, improves incident response, and strengthens on-call training for engineering teams as well as overall system resilience. On the technical side, the data collected during chaos experiments leads to a better understanding of system failure modes, improved system design, fewer recurring incidents, and reduced on-call pressure.


Numerous tools are available to help businesses undertake Chaos Engineering. Chaos Monkey, Gremlin, the Simian Army, Jepsen, and Spinnaker are just a few well-known options that can readily be applied inside an enterprise. For instance, running Jepsen against a distributed system can quickly surface its behaviour under faults such as process crashes, network partitions, and unpredictable load. Chaos Monkey, meanwhile, terminates instances at random in production to harden deployed services against sudden failures. The other tools listed each take their own approach to testing and improving product stability, and you can use any of them depending on your needs and budget. Organisations may also develop their own Chaos Engineering tools using open-source technology; while that route is lengthy and costly, it provides total control over the tool, customisation choices, and increased security.
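
For a flavour of what random instance termination involves, here is a deliberately simplified, hypothetical script in the spirit of Chaos Monkey, written against the AWS boto3 SDK. Real deployments add opt-in tagging, schedules, and kill-rate limits; the tag name used here is an assumption.

```python
import random
import boto3  # AWS SDK for Python

def terminate_random_instance(tag_value: str = "chaos-opt-in") -> None:
    """Pick one running, opted-in EC2 instance at random and terminate it."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:chaos", "Values": [tag_value]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return  # nothing opted in: keep the blast radius at zero
    victim = random.choice(instances)
    print(f"Terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```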


Chaos engineering should not be treated as a one-time exercise. Because applications are frequently updated to meet the demands of businesses and end-users, the likelihood of previously fixed vulnerabilities reappearing is significant, so it is critical to verify the application through continuous chaos testing. The team may want to build a regression pack of prioritised chaos experiments to validate the system’s resilience. If fully automated, these experiments can be incorporated into the DevOps pipeline and run as part of the weekly build to catch errors early in the product’s life cycle.
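
Once automated, the experiments can run as ordinary tests in the pipeline. The pytest-style sketch below assumes a run_experiment() helper like the one in Step 3; the fault-injection hooks are hypothetical stubs a real team would replace with its own.

```python
import pytest

# Hypothetical fault-injection hooks; a real pack would import the team's own.
def inject_cpu_load(): ...
def stop_cpu_load(): ...

def run_experiment(inject, rollback) -> bool:
    """Stand-in for the Step 3 loop: True means steady state degraded."""
    inject()
    try:
        return False  # replace with real monitoring checks
    finally:
        rollback()

# Prioritised chaos regression pack: (name, inject, rollback) triples
REGRESSION_PACK = [("cpu-exhaustion-web-tier", inject_cpu_load, stop_cpu_load)]

@pytest.mark.parametrize("name,inject,rollback", REGRESSION_PACK)
def test_chaos_regression(name, inject, rollback):
    # Fails the weekly build if a previously fixed weakness reappears
    assert not run_experiment(inject, rollback), f"{name}: steady state degraded"
```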


Predicting system failures has grown more challenging as application architectures have become more complex. Given the significant cost of downtime, companies should adopt a proactive approach, using chaos engineering methods to minimise incidents. Organisations should consider making chaos engineering part of their DevOps process, invest in chaos engineering tooling, and build up the capability to increase application stability.