Resiliency is the ability to recover from transient failures. The app’s recovery strategy restores normal function with minimal user impact. Failures can happen in cloud environments, and your app should respond in a way that minimizes downtime and data loss. In an ideal situation, your app handles failures gracefully without the user ever knowing there was a problem.
Because microservice environments can be volatile, design your apps to expect and handle partial failures. Partial failures include code exceptions, network outages, unresponsive server processes, and hardware failures. Even planned activities, such as moving containers to a different node within a Kubernetes cluster, can cause a transient failure.
Resiliency approaches
In designing resilient applications, you often have to choose between failing fast and graceful degradation. Failing fast means the application will immediately throw an error or exception when something goes wrong, rather than try to recover or work around the problem. This allows issues to be identified and fixed quickly. Graceful degradation means the application will try to keep operating in a limited capacity even when some component fails.
In cloud-native applications, it’s important for services to handle failures gracefully rather than fail fast. Because microservices are decentralized and independently deployable, partial failures are expected. Failing fast would allow a failure in one service to quickly take down dependent services, which reduces overall system resiliency. Instead, microservices should be coded to anticipate and tolerate both internal and external service failures. This graceful degradation allows the overall system to continue operating even if some services are disrupted. Critical user-facing functions can be sustained, avoiding a complete outage. Graceful failure also gives disrupted services time to recover or self-heal before they affect the rest of the system. For microservices-based applications, graceful degradation therefore aligns better with resiliency best practices such as fault isolation and rapid recovery, and it prevents local incidents from cascading across the system.
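The difference between the two approaches can be sketched in a few lines. The following is a minimal, language-neutral illustration written in Python, not code from this module: the recommendation service, the function names, and the `DEFAULT_RECOMMENDATIONS` fallback are all hypothetical.

```python
from typing import Callable, List

# Hypothetical reduced result used when the dependency is unavailable.
DEFAULT_RECOMMENDATIONS: List[str] = ["bestsellers"]


def fetch_recommendations_fail_fast(call_service: Callable[[], List[str]]) -> List[str]:
    """Fail fast: any error from the dependency propagates to the caller."""
    return call_service()


def fetch_recommendations_degraded(call_service: Callable[[], List[str]]) -> List[str]:
    """Graceful degradation: return a reduced result if the dependency fails."""
    try:
        return call_service()
    except (ConnectionError, TimeoutError):
        # The caller still gets usable, if generic, content.
        return DEFAULT_RECOMMENDATIONS


def broken_service() -> List[str]:
    """Stand-in for a dependency that is currently unreachable."""
    raise ConnectionError("recommendation service unreachable")
```

With `broken_service` as the dependency, the fail-fast variant raises `ConnectionError` to its caller, while the degraded variant returns the fallback list, so the rest of the request can continue.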
There are two fundamental approaches to supporting graceful degradation: code-based (application) and infrastructure-based. Each approach has benefits and drawbacks, and either can be appropriate depending on the situation. This module explains how to implement both code-based and infrastructure-based resiliency.
Code-based resiliency
To implement code-based resiliency, .NET has an extension library for resilience and transient failure handling, Microsoft.Extensions.Http.Resilience.
It uses a fluent, easy-to-understand syntax to build failure-handling code in a thread-safe manner. Several resilience strategies define failure-handling behavior. In this module, you apply the Retry and Circuit Breaker strategies to HTTP client operations.
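To make the two strategies concrete before applying them, here is a conceptual sketch of what Retry and Circuit Breaker do. It is written in Python for brevity and is not the Microsoft.Extensions.Http.Resilience API. Every name in it (`retry`, `CircuitBreaker`, `failure_threshold`, `break_duration`) is an assumption made for illustration, and it treats only `ConnectionError` as a transient failure.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the circuit breaker rejects a call without trying it."""


def retry(operation: Callable[[], T], max_attempts: int = 3,
          base_delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a transiently failing operation with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** (attempt - 1))  # Delay doubles each time.
    raise ValueError("max_attempts must be at least 1")


class CircuitBreaker:
    """Stops calling a failing dependency until a break duration has passed."""

    def __init__(self, failure_threshold: int = 3, break_duration: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.break_duration = break_duration
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed.

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.break_duration:
                raise CircuitOpenError("circuit is open; call skipped")
            # Break duration elapsed: allow a trial call (half-open state).
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # Open (or re-open) the circuit.
            raise
        self.failures = 0
        self.opened_at = None  # A success closes the circuit.
        return result
```

Retry addresses short-lived faults by trying again after a growing delay. The circuit breaker addresses longer outages by skipping calls entirely so that a struggling dependency gets time to recover. The `sleep` and `clock` parameters exist only so the sketch can be exercised without real waiting.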