Self-healing is an important concern when developing an application. For example, if a downstream service is unavailable, how should the app handle it? Will it keep retrying against a service that is already down, or will it recognize the situation and stop hammering the failing service? A failure in one subsystem can also cascade: for example, a thread or socket that is not freed in a timely manner can starve other operations and lead to cascading failures. All too often, the success path is well tested but the failure path is not. There are many patterns for handling these failures; here are a few must-have patterns for handling these situations gracefully.
Pattern | Premise | Aka | How does the pattern mitigate? |
---|---|---|---|
Retry failed operations with a retry strategy (Retry) | Many faults are transient and may self-correct after a short delay. Retry with a strategy that increases the delay between attempts. | “Maybe it’s just a blip” | Allows configuring automatic retries. |
Protect a failing app with a circuit breaker (Circuit Breaker) | When a system is seriously struggling, failing fast is better than making users/callers wait. Protecting a faulting system from overload can help it recover. | “Stop doing it if it hurts” “Give that system a break” | Breaks the circuit (blocks executions) for a period when faults exceed a pre-configured threshold. |
Improve the caller experience with a timeout (Timeout) | Beyond a certain wait, a successful result is unlikely. | “Don’t wait forever” | Guarantees the caller won’t have to wait beyond the timeout. |
Isolate critical resources (Bulkhead Isolation) | A ship should not sink because of a hole in one place. When a process faults, multiple failing calls can stack up (if unbounded) and easily swamp resources (threads, CPU, memory) in a host. This can affect performance more widely by starving other operations of resources, bringing down the host, or causing cascading failures upstream. | “One fault shouldn’t sink the whole ship” | Constrains the governed actions to a fixed-size resource pool, isolating their potential to affect others. |
Throttle clients with a rate limit (Rate Limit) | Limiting the rate at which a system handles requests is another way to control load. This can apply to the way your system accepts incoming calls, to the way you call downstream services, or both. | “Slow down a bit, will you?” | Constrains executions to not exceed a certain rate. |
Fallback | Things will still fail; plan what you will do when that happens. | “Degrade gracefully” | Defines an alternative value to be returned (or action to be executed) on failure. |
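
To make the table more concrete, here is a minimal Python sketch of three of the patterns (retry with increasing delay, circuit breaker, and fallback) composed around a single call. It is an illustration only, not a production library: the names `CircuitBreaker`, `CircuitOpenError`, `retry`, `with_fallback`, and `fetch_profile` are hypothetical, as are the thresholds and delays. Timeout, bulkhead isolation, and rate limiting are not shown.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Blocks calls for reset_timeout seconds after failure_threshold consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Break period elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying on failure with a delay that doubles after each attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # no point retrying while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def with_fallback(func, fallback_value):
    """Return func(), or fallback_value if func raises."""
    try:
        return func()
    except Exception:
        return fallback_value


# Hypothetical downstream call that is currently failing.
def fetch_profile():
    raise ConnectionError("profile service unavailable")


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
profile = with_fallback(
    lambda: retry(lambda: breaker.call(fetch_profile), attempts=3, base_delay=0.01),
    fallback_value={"name": "guest"},
)
```

The ordering is a design choice in this sketch: the breaker sits innermost so every attempt counts toward its threshold, retry wraps it so transient blips are absorbed, and the fallback sits outermost so the caller gets a degraded result once retries are exhausted or the circuit is open. While the circuit is open, later calls skip the failing service entirely and go straight to the fallback.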