Patterns for designing self-healing apps

Self-healing is an important concern when developing an application. For example, if a downstream service is unavailable, how should the app handle the situation? Will it keep retrying a service that is already down, or will it recognize the situation and stop hammering the failing service? A failure in one subsystem can also cascade: for example, a thread or socket that is not freed in a timely manner can lead to cascading failures. All too often, the success path is well tested but the failure path is not. There are many patterns for handling these failures; here are a few must-have patterns for handling these situations gracefully.

Pattern | Premise | Aka | How does the pattern mitigate?
Retry failed operations with a retry strategy (Retry pattern)
Premise: Many faults are transient and may self-correct after a short delay. Use a retry strategy that increases the delay between attempts.
Aka: “Maybe it’s just a blip”
How it mitigates: Allows configuring automatic retries.
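
As an illustration of retrying with increasing delay, here is a minimal Python sketch. The function name `retry`, its parameters, and the choice of doubling the delay after each failure are assumptions made for this example, not part of any particular library.

```python
import time


def retry(operation, max_attempts=4, base_delay=0.1, factor=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient failures with increasing delay.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)     # wait before the next attempt
            delay *= factor  # increase the delay for the following retry
```

Only the exception types listed in `retry_on` are retried, so errors that are unlikely to be transient (for example a `ValueError`) propagate immediately. In practice a random jitter is often added to the delay so that many clients do not retry in lockstep.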
Protecting a failing app with a circuit breaker (Circuit Breaker)
Premise: When a system is seriously struggling, failing fast is better than making users/callers wait. Protecting a faulting system from overload can help it recover.
Aka: “Stop doing it if it hurts”; “Give that system a break”
How it mitigates: Breaks the circuit (blocks executions) for a period when faults exceed a pre-configured threshold.
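
The behaviour described above could be sketched roughly as follows. This is a simplified, single-threaded illustration: the class names, the consecutive-failure threshold, and the single trial call after the break period are all assumptions made for the example, and a production breaker would also need locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # how long the circuit stays open
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast without touching the struggling system.
                raise CircuitOpenError("circuit open; failing fast")
            # Break period is over: let this call through as a trial.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            trial_failed = self.opened_at is not None
            if trial_failed or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: reset the failure count and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a service that is unlikely to answer, and the failing service gets time to recover.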
Better caller experience with a timeout (Timeout)
Premise: Beyond a certain wait, a success result is unlikely.
Aka: “Don’t wait forever”
How it mitigates: Guarantees the caller won’t have to wait beyond the timeout.
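
One way to bound the caller's wait in Python is to run the operation in a worker thread and stop waiting after a deadline. The helper name `call_with_timeout` is an assumption for this sketch. Note the limitation: the caller is released on timeout, but the abandoned operation keeps running in the background until it finishes by itself.

```python
import concurrent.futures
import time


def call_with_timeout(operation, timeout_seconds):
    """Run operation() in a worker thread; wait at most timeout_seconds.

    Raises concurrent.futures.TimeoutError if the result is not ready in time.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    finally:
        # Do not block on the worker: the caller is released even if the
        # operation itself is still running.
        executor.shutdown(wait=False)
```

For network calls it is usually better to also pass a timeout to the client library itself, so that the underlying socket is released rather than just abandoned.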
Isolate critical resources (Bulkhead Isolation)
Premise: A ship should not sink because there is a hole in one place. When a process faults, multiple failing calls can stack up (if unbounded) and easily swamp resources (threads, CPU, memory) on a host. This can affect performance more widely by starving other operations of resources, bringing down the host, or causing cascading failures upstream.
Aka: “One fault shouldn’t sink the whole ship”
How it mitigates: Constrains the governed actions to a fixed-size resource pool, isolating their potential to affect others.
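
A fixed-size pool can be approximated with a semaphore that caps how many calls run at once. The sketch below is an assumption-laden illustration (the class names and the choice to reject rather than queue excess calls are made up for the example); each dependency would get its own `Bulkhead` instance so that one slow dependency cannot consume every thread.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free slot for another call."""


class Bulkhead:
    """Hypothetical bulkhead: at most max_concurrent calls run at once."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately instead of queueing, so callers never pile up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in this bulkhead")
        try:
            return operation()
        finally:
            self._slots.release()  # always hand the slot back
```

Rejecting when full is one design choice; a bounded queue in front of the pool is another, as long as the queue itself has a hard limit.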
Throttle clients with Rate Limit (Rate-limit)
Premise: Limiting the rate at which a system handles requests is another way to control load. This can apply to the way your system accepts incoming calls and/or to the way you call downstream services.
Aka: “Slow down a bit, will you?”
How it mitigates: Constrains executions to not exceed a certain rate.
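
There are several ways to enforce a rate; a token bucket is one common choice and is used here purely as an example. The class name `TokenBucket` and its parameters are assumptions for this sketch, which is single-threaded and would need a lock if shared between threads.

```python
import time


class TokenBucket:
    """Hypothetical limiter: rate_per_second on average, bursts up to capacity."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def try_acquire(self):
        """Return True if the call may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The same limiter can guard incoming requests (reject or delay when `try_acquire()` returns False) or wrap outgoing calls to a downstream service.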
Fallback
Premise: Things will still fail – plan what you will do when that happens.
Aka: “Degrade gracefully”
How it mitigates: Defines an alternative value to be returned (or action to be executed) on failure.
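
A fallback can be as small as a wrapper that substitutes a default when the primary call fails. The helper name `with_fallback` and its parameters are hypothetical, chosen only to illustrate the idea.

```python
def with_fallback(operation, fallback, handled=(Exception,)):
    """Return operation(); on a handled failure, return the fallback instead.

    fallback may be a plain value, or a callable that receives the error
    and returns the substitute value (or performs an alternative action).
    """
    try:
        return operation()
    except handled as error:
        return fallback(error) if callable(fallback) else fallback
```

For example, `with_fallback(fetch_recommendations, [])` would show an empty list instead of an error page, where `fetch_recommendations` stands in for any call that may fail. Narrowing `handled` to the failures you actually expect avoids hiding programming errors behind the fallback.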