Embrace Failure: The Reactive Principles, Explained
Expect things to go wrong and design for resilience
Much like in our own lives, it pays to have a backup plan in place in case things don't go as planned. In this look at the third principle of The Reactive Principles: Embrace Failure, our guest author Dr. Roland Kuhn examines different strategies for handling failure and explains why the "let it crash" concept makes sense in Reactive systems.
In this explanatory series, we look at The Reactive Principles in detail and go deeper into their meaning. You can refer to the original document for full context beyond the excerpts and discussion presented here.
Two levels of handling failure
Reactive applications consider failure as an expected condition that will eventually occur. Therefore, failure must be explicitly represented and handled at some level, for example in the infrastructure, by a supervisor component, or within the component itself (by using internal redundancy).
Failures come in many shapes and sizes; how do we tackle them?
The quote from the principles mentions three places where failures may be handled, but at its core the first choice is this: should the failure be handled internally or externally?
This choice comes with strings attached. For example, a component can only handle failures that it can expect to control — it wouldn’t do much good to try and handle the fact that my own program code has been corrupted, because in this scenario the failure handler may well have been affected as well. Another consideration is that external handling is only possible if the failure can be observed by the outsider that is supposed to deal with the situation.
This leads us to the conclusion that we need to establish a suitable contract between the thing that can fail and the party that is responsible for handling this failure: the fallible component needs to make failure observable to outsiders, across both space and time.
The handler, on the other hand, needs to be isolated in such a fashion that the failure does not incapacitate it; in other words, it needs to be autonomous. This principle may need to be nested to cover all intended cases.
Among the handlers for a failure there should be one that is special: the supervisor. It is responsible for shielding the other handlers from further failures; its job is to get the failed component up and running again. Supervision can take many shapes and forms:
It can be applied at the granularity of a function call in source code by catching exceptions or acting upon a failure result value, possibly retrying the operation or deliberately yielding a subpar result in the spirit of graceful degradation. Note that in this case all requests and responses must flow through the supervision layer, which is not the case in the examples below.
It can also mean switching to a backup database in the data access layer when the current connection pool fails with non-transient errors. Here, clients will observe failures until the switch has been successful.
It can even mean fail-over mechanisms between whole data centers based on reachability metrics and failure rates — this will typically be done by a different service or application altogether that uses infrastructure services to monitor the health of a deployment.
The common pattern is that the supervisor will need to have a recourse to fall back to; it will need to have a plan B. This plan B is redundant as long as no failure occurs, which is what is meant by “internal redundancy” in the quoted text above.
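The function-call-level variant of this pattern can be sketched in a few lines. This is a minimal illustration with invented names, not an API from any library: the supervision layer retries the operation a bounded number of times and then deliberately yields a degraded-but-valid fallback value (its plan B) instead of propagating the failure.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical sketch of function-level supervision: retry the primary
// operation a bounded number of times, then yield a degraded-but-valid
// fallback value instead of propagating the failure.
def supervise[A](primary: () => A, fallback: A, retries: Int = 1): A =
  Try(primary()) match {
    case Success(value)            => value
    case Failure(_) if retries > 0 => supervise(primary, fallback, retries - 1)
    case Failure(_)                => fallback // plan B: graceful degradation
  }
```

Note that all requests flow through `supervise`, which is exactly the property that distinguishes this granularity from the database and data-center examples above.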
To recap, the failing component cannot do more than make the failure actionable by someone else so that the designated supervisor can fix the situation — which may ultimately require human intervention. How to do this has a lot to do with how our components react to crashes and subsequent restarts.
Let it crash
Where possible, this can also be used to implement self-healing capabilities although this cannot be done in a generic fashion apart from the let it crash approach of killing and restarting the component […].
Is there a magical silver bullet for handling failure?
This sounds almost too good to be true: self-healing software components! Those of you who have tried to write a program such that it automatically recovers from a broad range of potential problems will know the pain this inflicts on the codebase and on everyone who has to work on it. The reason is that the sources and mechanisms of failures are plentiful and tend to cut across the whole codebase, intertwined in unwholesome ways.
The “let it crash” technique is the only generic way to cut this Gordian knot, or rather to keep it from forming in the first place. The trick here is not the crashing part; the trick lies in writing the logic so that it can successfully restart afterward. This way, instead of writing — and testing! — a thousand combinations of failures and their respective local mitigations in various places, we only need to write an initialisation procedure for each component that can deal with possibly corrupt or intermediate data.
Put another way, it is much cleaner to just give up and start over than to try and fix a still half-running system. The probability of success is directly related to the frequency with which the corresponding code is executed: initialisation needs to happen for every successful operational run and will therefore mostly work, while some rare interplay of two failures stands only a remote chance of being covered by the test suite.
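The shape of this restart loop can be sketched as follows (all names are hypothetical): the supervisor never tries to repair a half-broken component; on failure it discards the instance and runs the one well-exercised initialisation procedure again.

```scala
// All names here are hypothetical. The supervisor never repairs a
// half-broken component: on failure it discards the instance and runs
// the single, well-exercised initialisation procedure again.
final case class Component(state: Map[String, Int])

// Must cope with whatever a crash left behind (corrupt or partial data).
def initialise(): Component = Component(Map.empty)

def runSupervised(work: Component => Component, maxRestarts: Int): Component = {
  def loop(c: Component, restartsLeft: Int): Component =
    try work(c)
    catch {
      case _: Exception if restartsLeft > 0 =>
        loop(initialise(), restartsLeft - 1) // let it crash, start afresh
    }
  loop(initialise(), maxRestarts)
}
```

The only failure-handling code we test here is `initialise()` and the bounded restart loop, instead of a mitigation path per failure combination.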
Please note that the more advanced fail-over scenarios discussed in the previous section are not generic; they are not implementations of the “let it crash” technique. With this technique we merely give the supervisor an additional choice of recourse: plan B might be to just tear down the failed instance and start a fresh one. If this does not work then we’ll still need a plan C (like the ones discussed in the previous section).
There is one very common supervisor who will by default and without hesitation try the “let it crash” approach: the human end-user of the systems and apps we build has learned to just restart the app or reboot the device in case of unexpected behavior. This is why we should employ the “let it crash” technique especially for those parts of our software that are within the grasp of human operators.
The “let it crash” technique benefits from another reactive tenet: I think it is no coincidence that message-driven design is especially well-suited for taking up work right where the program left off before the crash. Message queues are conceptually simple enough that these infrastructure components can be made very reliable. The only issue to watch out for is that an incoming message may trigger the same failure every single time, so the best and simplest approach is to discard the message that was being processed when the failure happened.
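The discard rule can be made concrete with a small sketch (the queue and message shapes are assumed for illustration): if handling a message crashes the component, that message is dropped on restart and processing continues with the rest of the queue, so a single poison message cannot crash the component forever.

```scala
import scala.collection.immutable.Queue

// Sketch with assumed shapes: a message that triggers a failure is
// discarded rather than retried, so it cannot crash the component on
// every restart. Discarded messages are returned for inspection.
def processAll(msgs: Queue[String], handle: String => Unit): Vector[String] = {
  var remaining = msgs
  var discarded = Vector.empty[String]
  while (remaining.nonEmpty) {
    val (msg, rest) = remaining.dequeue
    try handle(msg)
    catch { case _: Exception => discarded :+= msg } // drop, do not retry
    remaining = rest
  }
  discarded // surfaced so a (possibly human) supervisor can look at them
}
```

Returning the discarded messages keeps the failure observable, in line with the contract discussed earlier.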
Local failure handling strategies
While these are powerful capabilities, employing them in a non-reactive context (such as within the non-distributed implementation of a single component) is usually more work than using traditional mechanisms like exceptions.
Which strategy fits best into my language and ecosystem of choice?
The last part of this post zooms in, leaving the bird's-eye view of components and nondescript supervisors and focusing on how to actually cast this in code.
Here it is important to note that different languages and ecosystems have found a variety of approaches and that there is no one size that fits all. Trying to port the approach from Haskell to JS will typically lead only to pain and suffering (meaning runtime overhead as well as unergonomic or error-prone syntax), and Rust is yet another beast, just to name a few examples.
While considering the particular ecosystem’s choices we will need to watch out for how these restrict our mapping from (supervised) components to code modules and deployment units. For example, in C it is not uncommon for a library to call abort() when it encounters a fatal condition, which makes it impossible to place the supervisor into the same process as the component that uses this library (a shout-out to Unix processes for supervision: kernel-enforced isolation is a great idea!).
Many languages have adopted exceptions as a means to communicate failures as well as error conditions — basically anything that “does not compute” is represented as an exception. Disentangling errors from failures takes some discipline in these cases, but it certainly is possible. For example, a database connection failure can be fixed one layer up by catching the exception and retrying, possibly using the backup database instance.
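As a sketch of that "one layer up" fix, here is a small example with invented names: a connection failure is caught by the caller and the query is retried against a backup instance, while ordinary errors (say, a missing key) pass through untouched, keeping failures and errors disentangled.

```scala
// Invented names for illustration: a connection failure is caught one
// layer up and the query is retried against a backup instance, while
// ordinary errors (e.g. a missing key) propagate untouched.
final class ConnectionFailure(msg: String) extends RuntimeException(msg)

def queryWithFailover(primary: String => Int, backup: String => Int)(key: String): Int =
  try primary(key)
  catch { case _: ConnectionFailure => backup(key) } // retry on the backup
```

The discipline lies in catching only `ConnectionFailure` here, not `Exception`: a business-level error must not be silently "fixed" by a failover.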
One noteworthy correlation is that those languages that use exceptions often make it more costly at runtime to return more than a single type of value from a function call, which is why exceptions are the most performant choice for signaling the unlikely failure case (C++ and C# being exceptions to this rule).
Other languages like Rust denote fallible operations using a Result<T,E> type that contains either the computed value or the error that prevented a value from being computed. The machine-level encoding works without heap allocations and is optimised so that this approach can be used pervasively — it does require that each function declare its fallibility in the type signature, though. The language goes to great lengths to minimise syntactic overhead, making this error handling approach the designated and idiomatic one.
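Scala's Either[E, A] plays an analogous role to Rust's Result<T, E>: fallibility appears in the type signature, and the caller must address both cases. A small assumed example:

```scala
// Either[E, A] as the Scala analogue of Rust's Result<T, E>: the
// signature declares fallibility, and callers must handle both cases.
def parsePort(s: String): Either[String, Int] =
  s.toIntOption match {
    case Some(p) if p >= 0 && p <= 65535 => Right(p)
    case Some(p)                         => Left(s"port out of range: $p")
    case None                            => Left(s"not a number: $s")
  }
```

Unlike an unchecked exception, the `Left` case cannot be forgotten by accident; ignoring it is a visible decision in the calling code.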
The salient point here is that using failure handling constructs according to reactive architecture principles matters much more than how each language models the values that represent failure. Failure handling goes beyond — and is more work than — throwing and catching exceptions or using a Result type in function signatures. We need to embrace failure before writing the first line of code.