Accept Uncertainty: The Reactive Principles, Explained
Build reliability despite unreliable foundations
The Taoist-sounding phrase above describes the main concept of the second principle of The Reactive Principles: Accept Uncertainty. What does it mean? Perhaps it’s most directly related to understanding the limitations and capabilities of your system’s foundation before asking unreasonable things from it.
In this explanatory series, we look at The Reactive Principles in detail and dig deeper into their meaning. You can refer to the original document for full context beyond these excerpts and the additional discussion here.
The Scary World of Distributed Systems
"As soon as we cross the boundary of the local machine, or of the container, we enter a vast and endless ocean of nondeterminism: the world of distributed systems. It is a scary world in which systems can fail in the most spectacular and intricate ways, where information becomes lost, reordered, and corrupted, and where failure detection is a guessing game. It’s a world of uncertainty."
'Non-determinism' sounds a bit philosophical. How can we connect this concept to real life?
To carry on with the philosophical, life really is non-deterministic, or unpredictable, in nature. We don’t know what will happen next; we don’t know if it will rain tomorrow. We can make some good assumptions, but until we get there, we do not know the outcome.
In traditional computing, we have been able to assume determinism because a single, contained system is simple and can know everything about its own state.
For example, if you place yourself in a room with no doors and a few static items, it’s easy to determine what will happen within that room. Once we leave the confines of that room, or add a few doorways, determining what is going to happen next becomes a lot harder. This is similar to the way we add systems to a network of devices.
Trade-offs: Strong Consistency vs. Eventual Consistency vs. ...
"Even though there are well established distributed algorithms to tame this uncertainty and produce a strongly consistent view of the world, those algorithms tend to exhibit poor performance and scalability characteristics and imply unavailability during network partitions. As a result, for distributed systems, we have had to give up most of them as a necessary tradeoff to achieve responsiveness, moving us to agree to a significantly lesser degree of consistency, such as causal, eventual, and others, and accept the higher level of uncertainty that comes with them."
Why do many distributed algorithms tend to suffer from poor performance and scalability?
We have traditionally worked with data in single-node environments, and our normal assumption is that we read and write data as required, in real time. When we move to a distributed system, many want that same easy approach of reading and writing data in a nice, orderly, consistent fashion. But when we try to replicate this approach in a distributed system, the problems multiply.
We start facing problems like machines with different clocks, network unreliability, network and machine crashes, the increasing complexity of synchronization, even solar flares! When we start building distributed systems, we tend to try to fit everything into our single-computer mentality, jumping through hoops to make this a reality. This denial of the truths of distributed systems often leads to stability issues and performance problems.
As developers, we have the ability to think about our data differently. The fewer constraints we put on our data, the easier it is to scale reliably. We need to manage our data so that its order doesn’t matter. We need to accept that eventual consistency can be a better way to model our domain. We need to embrace the fact that in a distributed system we must change our expectations and tools to match that environment.
Methods for Managing Time
"This has a lot of implications: we can’t always trust time as measured by clocks and timestamps, or order (causality might not even exist). Accepting this uncertainty, we have to use strategies to cope with it. For example: rely on logical clocks (such as vector clocks); when appropriate use eventual consistency (e.g. certain NoSQL databases and CRDTs); and make sure our communication protocols are associative (batch-insensitive), commutative (order-insensitive), and idempotent (duplication-insensitive)."
Hey, a lot of words and terms! Are these methods and paradigms ideal?
None of these are a silver bullet, really. They are more of a way of dealing with the knowledge that there is no perfect solution. To determine which of these tools to use, we first need to understand our system and then choose the method to best meet our needs.
If there is a possibility that we could receive the same request twice (very common), we make our system idempotent by ensuring that processing the same message twice has the same effect as processing it once.
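A minimal sketch of this idea in Python, assuming each request carries a unique ID (the `PaymentProcessor` class and its names are illustrative, not from the original):

```python
# Idempotent message handling: remember which request IDs we have
# already processed, so a redelivered message has no additional effect.
class PaymentProcessor:
    def __init__(self):
        self.processed_ids = set()  # in production: durable storage
        self.balance = 0

    def handle(self, request_id, amount):
        # Processing a duplicate yields the same result as the first time.
        if request_id in self.processed_ids:
            return self.balance  # already applied; change nothing
        self.processed_ids.add(request_id)
        self.balance += amount
        return self.balance

p = PaymentProcessor()
p.handle("req-1", 100)
p.handle("req-1", 100)  # duplicate delivery: no double charge
print(p.balance)        # 100
```

The key design choice is that the deduplication check and the state change belong together; in a real system both would live in the same transaction or durable store.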
If a timestamp is required, we could use a logical clock. This is useful compared to “wall clock” time, which is very unreliable in distributed systems: it’s hard to synchronize clocks accurately across machines, and latency also gets in the way. Logical clocks give you cause and effect (causality), which is most often what you actually need, and for that purpose they are more accurate. If information must be accurate but can be a few seconds behind, we can utilize eventual consistency.
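To make the logical-clock idea concrete, here is a minimal sketch of a Lamport clock, the simplest form of logical clock (vector clocks extend the same idea to per-node counters); the class and method names are illustrative:

```python
# A Lamport logical clock orders events by causality (happened-before),
# not by wall-clock time.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock.
        self.time += 1
        return self.time

    def send(self):
        # Attach the current logical time to an outgoing message.
        return self.tick()

    def receive(self, message_time):
        # The receive happens after both local and remote history,
        # so jump past whichever clock is further ahead.
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.send()       # a's send event
t2 = b.receive(t1)  # b's clock jumps past a's
print(t1 < t2)      # True: the send is causally before the receive
```

Note what this buys us: if event X causally precedes event Y, X’s timestamp is guaranteed to be smaller, no matter what the machines’ physical clocks say.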
These are all just tools for certain jobs; knowing the job and picking the right tool helps us account for the fact that there is no now. In a distributed system, “now” is an illusion: each system, each entity, and each one of us has our own view of time that cannot be synchronized, due to the limitations of the physical world.
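The three protocol properties from the excerpt above, associative, commutative, and idempotent, come together in CRDTs. A minimal sketch of the simplest one, a grow-only counter (the `GCounter` class below is illustrative):

```python
# A grow-only counter (G-Counter), a simple CRDT. Its merge takes the
# per-node maximum, which is associative (batch-insensitive),
# commutative (order-insensitive), and idempotent
# (duplication-insensitive), so replicas converge regardless of how
# updates are batched, reordered, or redelivered.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # per-node increment totals

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Per-node max: safe to repeat, reorder, or regroup.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
a.merge(b)        # duplicated merge: idempotent, value unchanged
print(a.value())  # 5
```

Because the merge has these three properties, the counter converges to the same value on every replica without any coordination, which is exactly the kind of “order doesn’t matter” data management discussed earlier.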