II. Accept Uncertainty
Build reliability despite unreliable foundations
As soon as we cross the boundary of the local machine, or of the container, we enter a vast and endless ocean of nondeterminism: the world of distributed systems. It is a scary world in which systems can fail in the most spectacular and intricate ways, where information becomes lost, reordered, and corrupted, and where failure detection is a guessing game. It’s a world of uncertainty.
Most importantly, there is no “now”. The present is relative and subjective, framed by the viewpoint of the observer. The fundamental problem of this world, due to the lack of consistent and reliable shared memory, is the inability to know what is happening on another node right now. We must acknowledge that we cannot wait indefinitely for the information we need to make a decision. As a consequence, our algorithms will lack information due to faulty hardware, unreliable networks, or the plain physical problem of communication latency. Data is out of date by the time it’s acknowledged, and we are forced to deal with a fragmented and unevenly outdated view of the world.
Even though there are well-established distributed algorithms that tame this uncertainty and produce a strongly consistent view of the world, those algorithms tend to exhibit poor performance and scalability characteristics, and they imply unavailability during network partitions. As a result, in distributed systems we often have to give up on most of them as a necessary tradeoff for responsiveness, settling for significantly weaker consistency models, such as causal or eventual consistency, and accepting the higher level of uncertainty that comes with them.
This has many implications: we can’t always trust time as measured by clocks and timestamps, or order (causality might not even exist). Accepting this uncertainty, we have to use strategies to cope with it. For example: rely on logical clocks (such as vector clocks); use eventual consistency when appropriate (e.g., certain NoSQL databases and CRDTs); and make sure our communication protocols are associative (batch-insensitive), commutative (order-insensitive), and idempotent (duplication-insensitive).
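To make those last properties concrete, here is a minimal sketch of a state-based CRDT, a grow-only counter (G-Counter), written in Scala; the class and demo names are illustrative and not taken from any particular library. Its merge function takes the per-node maximum, which makes replication associative, commutative, and idempotent: replicas converge to the same value no matter how updates are batched, reordered, or duplicated in transit.

```scala
// A grow-only counter: one monotonically increasing count per node (replica id).
final case class GCounter(counts: Map[String, Long] = Map.empty) {

  // Record an increment on behalf of the given node.
  def increment(nodeId: String, delta: Long = 1): GCounter =
    copy(counts = counts.updated(nodeId, counts.getOrElse(nodeId, 0L) + delta))

  // The observed value is the sum of all per-node counts.
  def value: Long = counts.values.sum

  // Merge takes the per-node maximum, which is associative,
  // commutative, and idempotent.
  def merge(that: GCounter): GCounter =
    GCounter(
      (counts.keySet ++ that.counts.keySet).map { id =>
        id -> math.max(counts.getOrElse(id, 0L), that.counts.getOrElse(id, 0L))
      }.toMap
    )
}

object GCounterDemo extends App {
  val a = GCounter().increment("node-a").increment("node-a")
  val b = GCounter().increment("node-b")

  // Order, grouping, and duplication of merges do not change the result.
  assert(a.merge(b) == b.merge(a))          // commutative
  assert(a.merge(b).merge(b) == a.merge(b)) // idempotent
  assert(a.merge(b).value == 3)
}
```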
The key is to manage uncertainty directly in the application architecture: design resilient, autonomous components that publish their protocols to the world, protocols that clearly define what they can promise, which commands and events they will accept, and, as a result, what behavior they will trigger and how their data model should be used. The timeliness and assessed accuracy of the underlying information should be visible to other components where appropriate, so that they, or the end user, can judge the reliability of the current system state.
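As an illustration, here is a minimal sketch, with hypothetical names rather than a prescribed API, of a component publishing its protocol: the commands it accepts, the events it emits, and replies that carry the timeliness of the underlying information alongside the data itself.

```scala
import java.time.Instant

object InventoryProtocol {

  // Commands the component promises to accept.
  sealed trait Command
  final case class ReserveStock(sku: String, quantity: Int) extends Command
  final case class QueryStock(sku: String)                  extends Command

  // Events the component may emit as a result of handling commands.
  sealed trait Event
  final case class StockReserved(sku: String, quantity: Int, at: Instant) extends Event
  final case class ReservationRejected(sku: String, reason: String)       extends Event

  // Replies expose how fresh the information is, rather than
  // pretending to report the state of the system "now".
  final case class StockView(
    sku: String,
    available: Int,
    asOf: Instant,          // when this information was last known to be true
    possiblyStale: Boolean  // e.g. set while a partition or replication lag is detected
  )
}
```

A caller receiving a StockView can then decide whether the asOf timestamp is recent enough for its purpose, or surface the possible staleness to the end user.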