On Embracing Error in Distributed Software Systems
This article is the companion to my article On Eliminating Error in Distributed Software Systems and expands on my talk What Lies Between: The Challenges of Operationalising Microservices from QCon London 2019.
In an earlier article, I explored techniques for eliminating error in distributed software systems, including testing, the type system, functional programming, and formal verification. While these techniques are effective for addressing specific categories of error, no one technique is sufficient for eliminating error. Even when used in combination, these techniques are not jointly sufficient, because their guarantees do not compose. Guarantees do not necessarily compose into systems. There are at least three things these techniques do not address: embracing failure as part of the system, rather than something to be eliminated; handling system dynamics at run-time; and addressing the human element of distributed software systems. These topics are the subject of this article.
Failure is intrinsic to distributed software systems. A failure can be as as simple as the absence of a message. Therefore, instead of focusing on failure as something to be eliminated, we need a run-time that is complimentary and embraces failure as part of the system.
Some days I believe that Erlang is the only language that has a good story on error handling, and everyone else is living in denial.
Erlang is a whole ecosystem, not just a programming language. It has a virtual machine—the BEAM—that has very unique properties for scaling, fault-tolerance, and isolation, and it has a collection of middleware—the Open Telecom Platform (OTP)—with tools for concurrency, distributed computing, and resilience. As much as anything, Erlang is a set of principles and patterns for developing and operating reliable and resilient systems, with great performance, at huge scale. Erlang embraces failure more systematically than checking return values, or catching exceptions, and encourages a different mindset: at scale, failure is normal, so just design for it.
Using OTP helps the developer avoid accidental complexities: things that are difficult because they picked inadequate tools.
Much of my experience with this approach comes from using Akka, an actor-model toolkit for distributed computing that was inspired by Erlang OTP. Actors provide a powerful way to model entities and help people think about complex systems. For example, an actor can have a short lifetime and be used to represent the sending of a single SMS message, or an actor can be long-running and represent the state of a single IoT device through a “digital-twin” expressed in software. The programmer only needs to worry about sending the single SMS message, or modelling the state of the individual IoT device, and the run-time handles scaling to millions across collaborating servers. If one entity is in a failed state, the other entities operate normally, because they are independent. The failure of an individual entity should never impact the whole system.
In addition to providing isolation and being a unit of concurrency, actors are great for managing state, executing state machines, and handling fine-grained failures, like modelling the fact that a single IoT device is offline and services cannot communicate with it. An actor can naturally model error states alongside operational states. For example, if an actor sends a command to an IoT device but the device fails to acknowledge it, the actor can model this fact, representing the uncertainty, and it can attempt to send the command again, up to a maximum number of retries, or a timeout interval expires.
Actors themselves can be arranged in hierarchies allowing higher-level actors to handle the failures of child actors. Supervising actors often have context that child actors do not have and can handle errors more systematically. It is often more reliable to simply restart an actor after a failure, rather than encoding complex logic for error handling and recovery. Recovery can also be composed with mechanisms for circuit breaking, bulkheading, and backoff-and-retry error-handling for the efficient use of resources and to avoid unnecessary load on dependent services that may be temporarily overloaded or unavailable. Where a service proxy can provide these mechanisms at a service level—at least in services that use request-response messaging—actor systems can provide these mechanisms, independently, for individual actors. The actor system can even be used to intelligently shed load under certain conditions, by terminating whole branches of child actors.
While actors are great for handling fine-grained failures, cloud-native infrastructures are very complimentary for handing coarse-grained failures. For example, consider a container running on Kubernetes. Within the container there is a run-time managing many actors. If the container exits due to an unhanded exception, Kubernetes will restart it, with an exponential backoff. The same is true if Kubernetes determines a container is running but is unresponsive, after failing its liveness probe. At an even more coarse-grained level, we no longer have to worry about which server a pod is running on, and if there is a failure of the underlying server, Kubernetes will handle restarting the impacted pods on another server. Kubernetes also provides its own mechanisms for service isolation, and for scaling up or down. This model of having an infrastructure that can handle coarse-grained failures, paired with a run-time that can model fine-grained entities and handle fine-grained failures, is a powerful one. It is the direction that stateful serverless appears to be heading.
End-to-end correctness, consistency, and safety mean different things for different services, is completely dependent on the use-case, and can’t be outsourced completely to the infrastructure.
Relying on the run-time or the infrastructure for managing failure is not enough. Especially in operational technologies, failure needs to be embraced all the way into the data model. Consider aggregating the power produced by a collection of wind turbines. Normally the wind turbines report their power output to the server, independently, every few seconds. How should the aggregate power be represented if only 19 of 20 wind turbines have reported within the past ten seconds? Is the missing turbine offline and producing no power, or is it network partitioned and the messages from it are delayed in transit? This uncertainty must be represented in the aggregate so that people, or other systems, can make decisions that incorporate this uncertainty.
A huge amount of discussion centres on data consensus sharing protocols for server (+) promises, but almost nothing is written about the responsibility of the receivers (-) who ultimately shoulder the burden of dealing with inconsistency.
There are typically many observers of data in distributed systems. Due to eventual consistency and disparate time-scales, observers cannot blindly trust the data and must make their own inferences based on dimensions like latency, integrity, quality, and authenticity, in combination with their local context. For instance, an observer may consider probabilities, like combining forecasts for load, demand, and weather, with current conditions and historical observations, when managing electrical generation or energy market participation. Or an observer may consider timescales dictated by physical constraints, like the fact that a holding tank can only empty so fast, or that a battery has a maximum rate at which it can charge, or the fact that an accumulator cannot go backwards.
Finally, embracing failure as something that is expected can influence how systems are architected. For example, most modern airliners are twin-engine jets, rather than four-engine models. One would think that four engines would be preferable, since the failure of one engine means losing only one quarter of the thrust, but twin-engine jets have proved equally safe. They are simpler to operate and less expensive to maintain, since the chance of an engine failure on a four-engine jet is fifty-percent greater than on a two-engine jet. Tying this idea back to software systems, if Kubernetes will handle restarting a failed pod in short order, either due to it exiting or becoming unresponsive, then it may not be worth investing in the complexities of making the service more resilient or highly-available at the software level. In other words, running a single pod may prove more reliable than a poorly implemented fail-over, data-replication, or clustering model, and simply crashing and restarting may prove more resilient than implementing complex error-handling.
The vast majority of the techniques for eliminating errors in software systems are applied before the software is in an operational setting. Examples include linting, type checking, static analysis, unit testing, regression testing, code review, fuzzing, and formal verification. I discussed how run-times and infrastructures are complimentary for accommodating failure, but these tools are also limited in that their primary response is to simply restart the failed entity. Complex software systems, especially at scale, are dynamical systems. It is not enough to examine these systems statically, or in isolation, and binary failure models are insufficient at run-time. To minimize failure and improve performance, the software must measure, model, and adapt to system dynamics.
Consider an HTTP request that fails with an HTTP 404 Not Found error because the resource does not exist. An instant later, the same request may return successfully. Is this an error? In an eventually-consistent system, behaviours like this are expected. The resource may have been in the process of being created, or the second request may have been serviced by a different backend that has a different view of the world. An individual error like this is often masked by retries, timeouts, caches, or fall-backs. On a larger scale, if the same service suddenly started returning HTTP 404 errors for every request, we probably want to consider this failure differently. To decide if an error is indicative of a failure, we need context and we need to consider system dynamics.
In the end, we can't come up with crisp definitions of “error”, because it's too contextual. In some cases it's useful to consider a particular signal an error, in other cases it isn't, and we can't make a classification decision in advance to cover both cases.
Immutability is a core tenet of functional programming and an important element in run-times like Erlang and Akka that use message passing for isolation and concurrency. It is also used to improve the reliability of infrastructures through declarative deployments and desired-state management. Despite its advantages for reducing error, immutability is also an illusion. A data structure, a message, or the code in a deployment may be immutable, but if the system is doing anything useful, it is almost certainly mutating something, like processing and acknowledging messages from a message queue; writing data to a database; executing transactions in a workflow; or sending commands to an industrial device. What looks like immutability on one scale is usually the result of dynamics from mechanisms for synchronization, retry, error-correction, data-replication, or fall-backs on another scale. Even allocating the memory for an immutable message or data structure is subject to the dynamics of the memory manager and the hardware. Stateless approaches that are used to reduce complexity and minimize error provide a similar illusion. Statelessness depends on the granularity at which you view the system.
Statelessness refers to memorylessness in the sense of a Markov process: the dependence on current state is inevitable as long as there is input to a process, but dependence of behaviour on an accumulation of state over many iterations is what people usually mean by stateful behaviour.
In messaging systems that process streaming data, time series, events, commands, or even request-response, handling message spikes is a key problem to address, while remaining responsive and respecting resource constraints. It is common to apply backpressure to the messages from upstream systems. Many systems implement backpressure crudely by polling or blocking, making inefficient use of resources. Reactive Streams is an initiative to implement backpressure more efficiently through non-blocking flow control. The Akka Streams API is one example of a library that implements the Reactive Streams protocol. It allows programs implemented with Akka Streams to bend and stretch with the dynamics of the system—much like fluid flowing through a series of pipes—while the programmer just focuses on functional programming. Akka Streams can be combined with distributed, actor-model programming to build Reactive Systems.
The Akka Streams API provides sophisticated primitives for concurrency, flow control, stream processing, partitioning, error handling, and interfacing with other services—much more sophisticated than APIs dedicated to a single platform or technology that are focused on the mechanics of stream processing, but not the dynamics. While the Akka Streams API is specific to the JVM ecosystem, I expect this type of programming model to be more and more pervasive for handling system dynamics and system integration, like tying together the execution of disparate serverless functions. The stream will even become a sort of call stack for understanding the logical and temporal interactions in these distributed messaging systems.
In messaging systems, some errors can only be observed dynamically. Consider an IoT device that erroneously reports an old message to a server every few seconds. If this message is committed to an idempotent data store, where a message with the same timestamp simply overwrites any previous message at that timestamp, then it is impossible to detect this behaviour once the data are at rest. This situation may not cause any logical problems in the system, but it can waste significant resources and strain systems to the point of failure. The only way to detect it is to inspect the messaging dynamics at run-time.
Empirical methods for mitigating failures are influenced by system dynamics and if they do not explicitly model the dynamics, they can make failures worse, as errors tend to get magnified in strongly connected systems. A system that relies on keep-alive messages may make an erroneous decision if messages are delayed, reordered, or lost. Clustering to prevent failure, providing high-availability through state replication, is subject to split-brain scenarios or situations where a subset of the cluster is overwhelmed after a network partition. Kubernetes liveness and readiness probes are simple tests from the control plane to determine if a container is ready to serve requests, or if it is unresponsive and needs to be restarted, respectively. These simple tests, designed to recover from failure, can lead to all sorts of catastrophic failures if they do not consider system dynamics. In addition, combining liveness and readiness probes with the dynamics from circuit breaking and backoff-and-retry error-handling in the application, and in a service proxy, and from the control plane, can provide a rich source of unexpected dynamics and interactions.
There has been a recent focus on improving our understanding of the operation of distributed software systems through tools for “observability”. I dislike the term observability as it is currently used, disconnected from controllability. For operating distributed software systems at scale, modelling and controlling system dynamics will become more and more important. Designing and operating these systems will look increasingly like traditional process engineering. In a way, controllability can be viewed as another form of supervision, similar to that of actor systems or desired-state management, but a form of supervision that is able to measure and adapt to complex system dynamics, improve performance, and avoid catastrophic failures. Current observability tools focus too much on understanding single threads of execution, or collecting many threads of execution that are later correlated to find meaningful signals. We will move beyond this focus to embrace higher-order signals that allow us to understand the surrounding environment and react to system dynamics. We will move from trying to understand system dynamics, to controlling system dynamics in real-time. In addition, we will move beyond debugging unknown-unknowns or seeing if things are possible, to understanding if our systems are stable.
With so much focus on technologies for eliminating error and handling failure in distributed software systems, it can be easy to lose sight of the fact that people are an integral part of these systems. People, after all, design, develop, and operate these systems, and people interact with the products and services that these systems ultimately provide. This means human factors and sociotechnical considerations must be at the forefront when embracing error in distributed software systems.
The book Accelerate: Building and Scaling High Performing Technology Organizations, by Nicole Forsgren, Jez Humble, and Gene Kim, has a wealth of information on what differentiates high-performing teams. The quality of the work, along with how the team addresses error and failure, is a major factor. The research demonstrated a relationship between software quality and return on investment, customer satisfaction, and employee fulfilment. Teams that focus on quality spend less time on rework or addressing symptoms from poor underlying quality.
I do not believe quality can be negotiated with a team. When the team is responsible for the outcomes, the team must be given the authority to make quality decisions. Furthermore, people will not work below their internal quality-bar for very long without looking for something else to work on. Speed and stability are outcomes that enable each other, yet many organizations put them in conflict.
We felt we owned quality. That helped us do the right things. Too often, quality is overshadowed by the pressure for speed. A courageous and supportive leader is crucial to help teams “slow down to speed up,” providing them with the permission and safety to put quality first (fit for use and purpose) which, in the long run, improves speed, consistency, and capacity while reducing cost, delays, and rework. Best of all, this improves customer satisfaction and trust.
Software failures are often the result of technical debts. For example, a disk becomes full or a certificate expires because it was not monitored, or a service that has been working reliably for many months suddenly fails under increasing load, highlighting the fact that there was never time to test for scalability. Technical debt can also include systems that do not have an owner, or teams that have not established means for working together. Development teams are often pressured to meet deadlines and compromise on quality or completeness. There is always an intention to return to the work after the deadline, but once it is out of sight, it is out of mind, and the organization is on to the next feature or milestone. Because infrastructural software plays a supporting role and is not usually the product itself, it can be very difficult to surface latent quality issues to the larger organization, until something fails catastrophically. Once the issue becomes critical, often there is no quick fix. This can be demoralizing to the development team if the failure pattern is well-known, but they were never given the time to address it. Managers, including product, program, and project managers, and even executives, are understandably focused on feature development, but an effective manager will be equally focused on technical debt, re-architecture, process improvement, security, reliability, resilience, and automation. These are as much a part of the overall product as the next feature milestone.
For a component that appears to be working, a non-programmer will assume the technical debt is minimal, or non-existent.
—The Iceberg Secret Is Just the Tip of the Iceberg
Many organizations operate under the “you-build-it, you-operate-it” model, where development teams are responsible for deploying, monitoring, evaluating, and supporting production services, rather than relying on a separate operations team for these functions. The research in Accelerate showed that a key indicator of success was how often errors occur when changes were made and what the response was when something inevitably went wrong. Is the team quick to revert, fix-forward, patch, or is there a prolonged outage? Modern systems are designed to handle failures, so instead of focusing on the mean-time-to-failure, focus instead on the mean-time-to-recovery. Rather than trying to eliminate failures, acknowledge that failures will happen and be good at responding to them. With that said, these failures must be rare enough that people are not dealing with them all the time, regularly disrupting their work, focus, leisure, or sleep.
How an organization responds to failure is very instructive. Pathological organizations look to find the person or persons “responsible”, holding them “accountable” by and punishing or blaming them. Too many organizations try and fix the people, rather than the work environment. Software systems are complex, and the people and organizations that build and operate them are equally, if not more, complex. Despite the desire for logical explanations of root cause after a failure, this focus can do more harm than good.
Within the bounds of what a person in that part of the system can see and know, the behavior is reasonable. Taking out one individual from a position of bounded rationality and putting in another person is not likely to make much difference. Blaming the individual rarely helps create a more desirable outcome.
—Donella H. Meadows
Accelerate found that team dynamics and psychological safety are the most important aspects in understanding team performance. When a failure occurs, do people criticize, blame, demand, yell, or outright ignore? Or do people drop what they are doing and come together to help, supporting each other, with a bias for action, dividing and conquering, and remaining as calm as possible? After an incident, do people ensure the follow-up items are addressed, or do people allow the underlying issues to fester? One of the reasons I focus so much on team ownership of projects and services, rather than individual ownership, is that it not only leads to better solutions, but it fosters a shared and supportive environment.
At the heart of lean management is giving employees the necessary time and resources to improve their own work. This means creating a work environment that supports experimentation, failure, and learning, and allows employees to make decisions that affect their jobs. This also means creating space for employees to do new, creative, value-add work during the work week.
After a serious incident, an understandable reaction is to require approvals for high-risk changes, but this does not correlate with performance. In fact, the research in Accelerate showed that no approval or peer review achieved the highest performance. Another inclination can be to make high-impact changes during off-peak hours, such as weekends, holidays, or at night. The evidence showed that the highest performing teams made changes during normal business hours. If an off-hours deployment does lead to a failure, not all members of the team, or supporting teams, are available to assess the situation, increasing risk. Outside of regular business hours people may be tired, distracted, or resentful. Encouraging changes during business hours, even for mission-critical systems, leads to better outcomes, including teams thinking of ways to make changes using transparent or incremental approaches, rather than a big-bang deployment.
Static stability is something you can see; it’s measured by variation in the condition of a system week by week or year by year. Resilience is something that may be very hard to see, unless you exceed its limits, overwhelm and damage the balancing loops, and the system structure breaks down. Because resilience may not be obvious without a whole-system view, people often sacrifice resilience for stability, or for productivity, or for some other more immediately recognizable system property.
—Donella H. Meadows
The research in Accelerate found that a loose coupling among teams was correlated with performance. Signs of loose coupling are teams that can make large-scale changes without outside permission or coordination, and teams that can complete work in its entirety, without coordinating it, or communicating it, outside of the team. Loose coupling is one of the attractions of microservice architectures in recent years, but there is a tension between loose coupling in relation to error and failure. The organic, federated growth of services can lead to unanticipated interactions and dynamics, or, ironically, even tight coupling that was not intended. People and teams tend to focus on their local problems, unable to grasp the larger impacts, which can introduce failure modes that were not anticipated. Functional programming, as I discussed in the previous article, can help ensure that programs are correct and remain correct over time, allowing teams to focus more on their local context. But with organic, federated growth, understanding and controlling system dynamics, at a high level, as I highlighted in the previous section, becomes increasingly important. When services are loosely coupled, maintaining the conceptual integrity of the system can be a real challenge.
The more details we can see, the less we have a sense of control.
Lastly, when systems are message-based and eventually consistent, it is important to consider how people interact with them and how the dynamics are perceived by the end user. This can be especially challenging when there are multiple perspectives of the system, which is common at the intersection of IoT and consumer applications. An IoT device may reflect one state locally, but this state may not have yet propagated to the cloud. Trying to view the same data in a mobile or web application may highlight discrepancies. For instance, the pricing information shown in a local user-interface may be different than the pricing shown on-line, since they propagate through different channels, each with their own dynamics, dependencies, and failure patterns. It is important to build experiences that are as consistent as possible and eventually converge. Considering the experience of the people interacting with the system, especially in the event of partial or intermittent failures, is critical for delivering a good customer experience.
Eliminating error in distributed software systems is impossible. These systems have complex relationships and emergent behaviours; messages are delayed, lost, and reordered; and, at scale, some components are always failing or unreachable. It is also a world where our view is only eventually consistent and always a view of the past. The good news is that the tools at our disposal for building and operating distributed software systems are better than they have ever been, but in order to effectively embrace failure, we need to understand how to compose these tools.
Bottom-up approaches for eliminating error—testing, the type system, functional programming, and formal verification—help us build and evolve more reliable systems, but these techniques only eliminate error within a certain context. Their guarantees do not compose into complex systems. Using a top-down approach and expecting a run-time or an infrastructure to eliminate error is also imperfect. Failure is very often a local consideration, extending all the way to individual entities, or even into the data model.
When systems become large it’s hard to make rules that work well for each local situation. Broad generalizations are sub-optimal for local variance.
We ultimately need run-times that embrace error, on multiple levels, as germane. We need run-times that help us model errors, rather than eliminate them. We need to view distributed software systems as dynamical systems. We need tools for adapting to and controlling system dynamics at run-time. Finally, these systems are built and operated by humans and they provide products and services that humans interact with. Therefore, we need to consider that fallible people are designing and operating these systems, and interacting with their final product.
Interestingly, despite Erlang's reputation for uptime, reliability, and scalability, it is a dynamically-typed language. ↩︎
I expanded on this idea in my Reactive Summit keynote and in my article Essential Software Tools for Developing Operational Technologies. See also this Tweet and this Tweet. ↩︎
See the Extended-Range Twin-Engine Operational Performance Standards (ETOPS). This is similar to a RAID where increasing the number of disks increases the chance of a single disk being in a filed state. ↩︎
Application developers should not be interfacing with the Reactive Streams protocol directly. Instead, they should be using a high-level library that uses the protocol under the hood. See my article Akka Streams: A Motivating Example for more information. ↩︎
See my series entitled Integrating Akka Streams and Akka Actors for an example of how streams and actors can be composed to build reactive applications. ↩︎
The strengths of the Akka Streams API relative to Spark Streaming and Kafka Streams are highlighted in the article Choosing a Stream Processing Framework: Spark Streaming or Kafka Streams or Alpakka Kafka? by Unmesh Joshi. See the following articles of mine to appreciate some of the sophisticated primitives that the Akka Streams API provides: Patterns for Streaming Measurement Data with Akka Streams, Maximizing Throughput for Akka Streams, Partitioning Akka Streams for Scalability and High-Availability, and Backoff and Retry Error-Handling for Akka Streams. ↩︎
Refer to Lorin's Conjecture. ↩︎
I wrote a series of articles on these problems: Kubernetes Liveness and Readiness Probes: How to Avoid Shooting Yourself in the Foot. ↩︎
See my article Observations on Observability. ↩︎
One of my favourite things on the Internet is Tim Lister's talk We're on a Mission From God: The Return of Peopleware. It has profound insights related to the intersection of teams, organizations, and quality. An example quotation, edited for brevity:
The interesting thing about it is this whole notion of pride of workmanship. What I think Deming is saying is, other rewards are crass and pride of workmanship can be driven out by dangling other rewards. You drive pride of workmanship out and then the quality of the product goes right into the dumpster. Pride of workmanship is fragile and anything you do where you try to dangle incentives away from the pride of workmanship is very dangerous. One of the things I see in jelled teams is there's kind of a cult of quality. On a good team, there's almost not even talk about quality, in the direct sense of this theoretical notion, it is all concrete stuff—let's look at this code: is it good enough to be part of the product? It's not a matter of, "What is the theory of quality?" There becomes a body of work that you're putting together. I look at good developers and their internal standard of quality is so above the standard of quality for the company that it is irrelevant to talk about company standards—it's just demeaning. A good engineer will just not do some things.
By compromising on quality, I mean delivering a lower quality than the team is comfortable with, not negligence through compromising on safety, security, or privacy, although I suspect this has happened in some organizations. ↩︎
This was one of my motivations for developing the Quality Views technique. ↩︎
Being woken up to deal with a failure, or even the anticipatory anxiety of being on-call, can have very detrimental impacts on our health and our work. With sleep, we have no mechanism for storing it, or making up for lost sleep, like we have fat for making up for times where food may be in short supply. I highly recommend reading the book Why We Sleep by Matthew Walker and listening to Peter Attia's interview with the author for understanding the importance of sleep for our health and our work. ↩︎
I have seen some amazingly creative solutions motivated by the constraint of making high-risk changes during normal business hours that have lead to much better customer, developer, and operational experiences. ↩︎
I expand on some of the challenges of operating, maintaining, and evolving a large number of microservices in my article Concordance for Microservices. ↩︎