Predicting the Future of Distributed Systems

There are significant changes happening in distributed systems. Object storage is becoming the database, tools for transactional processing and analytical processing are becoming one and the same, and there are new programming models that promise some combination of superior security, portability, management of application state, or simplification. These changes will influence how systems are operated in addition to how they are programmed. While I want to embrace many of these innovations, it can be hard to pick a path forward.

If a new technology only provides incremental value, most people will question the investment. Even when a technology promises a step change in value, it can be difficult to adopt if there is no migration path, and risky if it will be costly to back out should it turn out to be the wrong investment. Many transactional and analytical systems are starting to use object storage because there is a clear step change in value, and optionality mitigates many of the risks. However, while I appreciate the promise of new programming models, the path forward is a lot harder to grasp. If people can’t rationalize the investment, most will keep doing what they already know. As the saying goes, nobody gets fired for buying IBM.

I have been anticipating changes in transactional and analytical systems for a few years, especially around object storage and programming models, not because I’m smarter or better at predicting the future, but because I have been lucky enough to be exposed to some of them. While I cannot predict the future, I will share how I’m thinking about it.

One-Way-Door and Two-Way-Door Decisions

On Lex Fridman’s podcast,[1] Jeff Bezos described how he manages risk from the perspective of one-way-door decisions and two-way-door decisions. A one-way-door decision is final, or takes significant time and effort to change. For these decisions, it is important to slow down, ensure you are involving the right people, and gather as much information as possible. These decisions should be guided by executive leadership.[2]

Some decisions are so consequential and so important and so hard to reverse that they really are one-way-door decisions. You go in that door, you’re not coming back. And those decisions have to be made very deliberately, very carefully. If you can think of yet another way to analyze the decision, you should slow down and do that.
—Jeff Bezos

A two-way-door decision is less consequential. If you make the wrong decision, you can always come back and choose another door. These decisions should be made quickly, often by individuals or small teams. Executives should not waste time on two-way-door decisions. In fact, doing so destroys agency.[3]

What can happen is that you have a one-size-fits-all decision-making process where you end up using the heavyweight process on all decisions, including the lightweight ones—the two-way-door decisions.
—Jeff Bezos

It is critical that organizations correctly identify one-way-door and two-way-door decisions. It is costly to apply a heavyweight decision-making process to two-way-door decisions, but even more costly is a person or a small team making what they think is a two-way-door decision when it is really a one-way-door decision now imposed on the whole organization for years to come. Many technology choices are one-way-door decisions because they require significant investments and are costly and time-consuming to change.[4]

Object Storage

Cloud object storage is almost two decades old. While it is very mature and incredibly reliable and durable, it continues to see a lot of innovation.[5] Given the attention paid to backwards compatibility, systems integration, and interoperability, almost every investment in object storage feels like a two-way-door decision, and this will continue to accelerate adoption, investment, and innovation.[6]

Over a decade ago, I built a durable message queue for sharing industrial data between enterprises using Azure Blob Storage page blobs. Because the object store provided object leases, atomic writes, durability, and replication, we could build a simple and reliable system without having to worry about broker leader election, broker quorums, data replication, data synchronization, or other challenges. The read side was decoupled from the write side, stateless, and could be scaled completely independently. While the company I was working for couldn’t figure out how to leverage this infrastructure to its fullest extent, I knew that relying on object storage was an architecture with many advantages, and I expected to encounter it again.[7]
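To make this concrete, here is a rough sketch of the coordination primitive I’m describing, using the current azure-storage-blob Python SDK rather than the page-blob APIs we used at the time; the container, blob, and connection-string values are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Connect to the storage account (the connection string is a placeholder).
service = BlobServiceClient.from_connection_string("<connection-string>")
segment = service.get_blob_client(container="queue", blob="segment-00000001")

# The writer acquires an exclusive lease on the blob, so only one producer
# can write a segment at a time -- no broker leader election required.
lease = segment.acquire_lease(lease_duration=15)
try:
    segment.upload_blob(b"batch of messages", overwrite=True, lease=lease)
finally:
    lease.release()

# Readers are stateless and scale independently: they simply download
# segments by name, relying on the store's durability and replication.
data = service.get_blob_client(container="queue", blob="segment-00000001") \
              .download_blob().readall()
```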

Fast forward to today and there are many systems—everything from relational databases, time-series databases, and message queues to data warehouses and services for application metrics—using object storage as a core part of their architecture, including for transactional workloads and not just analytical workloads, archival storage, or batch processing.[8] In addition, object storage features have expanded to include cross-region replication, immutability, object versioning, tiered storage, backup, read-after-write consistency, conditional writes,[9] encryption, metadata, authorization, and more. These features can be used to address industry regulation, compliance, cost optimization, data lifecycle management, disaster recovery, and much more, in a standard way across services, without having to build this functionality directly into each application. So not only is object storage attractive from an architectural perspective, it is also a win for simplicity and consistency.
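As one small example, the conditional writes mentioned above let multiple writers safely race for the same key, which is enough to build coordination on top of the object store alone. A hedged sketch using boto3 (the bucket and key are placeholders, and the IfNoneMatch parameter requires a recent version of the SDK):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Conditional write: only succeed if no object exists at this key yet.
# Multiple writers can race to claim the key and exactly one wins, which
# is a building block for leader election or exactly-once commits.
try:
    s3.put_object(
        Bucket="my-bucket",           # placeholder bucket name
        Key="commits/00000042.json",  # placeholder key
        Body=b'{"state": "committed"}',
        IfNoneMatch="*",              # fail if the key already exists
    )
    print("this writer won the race")
except ClientError as e:
    if e.response["Error"]["Code"] == "PreconditionFailed":
        print("another writer already committed this key")
    else:
        raise
```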

While I can’t predict exactly how object storage will evolve, I expect the popularity of object storage to increase, especially for transactional and analytical systems. For example, storing data in Parquet files in Amazon S3 feels like a pretty safe bet. I expect read performance will continue to improve through reduced latency, increased bandwidth, improved caching, or better indexing, because it is something that will benefit the huge number of applications using S3.[10] If another storage format becomes more attractive than Parquet, I trust I can use an open table format, like Apache Iceberg or Delta Lake, to manage this evolution if I don’t want to reprocess the historical data. If I do want to reprocess the data, I can rely on the elasticity of cloud infrastructure to reprocess files when they are accessed, or as a one-time batch job. I’m not worried about choosing an open table format, because they all seem excellent, they are converging on a similar set of features, and they will undoubtedly support interoperability and migration. Similarly, if I rely on an embedded library for query optimization and processing, like DuckDB or Apache DataFusion,[11] I expect them to continue to improve and share similar features.[12] In other situations, I might rely on Amazon Athena, Trino, Apache Spark, Pandas, or Polars for data processing. Tools will continue to improve for importing data from, or exporting data to, relational databases, data warehouses, and time-series databases. If I want to run the same services using another cloud provider, or in my own datacenter, there are other object storage services that have S3-compatible APIs.[13] In other words, lots and lots of two-way doors. Actually, it is an embarrassment of riches.
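As a rough illustration of how little machinery this requires, here is a sketch of querying Parquet files directly in S3 with DuckDB; the bucket, region, and column names are made up.

```python
import duckdb

con = duckdb.connect()
# httpfs gives DuckDB direct access to objects in S3 (or any S3-compatible store).
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # region is illustrative

# Query the Parquet files in place: no ingestion step, no warehouse cluster.
df = con.execute("""
    SELECT sensor_id, avg(temperature) AS avg_temp
    FROM read_parquet('s3://my-bucket/measurements/*.parquet')
    GROUP BY sensor_id
""").df()  # materialize as a pandas DataFrame
```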

Object storage is also a very simple storage abstraction. Embedded data processing libraries, like DuckDB and Apache DataFusion, can use the local file system interchangeably with object storage. This opens up the opportunity to move workloads from distributed cloud computing infrastructure and embed them directly in a single server, move them client-side into a web browser, or even embed them in IoT devices or industrial equipment controlling critical infrastructure.[14] The ability to move workloads around to meet changing requirements for availability, scalability, cost, locality, durability, latency, privacy, and security opens up even more two-way doors. With object storage, it’s two-way doors all the way down.
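A minimal sketch of what that interchangeability looks like in practice, assuming a hypothetical layout of Parquet files; the only thing that changes between the cloud deployment and the local one is the path.

```python
import duckdb

def count_measurements(base: str) -> int:
    """Run the same query whether `base` is a local directory or an S3 prefix."""
    con = duckdb.connect()
    con.execute("INSTALL httpfs;")  # needed for s3:// paths; harmless locally
    con.execute("LOAD httpfs;")
    return con.execute(
        f"SELECT count(*) FROM read_parquet('{base}/measurements/*.parquet')"
    ).fetchone()[0]

# Same workload, different placement: in the cloud, on a single server,
# or embedded at the edge -- only the path changes.
count_measurements("s3://my-bucket")      # object storage (placeholder bucket)
count_measurements("/var/data/plant-7")   # local file system (placeholder path)
```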

Programming Models

The most disruptive change in the next decade may be how we program systems: a fundamental change in how software is developed and operated—and even in what we view as software and what we view as infrastructure—that most people have yet to grasp.[15][16] Many fail to see the value, and almost everyone is skeptical of how we get from here to there. While I believe the eventual outcomes are clear, the path forward is anything but. The fact that everything seems like a one-way door is hindering adoption.

I have been anticipating a shift in programming models for many years, not through any great insight of my own, but through my experiences building systems with Akka, a toolkit for distributed computing, including actor-model programming and stream processing. I saw how these primitives solved the challenges I had been working on for fifteen years in industrial computing—flow control, bounded resource constraints, state management, concurrency, distribution, scaling, and resiliency—and not just in logical ways, but from first principles. For example, actors provide a means of modelling entities, like IoT devices, and managing their state in a thread-safe way, while leaving the execution and distribution of those entities up to the run-time. Reactive Streams provides a way for systems to interface and interoperate, expressing the logic of the program while letting the run-time handle the system dynamics in a reliable way. I could see how these models would logically extend to stateful functions and beyond, as I described in my keynote talk From Fast-Data to a Key Operational Technology for the Enterprise in 2018.
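This is not Akka, but a minimal sketch of the actor idea in Python using asyncio: each device is modeled by an actor that owns its state and processes one message at a time from a bounded mailbox, so the state needs no locking and the run-time decides when the actor executes. The DeviceActor class and its fields are purely illustrative.

```python
import asyncio

class DeviceActor:
    """Owns the state for one IoT device; processes messages one at a time."""

    def __init__(self, device_id: str):
        self.device_id = device_id
        self.last_reading = None
        # A bounded mailbox provides back-pressure instead of unbounded memory growth.
        self.mailbox: asyncio.Queue = asyncio.Queue(maxsize=100)

    async def run(self):
        while True:
            message = await self.mailbox.get()
            # Only this coroutine touches self.last_reading, so the actor's
            # state is effectively single-threaded and needs no locks.
            self.last_reading = message

async def main():
    actor = DeviceActor("pump-42")
    runner = asyncio.create_task(actor.run())  # the run-time schedules the actor
    await actor.mailbox.put({"temperature": 71.3})
    await asyncio.sleep(0)  # yield so the actor can process the message
    print(actor.last_reading)

asyncio.run(main())
```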

Today, there are many systems trying to solve these challenges from one perspective or another. If you squint, they break down into roughly three categories. The first category is systems that abstract the most difficult parts of distributed systems, like managing state, workflows, and partial failures. These are systems like Kalix, Dapr, Temporal, Restate, and a few others. These systems generally involve adopting the platform APIs in your programming language of choice. In the second category, in addition to abstracting some of the difficult parts of distributed systems, the platform will execute arbitrary code in the form of a binary, a container, or WebAssembly. Included in this category are wasmCloud, NATS Execution Engine,[17] Spin, AWS Fargate, and others. The final category is the somewhat uncategorizable, because these systems are so unique: Golem, which, if I understand correctly, uses the stack-based WebAssembly virtual machine to execute programs durably,[18] and Unison, which is an entirely new programming language and run-time environment.

However attractive or well engineered these solutions are, ten years from now, not all of these technologies, or the companies developing them, will exist. Even with the promise of solving important problems and accelerating organizations, it is nearly impossible to pick a technology because of this huge investment risk. Furthermore, so much of what matters is the quality and maturity of the tools for building, deployment, static analysis, debugging, performance analysis, and all the rest, and most engineers are uncomfortable giving up control over the whole stack.[19] Adding to the skepticism are questions about how AWS, Azure, Cloudflare, and the other cloud service providers will enter this market with their own integrated and potentially ubiquitous solutions. At the moment, it seems like one-way door after one-way door.

As I see it, the biggest opportunity for a new programming model is extracting the majority of the code from an application and moving it into the infrastructure instead. The second biggest opportunity is for the remaining code—what people refer to as the business logic, the essence of the program—to be portable and secure. A concrete example will help demonstrate how I’m thinking about the future.

In addition to the business logic, embedded in almost all modern programs are HTTP or gRPC servers for client requests, libraries for logging and metrics, clients for interfacing with databases, object storage, message queues, and lots more. Depending on when each application was last updated, built, and deployed, there will be many versions of this auxiliary code running in production. To patch a critical security vulnerability, just finding the affected services can be an enormous undertaking.[20] Most organizations do not have mature software inventories, but even if they do, the inventory only helps with identifying the services; they still need to be updated, built, tested, and redeployed. If, instead of being embedded in the application binary, the HTTP servers, logging libraries, database clients, and all the rest can move down into the infrastructure, then these resources can be isolated, secured, monitored, scaled, inventoried, and patched independently from the application code, much as monitoring, upgrading, securing, and patching the servers underneath a Kubernetes cluster is transparent to the application developer today.[21] If the business logic can be described and executed like this, then it also becomes possible to move code between environments, like between the cloud and the IoT edge, or between service providers.[22]
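To make this separation concrete, here is a hypothetical sketch: the business logic is a plain function that depends only on abstract capabilities, and the platform, not the application binary, supplies the HTTP server, the database client, and the logging pipeline behind those interfaces. The KeyValueStore, Logger, and handle_order names are inventions for illustration, not any particular platform’s API.

```python
from typing import Protocol

class KeyValueStore(Protocol):
    def get(self, key: str) -> bytes | None: ...
    def put(self, key: str, value: bytes) -> None: ...

class Logger(Protocol):
    def info(self, message: str) -> None: ...

# The business logic: no HTTP server, no database driver, no logging library.
# The platform binds `store` and `log` to whatever infrastructure it runs on,
# and can patch, monitor, and scale those pieces without rebuilding this code.
def handle_order(order_id: str, payload: bytes, store: KeyValueStore, log: Logger) -> None:
    if store.get(order_id) is not None:
        log.info(f"order {order_id} already processed")
        return
    store.put(order_id, payload)
    log.info(f"order {order_id} accepted")
```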

To encourage adoption, new programming models must find ways to transform one-way-door decisions into two-way-door decisions. WebAssembly may help with this. WebAssembly offers a secure way to run portable code, and the WebAssembly Component Model could be the basis of a standard set of interfaces that more than one platform can provide.[23] There may be other ways these platforms can lower risk to encourage adoption, but the two most important things to me are: 1) not having to rewrite every application—in other words, some kind of migration path, rather than only greenfield adoption;[24] and 2) not being locked into a single provider should I want to move to a different platform, or move workloads from the cloud to my own datacenter, or into embedded IoT.

What is the Future?

There are major shifts happening in the software industry. In the future, distributed systems will look different. The decomposition of databases, transactional systems, and operational technology to incorporate object storage is well underway thanks to many two-way doors. New programming models could be very disruptive, but with so many one-way doors, the challenge of picking the technology winners and losers has never been harder. It is easier to keep doing what we already know.

In a distributed system, there is no such thing as a perfect failure detector.
We can’t hide the complexity...our abstractions are going to leak.
—Peter Alvaro

Programming a distributed system is hard because of the challenge of partial failures. Arguably, the success of object storage is partly due to abstractions that don’t hide all of the complexity. It remains to be seen how well new programming models can deal with partial failures without contorting the programming model itself. But these new systems are promising because they are getting back to basics, just with the lines of abstraction drawn in different places. This should result in systems that are simpler, more modular, with better separation of concerns, that are much easier to build, operate, maintain, secure, and scale. Perhaps the biggest question is, will the early adopters out-compete the others? Or will the rest of the industry catch up quickly once the new programming and operational models become clear? How safe is it to just keep doing what we already know?

Abstractions are going to leak, so make the abstractions fluid.
—Peter Alvaro

It is impossible to predict the future and I’m not going to pretend I can foresee it better than anyone else. However, I am confident in the macro trends of continued investment in object storage and, some day, the widespread adoption of new programming models that move more code down into the infrastructure. It will be fun to look back in a few years.


  1. This is an excellent podcast on the leadership of complex organizations and products. Another highlight for me is Jeff describing his meeting culture. ↩︎

  2. The 2015 Amazon Letter to Shareholders also described the concept of one-way and two-way doors. Thanks to my friend John Mayerhofer for suggesting I incorporate this idea in this essay. ↩︎

  3. For more on agency, see my essay The Importance of Agency. ↩︎

  4. I’ve been through many platform and service migrations. Even when they are right or necessary, they can take years. In his book An Elegant Puzzle, Will Larson notes that growing organizations will always be in the middle of a migration—“growth makes migrations a way of life”—and it is important to be good at them: “If you don’t get effective at software and systems migrations, you’ll end up languishing in technical debt.” ↩︎

  5. For a deep dive into the history and incredible engineering of Amazon S3, see Andy Warfield’s article and talk: Building and operating a pretty big storage system called S3. Marc Olson also published a deep dive into block storage: Continuous reinvention: A brief history of block storage at AWS. ↩︎

  6. Because of the volume of data stored in cloud object storage, it is difficult and expensive to migrate to alternative solutions. In a very positive way, this forces vendors to pay a lot of attention to migration paths and backwards compatibility. ↩︎

  7. For more information on this architecture, see Shared-Nothing Architectures for Server Replication and Synchronization. ↩︎

  8. Examples include Amazon Aurora, InfluxDB, WarpStream, Snowflake, and Grafana Mimir. See Chris Riccomini’s article Databases Are Falling Apart: Database Disassembly and Its Implications for a comprehensive exploration of this topic. ↩︎

  9. Azure Blob Storage has supported conditional writes for many years. Amazon S3 just added this feature. ↩︎

  10. For example, Amazon recently introduced S3 Express One Zone which offers millisecond latency, the AWS Common Runtime (CRT) libraries which can saturate network bandwidth for S3 file transfer, and Mountpoint for Amazon S3 for mounting S3 buckets on the local file system. It is also possible to rely on the cloud provider’s infrastructure and optimizations, many of which are only possible at scale, rather than solving these problems yourself. See this example from Andy Warfield. ↩︎

  11. DuckDB and Apache DataFusion are both incredibly high-quality open-source projects and both are led by incredibly good engineers. These libraries are blurring what can be done inside a database versus outside a database. Databases and object stores are starting to meet in the middle. See the talks DuckDB Internals by Mark Raasveldt and Building InfluxDB 3.0 with Apache Arrow, DataFusion, Flight and Parquet by Andrew Lamb. ↩︎

  12. There can be too much focus on performance, rather than reliability, security, ease of use, and other important factors. Databases and query optimizers tend to converge on performance, because as soon as one discovers a performance optimization, the others also implement it. In addition, most innovations in databases and query optimizers outside of SQL tend to eventually be implemented in SQL. For more on these topics, see the article Perf Is Not Enough by Jordan Tigani and the talk A Short Summary of the Last Decades of Data Management by Hannes Mühleisen. ↩︎

  13. For example, Cloudflare R2 and MinIO. ↩︎

  14. I expand on this topic in my position paper Object Storage and In-Process Databases are Changing Distributed Systems. ↩︎

  15. Software developers are often accused of wanting to try the latest fads and tinker with their code until it is perfect. In an industry that changes rapidly, it is important to evaluate trends, but the vast majority of developers I have worked with are focused on shipping—satisfaction only comes when their work is in the hands of customers and providing value. Perhaps because I’ve always worked on industrial software, the engineers I have worked with have also preferred simple and practical solutions that maximize reliability. If I reflect on the small number of people I have seen struggle—people who could be accused of tinkering or preferring complexity—it is usually because either the objectives of the work were not clear or the individuals lacked the engineering skills to complement their programming skills. ↩︎

  16. One of the last big changes in infrastructure was the emergence of Kubernetes over Mesos, YARN, Ansible, Chef, and other technologies for managing the infrastructure itself. See the Deployment Coolness Spectrum from Jay Kreps. Kubernetes has accelerated many organizations by decoupling the management of infrastructure from the development of software, especially when managing infrastructure across multiple environments or teams. ↩︎

  17. My impression is that NATS included containers not to become the next container run-time, but because people are not yet ready for the leap to WebAssembly. ↩︎

  18. Golem can make imperative code resilient because Golem itself uses event sourcing for the deterministic WebAssembly instructions. It remains to be seen how Golem will adapt to multi-threaded WebAssembly. See the talk Building Durable Microservices with WebAssembly by John A. De Goes for more information. ↩︎

  19. In considering these new programming models, we would be well served not to forget Joel Spolsky’s essay The Iceberg Secret, Revealed. ↩︎

  20. This is what organizations experienced as they scrambled to patch the Log4J Log4Shell remote code execution vulnerability. ↩︎

  21. This point is better illustrated visually: see this part of my talk from S4. ↩︎

  22. It also becomes possible to experiment with techniques for continuous improvement to optimize performance or cost. This is a common technique in the process industries that I wrote about in Observations on Observability. ↩︎

  23. The interfaces could include object storage, relational databases, key-value stores, message queues, application logs, service metrics, authentication, and authorization. ↩︎

  24. I managed to make it this far into an essay about the future of programming distributed systems without mentioning generative artificial intelligence (AI) or large language models (LLMs). Perhaps they have a place in rewriting or porting code, as recently described by Andy Jassy. ↩︎