Essential Software Tools for Developing Operational Technologies
This article expands on a keynote that I gave at Reactive Summit 2018. I also discussed these topics on the Real-World Architecture Panel at QCon San Francisco 2018.
An operational technology combines hardware and software to monitor and/or control the physical state of a system. Examples include systems used in power generation, manufacturing, connected vehicles, and appliances and devices in our homes. These systems provide critical services that we rely on every day. The software that is used for monitoring and controlling these systems must operate with low latency—reacting to changes in near real-time—and it must be robust, reliable, available, safe, secure, and scalable.
Using software as an operational technology is not new—it has been used in industry for years—in manufacturing, telecommunications, the chemical process industries, and electrical generation, transmission, and distribution, to name a few. These systems were generally of a smaller scale[1] and they were purpose-built, using mainly proprietary software. They were not typically connected to the Internet, or even the corporate network, usually running on dedicated and isolated networks.
So what has changed? Operational technologies are now more connected and integrated with the rest of the business than they have ever been in the past. Many are also connected to the Internet and they are evolving to use cloud-native architectures and open-source software. The demands for monitoring and controlling devices and systems in near real-time, especially over the Internet and especially at Internet scale, present unique challenges in terms of performance, scalability, and security. Despite this evolution, the traditional demands remain: for software to be an operational technology, it must be low latency, robust, reliable, available, safe, secure, and scalable.
In this article, I will explore the software tools that I believe are essential for building operational technologies. I will start by highlighting the importance of managing state. Then, I will examine tools for distributed computing, messaging, and handling data in motion in a reliable manner, including streaming data and time-series data. I will conclude with some thoughts on failure. I am not going to focus on one particular technology or another, but rather treat these tools as more abstract building blocks. Earlier this year, I wrote a series of articles on the unique challenges of operationalizing time-series data. You may find that series a useful introduction, providing some context and motivation for what I will explore in this article.
Managing State
For industrial software, representing the state of a physical asset or system is essential for monitoring and controlling it. The state might be an instantaneous sensor measurement for one device—like the power output of a wind turbine, or the flow rate through a heat exchanger—or it could be a sequence of events that describes the current state—like the sequence of actions a robot took to achieve its current position. In addition to representing the state of an asset, it is common to model the relationships of the asset to other assets, maintaining hierarchical, directional, proximal, or temporal associations.
Equally important to modelling the instantaneous state of an asset is using a state machine to describe whether the asset is online or offline, charging or discharging, emptying or filling, or in one of a set of modes, like stopped, starting, running, idle, manual, automatic, or a mixture thereof. A state machine allows for different behaviours—expressed in software—depending on the state of the asset, like representing uncertainty in the last value reported from a sensor if the asset is offline. It can be important to persist the current state so that when the application restarts, it can represent the last state observed. Event Sourcing is a popular technique for recording the state of an entity and it supports reconstructing the last state observed, either from a snapshot, or by replaying the set of observations that led to it.
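As a minimal, library-agnostic sketch of this idea, consider a valve whose current state is never stored directly, but is instead reconstructed by folding the recorded observations over a snapshot; the ValveEvent and ValveState types here are purely illustrative:

```scala
// The current state of the valve is derived from the recorded observations,
// optionally starting from a snapshot, rather than being stored directly.
sealed trait ValveEvent
final case class Opened(timestamp: Long) extends ValveEvent
final case class Closed(timestamp: Long) extends ValveEvent

final case class ValveState(isOpen: Boolean, lastChanged: Long)

object ValveState {
  val initial: ValveState = ValveState(isOpen = false, lastChanged = 0L)

  // Replay events on top of a snapshot (or the initial state) to
  // reconstruct the last state observed after a restart.
  def replay(snapshot: ValveState, events: Seq[ValveEvent]): ValveState =
    events.foldLeft(snapshot) {
      case (_, Opened(t)) => ValveState(isOpen = true, lastChanged = t)
      case (_, Closed(t)) => ValveState(isOpen = false, lastChanged = t)
    }
}
```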
Developing scalable and efficient software services for monitoring or controlling assets or processes will almost certainly demand shared access to this mutable state from multiple threads. Introducing reader-writer locks, or other mechanisms of mutual exclusion and coordination, significantly complicates these software services and commonly results in race conditions, data corruptions, or deadlocks.[2] Significant development time and operational attention will be dedicated to the mechanics of writing correct, multi-threaded software, diverting attention from writing software that effectively models, monitors, or controls the assets or processes themselves.
My preferred tool for modelling state, while also supporting efficient and safe, multi-threaded programming, is the actor model. The actor model is an approach to concurrent computation. The actor, which is essentially just an object, is the unit of computation. It can maintain mutable state, safely, in that the only way to interact with an actor is by sending it an immutable message, freeing the programmer from having to manage locks. The actor is thread-safe because it only processes one message at a time, while subsequent messages are queued in its mailbox. In response to a message, the actor can update its state; perform side effects, like reading from or writing to a database; create more actors; or send messages to other actors, including asynchronously responding to the message that it received.
The actor is a wonderful tool for modelling in operational technologies. Because an actor is just an object, it is extremely lightweight. This means that an actor can be used to model an individual asset—a turbine, a battery, a thermostat, a vehicle—or a process—a transaction, a discharge cycle, a journey, a control-variable excursion—and thousands or millions of these actors can be instantiated to represent the state of thousands or millions of physical assets or processes. This approach is sometimes referred to as a digital twin. The actor can represent the instantaneous state of the asset or process, its relationship to other assets or processes, and it can execute a state machine, changing its behaviour based on the latest input.
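To make this concrete, here is a minimal sketch of a digital-twin actor using the Akka Typed API—one of several actor runtimes—modelling a hypothetical wind turbine. The actor holds the latest power measurement and executes a simple online/offline state machine by switching behaviours; the Turbine protocol and its messages are illustrative, not taken from any particular system:

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

object Turbine {
  sealed trait Command
  final case class RecordPower(kilowatts: Double) extends Command
  case object GoOffline extends Command
  case object GoOnline extends Command

  def apply(id: String): Behavior[Command] = online(id, lastPower = 0.0)

  private def online(id: String, lastPower: Double): Behavior[Command] =
    Behaviors.receiveMessage {
      case RecordPower(kw) => online(id, kw)          // update the instantaneous state
      case GoOffline       => offline(id, lastPower)  // state-machine transition
      case GoOnline        => Behaviors.same
    }

  // While offline, the last known value is retained but can be treated as uncertain.
  private def offline(id: String, lastKnownPower: Double): Behavior[Command] =
    Behaviors.receiveMessage {
      case GoOnline        => online(id, lastKnownPower)
      case RecordPower(kw) => online(id, kw)          // a fresh measurement implies the device is back
      case GoOffline       => Behaviors.same
    }
}
```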
I worked for many years on a multi-threaded, multi-process server application written in C++. I really enjoyed the intellectual challenge of this work and the exposure it gave me to the fundamentals of memory allocation, data structures, threads, synchronization methods, and IO. However, I have been using actor-model programming for the past three years and I haven't had to deal with any of the traditional multi-threading issues, like race conditions, data corruptions, or deadlocks. There are a few things that I miss,[3] but, in general, not having to worry about these issues is liberating. For the types of software systems that I work on, the actor model is a great tool for modelling assets and processes, managing state, and managing concurrency.[4]
Distributed Computing
Traditionally, industrial software was proprietary, purpose-built, and often very expensive. The software systems would generally be co-located with the assets they were monitoring or controlling, typically managed by the operational technology team, rather than the information technology team. Because the software was monitoring or controlling a relatively small number of assets and ordinarily serving a dedicated purpose, a single server, or a small number of collaborating servers, was usually sufficient for running these workloads. A few trends have converged to bring more distributed computing to industrial software.
With the need to share the data from these systems with many other services—integrating with business systems for reporting, analysis, and asset management; training and executing models for condition-based maintenance; or updating a data warehouse for fleet-wide analysis—these systems are no longer built for a single purpose. This polyglot nature means that these systems are more connected with other software systems, rather than running on strictly isolated networks dedicated to monitoring, command, and control. Many of these systems are now also connected to the Internet, either to provide auxiliary services—like aggregation in a data warehouse—or to enable fundamentally new services and business models—like peer-to-peer ride-sharing services, or distributed energy-storage—that involve monitoring and controlling assets over the Internet, from anywhere in the world.
These increasing demands have pushed the limits of these systems in terms of scalability. Part of this comes from the need to collect and process more and more data, but especially with the rise of the Internet of Things (IoT), people are now building systems for monitoring and controlling physical assets at Internet-scale, where there could be thousands or millions of assets—thermostats, lighting, cameras, vehicles, distributed energy-generation and storage. The need to distribute diverse workloads reliably, seamlessly, and securely, across a collection of computing resources, has become critical.
Here again, my preferred tool for addressing these demands is the actor model. As I already mentioned, using an actor to represent the state of an asset or process, using a digital-twin approach, can be really advantageous for managing state and concurrency. In addition, considering the actor as the autonomous unit of computation and distribution supports the development of distributed architectures for these systems in a wonderfully complementary way.
A particularly powerful tool is the ability to shard many similar actors across a collection of servers in such a way that any individual actor can be referenced—so that you can send it a message or query it—without needing to know which server it is running on. This enables modelling thousands or millions of assets and processes—vehicles, turbines, thermostats—in a dynamic, transparent, and scalable manner.[5] As the number of assets or processes you are modelling changes, or as the servers they are running on are stopped, started, restarted, scaled up, scaled down, or failed, the runtime will automatically rebalance the sharded actors that are representing these assets. The combination of location transparency, dynamic scaling, and recovery from failure relieves the programmer from having to tackle many challenging distributed-systems problems, while still providing the concurrency primitives and low-latency messaging of the actor model.[6]
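As an illustration of this pattern, the following sketch uses Akka Cluster Sharding to distribute the hypothetical Turbine actors from the earlier example across a cluster; any node can then address a turbine by its identifier without knowing which server is hosting its actor. The TurbineSharding object and its method names are assumptions for the sake of the example:

```scala
import akka.actor.typed.ActorSystem
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object TurbineSharding {
  // One sharded entity per physical turbine; the runtime decides which node
  // hosts each entity and rebalances entities as nodes join, leave, or fail.
  val TypeKey: EntityTypeKey[Turbine.Command] = EntityTypeKey[Turbine.Command]("Turbine")

  def init(system: ActorSystem[_]): Unit =
    ClusterSharding(system).init(
      Entity(TypeKey)(entityContext => Turbine(entityContext.entityId)))

  // Any node in the cluster can address a turbine by identifier, without
  // knowing which server its actor is currently running on.
  def reportPower(system: ActorSystem[_], turbineId: String, kilowatts: Double): Unit =
    ClusterSharding(system)
      .entityRefFor(TypeKey, turbineId)
      .tell(Turbine.RecordPower(kilowatts))
}
```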
Messaging
With the requirements for distributed computing and polyglot architectures for operational technologies, messaging among various software components, or even fine-grained entities, like actors sharded across a set of servers, becomes an essential element of the system. In order to react to events as quickly as possible, operational technologies are almost always event-based systems, with time-series data playing a prominent role. Most operational technologies process and react to an unbounded stream of immutable messages—facts from the past, like measurements from sensors, changes to assets and asset relationships, transactions in business systems, and even changes to the application software itself, like an auto-scaling event, or the automatic failover from a primary system to a backup system.
The dominant messaging pattern in so many software systems is REST—although many of these APIs are not actually RESTful, they are just HTTP APIs. With the need for low-latency and event-driven messaging in operational technologies, we need a lot more than just HTTP APIs for doing CRUD—create, read, update, and delete—operations. We need messaging that supports two-way communication, server push, publish-subscribe, and a variety of protocols, like MQTT, AMQP, OPC, Modbus, and so on.
Lots of attention has been paid to connecting systems together through some common messaging layer that supports message queuing and/or publish-subscribe—Kafka, RabbitMQ, NATS, Google Pub/Sub, Azure Queues, Azure Service Bus, AWS Kinesis, AWS Simple Queue Service, the list goes on. A robust messaging layer is certainly a huge improvement over managing a spaghetti of one-to-one service connections. In addition, durable message-queues and publish-subscribe messaging are essential tools for developing operational technologies. Traditionally, the challenge has been in connecting systems together in a reliable and scalable manner, but as these messaging infrastructures are becoming a commodity, we will see that the real difficulties with operational technologies exist not in connecting services together, but in the connections themselves.[7]
For shipping application log messages, in a standard format, to a service for searching and aggregation, it makes sense to use something off the shelf to easily connect things together. But for almost everything else, simply connecting services together is too simplistic. Representing latency, failure, and uncertainty in operational technologies—often all the way down to the level of an individual asset—is almost always an application-level decision and not something that is emergent from simply connecting systems together using one technology or another or by putting a service mesh between them. Undoubtedly, the messaging infrastructure needs to be first-class, but operational technologies are built on top of these infrastructures, not from the infrastructure itself. Programming models on top of these infrastructures are essential for modelling state, representing uncertainty, expressing business logic, or interfacing with legacy or proprietary systems.
Data in Motion
With the event-oriented nature of operational technologies, messages are processed in their denormalized form, as they arrive, in a streaming manner. Many of these data streams—like readings from a sensor—are continuous and unbounded. These systems may also periodically operate on huge but bounded datasets in a streaming manner, like importing or exporting data from a subsystem, processing messages from a device that has been offline for an extended period of time, or reprocessing historical data from a durable message-queue.
To deal with the complex dynamics of data in motion in a reliable way, operational technologies need to use backpressure to bound resource consumption and to ensure that faster systems cannot overwhelm slower systems. Some systems do this in a naive way by polling, blocking, or artificially throttling messages. The Reactive Streams specification is a set of interfaces for non-blocking flow-control and can be used to propagate backpressure across system boundaries, in an efficient manner. The specification itself is intended as a service-provider interface. In other words, it is for library developers. Application developers should not have to worry about the Reactive Streams interfaces—most of us should never encounter them—we should be using a higher-level abstraction that offers semantics for streams of messages, like mapping, filtering, throttling, partitioning, merging, and so on, with the Reactive Streams interfaces taken care of under the hood. Things like throttling or partitioning streams are not conceptually difficult, but the subtleties of their implementations can be, especially for unbounded streams of messages. These are not tools that I want to implement myself—I want them to be native to my streaming-data libraries.
One high-level, Reactive Streams library is the Akka Streams API. If you want a flavour of why libraries like this are so useful for developing operational technologies, see my articles Akka Streams: A Motivating Example and Patterns for Streaming Measurement Data with Akka Streams. Another area in which the Akka Streams API really shines in terms of messaging is the Alpakka project, which is a collection of Akka Streams APIs for interfacing with all sorts of systems using Reactive Streams. This returns to my theme of not wanting to just glue a bunch of systems together, but, instead, have tools that solve these fundamental problems in support of developing software applications. I expect to see similar ecosystems evolve around other stream-processing libraries.
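To give a flavour of what such a high-level abstraction looks like, here is a small sketch using the Akka Streams API to throttle and batch an unbounded stream of synthetic sensor readings; backpressure is propagated automatically through every stage by the underlying Reactive Streams machinery, and the Measurement type is purely illustrative:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object MeasurementStream extends App {
  implicit val system: ActorSystem = ActorSystem("measurements")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // An illustrative sensor reading.
  final case class Measurement(sensorId: String, value: Double)

  // An unbounded stream of synthetic readings, throttled and batched; the
  // downstream sink cannot be overwhelmed because backpressure is
  // propagated through every stage.
  Source
    .fromIterator(() => Iterator.from(0))
    .map(i => Measurement(s"sensor-${i % 10}", value = i * 0.1))
    .filter(_.value >= 0.0)        // drop invalid readings
    .throttle(100, 1.second)       // at most 100 messages per second
    .grouped(10)                   // batch for an efficient downstream write
    .runWith(Sink.foreach(batch => println(s"persisting ${batch.size} measurements")))
}
```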
Failure
Finally, a few words on failure. Since operational technologies monitor and control physical systems, these physical systems can fail, as parts wear, components fail, or fail-safe mechanisms intervene to prevent unsafe operation. The software systems monitoring and controlling these physical systems can also fail, as servers are restarted, or messages are delayed or lost in transit. Rather than viewing failure as something exceptional, it is best if the tools you are using embrace failure as part of the model.
Once again, the actor model is complementary for handling failure. An actor executing a state machine can be used to model failure states alongside operational states. For example, if an actor sends a command to a device but the device fails to acknowledge it, the actor can model this fact, representing the uncertainty, and it can attempt to send the command again, up to a maximum number of retries, or until a timeout interval expires. Actors themselves can be arranged in hierarchies, allowing higher-level actors to handle the failures of child actors. These supervising actors often have context that child actors do not have and can handle errors more effectively. They can also provide mechanisms for circuit breaking and backoff-and-retry error-handling. Actors are perfect for handling fine-grained failures—like modelling the fact that a device is offline and you cannot communicate with it—but it is complementary to have services for handling more coarse-grained failures, like the loss of a server supporting the software systems.
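As a minimal sketch of this kind of supervision using Akka Typed, a hypothetical DeviceConnection child actor—one that may throw when a device is unreachable—is restarted by its parent with exponential backoff, rather than the failure being treated as exceptional:

```scala
import akka.actor.typed.{Behavior, SupervisorStrategy}
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.duration._

object DeviceConnection {
  sealed trait Command
  case object Poll extends Command

  // A hypothetical connection to a device; polling may throw an exception
  // if the device cannot be reached.
  def apply(): Behavior[Command] = Behaviors.receiveMessage {
    case Poll =>
      // ... attempt to read from the device, which may throw ...
      Behaviors.same
  }

  // The supervising parent treats failure as expected, restarting the child
  // with exponential backoff rather than crashing the surrounding service.
  def supervised(): Behavior[Command] =
    Behaviors
      .supervise(apply())
      .onFailure[Exception](
        SupervisorStrategy.restartWithBackoff(
          minBackoff = 1.second, maxBackoff = 30.seconds, randomFactor = 0.2))
}
```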
When recovering from failure, it is natural to pick up where you left off and some messages or actions may be repeated. Whenever possible, it is best to design operational technologies to be idempotent, rather than expecting reliable messaging, or hoping for exactly-once delivery guarantees.
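For example, an idempotent command handler might track the identifiers of the commands it has already applied, so that a redelivered command has no additional effect; the SetpointCommand and ControllerState types in this sketch are purely illustrative:

```scala
// Commands carry an identifier, and a command that has already been applied
// is ignored, so redelivery after a failure is harmless.
final case class SetpointCommand(commandId: String, setpoint: Double)

final case class ControllerState(setpoint: Double, applied: Set[String]) {
  def handle(command: SetpointCommand): ControllerState =
    if (applied.contains(command.commandId)) this // duplicate delivery: no effect
    else copy(setpoint = command.setpoint, applied = applied + command.commandId)
}
```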
Conclusion
In this article, I explored the tools that I view as essential for developing software for operational technologies. Actors are valuable for individually modelling and managing the state of many assets and processes, by encapsulating state at a point in time, executing state machines, and providing a reliable model for concurrent computation. Actors also provide a unit of autonomy, allowing them to be distributed across a set of computing resources. Low-latency messaging that embraces a variety of models and protocols—particularly two-way communication, event-based messaging, durable message-queueing, and publish-subscribe messaging—is fundamental. A high-level, Reactive Streams library is key for addressing the dynamics of these message-based systems, providing bounded resource-constraints and succinctly expressing the common patterns found in streaming-data systems. Finally, embracing failure as essential to the system, rather than something exceptional, leads to more robust software and design choices that more closely model the realities of the physical assets and processes that the software is monitoring and controlling.
Not all of these software systems are small, and many have challenging scalability requirements, but they are small relative to Internet-scale systems. ↩︎
I wrote an article on the most difficult bug that I ever encountered in developing multi-threaded software. ↩︎
The things that I miss the most are probably deterministic destruction and the resource acquisition is initialization (RAII) pattern. ↩︎
I recently had a discussion with someone who developed a similarly complex system using Go. While he found Go's simplicity and concurrency model appealing, he said that data races still presented one of the biggest challenges in the system. He said that if he were to start the project again, he would likely use Rust, in part due to its support for preventing data races. ↩︎
Erlang, Akka, Orleans, Elixir, and Proto.Actor are the technologies that I am aware of that support distributed, actor-model programming. ↩︎
In a discussion with a friend of mine, he noted that he liked Go's concurrency model, but since it is strictly a local abstraction, recovering state after a failure of the entire process was a challenging problem in the systems that he had worked on. ↩︎
There are certainly lots of challenging distributed-systems problems in this space, but most of us will not be working on them directly, unless we are working for the infrastructure provider. ↩︎