Version Too

ballerina

A joke is not funny if you have to explain it, but this essay is my attempt to explain a joke: one that appeals to my sense of humour, but that also carries an important lesson about evolving software platforms.

Sounds Like Two

For over a decade, I worked on the leading historian for industrial automation. Most of the platform was written in C++. There was a utility library called piut that provided elementary classes—data types, data structures, reader-writer locks, file and network IO, and so forth. Like a lot of utility libraries, it grew organically and became a catch-all. Over time, the bloated library made extending and refactoring difficult.[1]

Driven by frustration, my colleague set out to break this library up and layer it more intelligently. It is difficult to untangle anything that has grown expediently, is used by a large number of people, and is included in a diverse array of projects. As a first step, his goal was to divide the library in two: a library of fundamental classes with few, if any, dependencies, and a second library of higher-order classes built on top of the first library. The foundational library would continue to be called piut. He called the second library piut_too.

That's right, not piut2, but piut_too. Thanks to the joys of the English language, when spoken, the “too” still sounds the same as “two”. People hated this name. Few appreciated the typographical joke. But my colleague also had a point: this wasn't another version of piut, or a new piut, it was piut too—meaning in addition, also, as well. It took me a while to warm up to the name, but eventually I didn't mind it and I loved the sense of humour.

These libraries were included in software for industrial automation, software that customers downloaded and installed on servers and workstations. Software in these environments was, and still is, very static, in part because industrial environments for manufacturing, the process industries, critical infrastructure, and so on, are risk averse, or governed by regulatory processes for change management. However, it was also due to the fact that a few decades ago, there were limited options for server virtualization, virtual networking, infrastructure-as-code, immutable deployments, container orchestration, and application monitoring. People generally installed and configured the software by hand, on bare metal, and that was that.

Due to these constraints, major changes in software platforms were literally “version two” and demanded upgrade paths that preserved data and the investments customers had made in application development. There was only one version of the software installed at any one time, upgrades were generally in-place, and there was little or no flexibility for running software in parallel. Fast forward to our modern cloud-computing platforms that include infrastructure-as-a-service, platforms-as-a-service, serverless computing, virtual networking, and orchestration platforms like Kubernetes, and we can think very differently about “version two”.

Sounds Like Too

A couple of years ago, the team I work on inherited a critical API that had concerning limitations. The code was blocking and single-threaded, which sometimes resulted in catastrophic service degradation when periodically refreshing security tokens. The service was written in a programming language that the team was not familiar with. It did not use our standard tooling or follow our usual conventions, making operational support difficult. Perhaps worse than no metrics, it had misleading metrics, making monitoring, alerting, and troubleshooting extremely challenging.[2] The code was dynamically typed, which led to undesirable run-time errors on a few occasions. Finally, the service was mixing concerns between serving authorization requests and requests for IoT-device metadata.

For the purposes of this essay, I will call this service the Device API. Our goal was to eliminate some functionality entirely, while decomposing the rest of the API into two different services, one for authorization and one focused on the Device API. As part of this work, we reimplemented the service in our primary programming language, adopted patterns for asynchronous, non-blocking execution, embraced more scalable database query patterns, and exposed detailed and reliable operational metrics.

I jokingly called the rewrite “Device API Too” and named the repository device-api-too. Just like my colleague and piut_too, people were not amused with my play on words. They didn't get the joke.[3] However, in this case, the name was even more fitting. Through flexible ingress routing, this service was literally part of the Device API—in addition, also, as well as—running alongside the original service for a couple of years, as we slowly decomposed functionality, a piece at a time.

Our approach was generally: identify a route to migrate; implement the route in Device API Too; use the flexible ingress routing to mirror all production requests for that route to Device API Too and evaluate it in shadow-mode; then, once we were satisfied, use the ingress routing to cut over all requests to Device API Too.
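As a rough illustration of those steps, here is a minimal Python sketch of per-route migration. It is not our ingress configuration or our code: the names IngressRouter, Mode, legacy_device_api, and device_api_too are hypothetical, the handlers are toy functions, and the mirrored request is evaluated in-process and synchronously, whereas a real ingress layer would normally mirror traffic out-of-band.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple

# Hypothetical handler shape: (route, request body) -> response body.
Handler = Callable[[str, str], str]


class Mode(Enum):
    LEGACY = "legacy"    # only the original service handles the route
    SHADOW = "shadow"    # original answers; the new service gets a mirrored copy
    CUTOVER = "cutover"  # only the new service handles the route


@dataclass
class IngressRouter:
    """Toy stand-in for flexible ingress routing (an assumption, not a real API)."""
    legacy: Handler
    replacement: Handler
    modes: Dict[str, Mode] = field(default_factory=dict)
    mismatches: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, route: str, request: str) -> str:
        mode = self.modes.get(route, Mode.LEGACY)
        if mode is Mode.CUTOVER:
            return self.replacement(route, request)

        response = self.legacy(route, request)
        if mode is Mode.SHADOW:
            # Mirror the request; the caller never sees the shadow response.
            try:
                shadow = self.replacement(route, request)
            except Exception as exc:  # a shadow failure must not affect callers
                shadow = f"error: {exc!r}"
            if shadow != response:
                self.mismatches.append((route, request))
        return response


def legacy_device_api(route: str, request: str) -> str:
    # Toy behaviour standing in for the original service.
    return f"{route} -> {request.strip().lower()}"


def device_api_too(route: str, request: str) -> str:
    # Deliberately forgets to strip whitespace so shadow-mode has something to find.
    return f"{route} -> {request.lower()}"
```

Migrating a route then amounts to moving its entry in modes from LEGACY to SHADOW, watching mismatches (and, in practice, the new service's metrics) until they are understood, and finally switching the entry to CUTOVER. Routes with no entry keep going to the original service, which is what lets the two services run side by side.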

Rather than a big-bang rewrite, we were able to deliver value incrementally and use small experiments to inform our approach.[4] Even when some routes were only running in shadow-mode, we derived enormous value just from having rich operational metrics that were not available to us before. We were also able to innovate with Device API Too, as long as it didn't involve database schema changes that would impact the original Device API.[5]

This is a technique some call the Strangler Fig pattern, although this is not a term I have heard people use. The flexibility provided by modern cloud platforms makes this technique more applicable than ever, far more applicable than with the limited options we had in the industrial computing environments I used to work in. An area that is still ripe for innovation, however, is bringing the same kind of flexibility to industrial computing, especially the edge computing devices that make up the IoT.[6]

We are not quite done, but soon, once the original Device API is completely retired, Device API Too will become the Device API. In other words, we will drop the “Too”. The next time you decompose or rewrite a software platform to create “version two”, think about how you can make “version too” instead, evolving incrementally to provide value, reduce risk, and learn new things at each step along the way.[7]


  1. The original development of this library started before the C++ Standard Template Library existed. ↩︎

  2. Because of the way it used a pool of single-threaded workers, scraping metrics would query a different worker each time and intersperse discontinuous counter values from the different workers, rendering them useless. ↩︎

  3. Each time someone new joined the team, I needed to re-explain the joke. At least now I can point people to this essay. ↩︎

  4. Joel Spolsky, in Things You Should Never Do, Part I, argues for evolving software incrementally rather than rewriting it from scratch; the incremental approach is the one we took. See this thread from Michael Feathers for some provocative ideas on rewriting software related to legacy code, organizational change, and the rate of change of each. ↩︎

  5. One of the innovations that didn't require a schema change was the introduction of fast and flexible hierarchical search using the Postgres ltree extension. ↩︎

  6. For speculation on how I see industrial computing and IoT evolving, see The State of the Art for IoT. ↩︎

  7. Why the picture of a ballerina from Swan Lake? Because Swan Lake is a story of transformation. Also, “tutu” sounds like “two, too”, so why not. ↩︎