I have worked on infrastructural software my entire career—mainly streaming-data systems for industrial applications. Infrastructural software is most often foundational, supporting other services and applications, rather than being the product itself. This can make it difficult to communicate the current state of the software, as well as its evolution, to people who are not intimately familiar with it. Generally, when a new feature is added to a customer-facing application, it is easy for colleagues and customers to appreciate the changes in the application, as well as the quality of the implementation. Infrastructural software, on the other hand, could undergo significant transformations with no appreciable changes to customer-facing behaviours. For example, a team might make major investments to improve the scalability of a service in aggregate—to support anticipated growth—but do so without changing the immediate experience an individual customer might have. Investments to improve security, testing, high-availability, monitoring, and deployment, can also fall into this category.
Since becoming a team lead, I have been looking for ways to communicate two things regarding the software our team is responsible for. The first is the current state of the software. I want to communicate the quality of the system in terms of business risks: the reliability, scalability, and security of the system, as well as how receptive the software is to change in order to meet future business needs. A legacy component that lacks test coverage, or developer expertise, could be extremely difficult, risky, and costly to change, whereas a well-designed, well-factored application, with good test coverage and developers familiar with the code, might be evolved reliably and efficiently. In addition, if a legacy component is effective in serving its current function, people often mistakenly assume it can be extended to support additional functionality, when this may not be the case. The second thing that I would like to communicate is the evolution of the system (essentially the product road map), highlighting where we are making investments and how the system is changing over time. I want to communicate these two aspects effectively within our team, and, equally importantly, externally, to business stakeholders, product managers, and colleagues who develop applications on top of our systems.
Inspired by Edward Tufte's approach to rich visual presentations of data, a former team lead of mine used a stacked bar-graph to communicate where investments had been made, where investments were currently being made, and where investments would be made in the near future, as well as how close the project was to overall completion.
This was a really effective technique for communicating with stakeholders on the project, as well as with us on the team, so much so that subsequent team leads adopted the same approach. I considered using this approach, and I will use it under some circumstances. However, it does not show where investments are not being made, nor does it highlight the overall quality of the software system or notable business risks.
A few months ago, I watched Michael Feathers's excellent presentation entitled Conway's Law and You: How to Organize Your Organization for Optimal Development. He covers many interesting topics related to communication within organizations, including the idea of using quality views to communicate software quality, particularly to non-technical people. He gives an example of a class-coupling diagram, contrasting a very large and coupled module with a module that is small and lacks coupling. He describes how even someone who is not a programmer can understand that making changes in the highly-coupled system will be more difficult, take longer, and present more risks than modifying the less-coupled system.
Inspired by this idea, I wanted to experiment with extending it to combine 1) our architecture diagram with 2) Tufte's dense visual presentation of information and 3) my colleague's stacked bar-graph approach, to show both the quality of the components and the evolution of our system. The rest of this article will chronicle my approach and experiences.
Consider the following fictitious IoT system as the foundation for developing quality views. The system consists of two load-balanced endpoints for data ingestion, one supporting the majority of clients and the other supporting a small number of legacy clients; a number of stateless web servers; a durable message-bus; an application that parses incoming messages and stores sensor data in a database; and a set of load-balanced servers serving an HTTP API for this data. The system also includes two legacy applications, one for performing a data mapping and another for storing a subset of production data in an external business system. The load balancers, the durable message-bus, and the database are either open-source technologies or services offered by a cloud-computing platform. The rest of the components are developed and maintained by the software development team.
The sizes of the elements represent the relative size of the servers in terms of memory, cores, and IO capacity. For example, the servers used for the load balancers for the legacy clients are smaller than the servers used for the load balancers for the regular clients. The directed line-segments represent the direction of data flow, as well as the volume of data—essentially the number of messages per second—based on the thickness of the line. The dashed lines represent security trust-boundaries with external systems.
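As a rough illustration of the line-thickness encoding, the sketch below maps a message rate to a stroke width. The function name, the linear scaling, and the width limits are all assumptions made for this example; the article does not say how the thicknesses in the diagram were chosen.

```python
def line_width(messages_per_second, max_rate, min_width=1.0, max_width=8.0):
    """Map a data-flow volume to a stroke width (hypothetical helper).

    Assumes a linear scale: a flow of zero gets min_width and a flow at or
    above max_rate gets max_width. The default widths are illustrative only.
    """
    if max_rate <= 0:
        raise ValueError("max_rate must be positive")
    # Clamp the fraction so unexpected rates never produce out-of-range widths.
    fraction = min(max(messages_per_second / max_rate, 0.0), 1.0)
    return min_width + (max_width - min_width) * fraction


# Example: a flow carrying half of the busiest link's volume.
print(line_width(500, max_rate=1000))  # 4.5
```

A logarithmic scale may be a better fit when the flows differ by orders of magnitude, since a linear scale would then render the small flows almost identically.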
This diagram is a rich vehicle for communication in that it describes the components that comprise the system, detailing what is inside the domain, what is outside the domain, what the interfaces are, how many servers there are, which servers are redundant, which ones are not, the magnitude of the data flows, and so on. This visual representation can become even richer, however, with the addition of information related to software quality and how the system is evolving.
To evaluate the quality of this software infrastructure, I chose to evaluate each component based on the following criteria. For a different system, you might choose a different set of criteria. Considering all of these criteria provides a broad perspective on the quality of the system. For example, the development team might think the code is in really great shape, but if the application can't be reliably deployed to production, or if an alert is not generated when the application is failing to support its business function, then the overall quality is poor.
- Is the code itself in good shape? Is it well factored? Is it maintainable? Is it reasonably easy to extend the application to add new functionality? For a component that the development team does not maintain directly, like an open-source database, this criterion can be used to judge how well the component is supporting the needs of the product, both in terms of functionality and reliability.
- Does the component have tests to support making changes without introducing regressions or undue risk? Are the tests repeatable and reliable, and can they be executed in a reasonable time-frame? Do the tests include the performance, scalability, reliability, and security requirements of the component?
- Can the component be deployed in an efficient, reliable, and repeatable manner, to support development, testing, and production environments?
- Is monitoring available such that someone developing, troubleshooting, or operating the system has insight into its behaviours? This should include both expected behaviours and unexpected behaviours. Monitoring might include, among other things, logging, real-time metrics collection (e.g., connection count or query latency), dashboards, notifications, and support for ad hoc data exploration.
- If the component is not serving its business function, is an alert generated such that someone can take action, in a timely manner, to rectify the situation?
- Does the component have security considerations consistent with business requirements? Has a threat model been developed for the component? Are there tests, or independent verification, of the most critical security requirements?
- Is the component designed for high-availability? Is it redundant, replicated, or load balanced where appropriate? Are there backups and have procedures to restore from backup been tested?
- Can the component be scaled up to meet dynamic loads, or growing demand, by adding more compute resources or instances?
- Does the component carry significant business risk? This might include code that was developed by someone who has since left the company and that no one else is familiar with; code that is not in one of the primary programming languages used by the development team; a component that lacks security measures; a component that lacks redundancy for high-availability, or a scale-out strategy to meet growing demand; or a component where the performance and scalability limitations are untested and, therefore, unknown.
Using these nine criteria, I developed a quality view for the system by scoring each component, one point for each category it satisfies, and mapping this score to a colour gradient. If a component satisfies none of the criteria, it scores 0 and is represented by dark red. If the component satisfies all of the criteria, it scores 9 and is represented by white.
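The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the diagrams: the criterion identifiers, the RGB value chosen for "dark red", and the linear interpolation between the two endpoint colours are all assumptions.

```python
# The nine criteria from the list above; the short identifiers are my own labels.
# "risks" counts as satisfied when the component carries no significant business risk.
CRITERIA = (
    "code", "tests", "deployment", "monitoring", "alerting",
    "security", "high_availability", "scalability", "risks",
)

# Assumed RGB endpoints: the text only says "dark red" and "white".
DARK_RED = (139, 0, 0)
WHITE = (255, 255, 255)


def quality_score(satisfied):
    """Return one point for each distinct criterion the component satisfies."""
    satisfied = set(satisfied)
    unknown = satisfied - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return len(satisfied)


def score_to_colour(score, worst=DARK_RED, best=WHITE):
    """Linearly interpolate between the worst and best colours; return a hex string."""
    if not 0 <= score <= len(CRITERIA):
        raise ValueError("score out of range")
    fraction = score / len(CRITERIA)
    channels = [round(w + (b - w) * fraction) for w, b in zip(worst, best)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


print(score_to_colour(quality_score([])))        # #8b0000
print(score_to_colour(quality_score(CRITERIA)))  # #ffffff
```

A hand-picked palette with one shade per score, indexed by the score, would work just as well as interpolation and gives more control over how distinguishable neighbouring scores are.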
For the purpose of demonstration, imagine this system evolved from a proof-of-concept supporting a speculative business venture. Now that the venture looks promising, we want to formalize this system into something more reliable and sustainable. The initial quality view is as follows.
All components lack monitoring and alerting. Most components have no tests—there are only a few end-to-end system tests, making it difficult to evolve components reliably or independently. Most components cannot be deployed in an automated fashion, making deployments burdensome, slow, and brittle. There are major concerns around the scalability of the data-processing pipeline and the database, but without being able to monitor these components, it is difficult to even construct performance and scalability tests.
After focusing on automating deployments, adding facilities for monitoring and alerting, and improving automated test coverage, the overall quality of the system improved significantly.
There are components that did not evolve at all, however, most notably the Data Mapping service and the Legacy Data Logger. Both of these components were developed in a programming language and on a platform that the business no longer wants to use. These prototype applications were not developed with monitoring or testing in mind. The decision was made not to retrofit these two applications for monitoring or alerting and accept the risks. The quality view clearly highlights this decision. Characterization tests were developed for the Data Mapping service, since it will eventually be replaced with a new service that will perform a similar function. The Legacy Data Logger will eventually be eliminated, so no investments were made with regard to testing this component.
The quality view invites a holistic view of the system. When I presented a similar scenario at a quarterly-planning meeting, our Director of Product Development remarked, "How come those two components are still so red!?" A constructive discussion followed, where we all came to the agreement that investing in these components would not be cost effective. We were comfortable accepting the short-term risks. This discussion wouldn't have taken place if I had only presented the things that we were working on. Highlighting what we were not working on was key to having this discussion. Instead of these risks being ignored, or only understood by our team, they were considered explicitly, by everyone, and it helped align everyone's understanding of the system.
The most significant risks in the system remained in the data-processing pipeline. The Data Logger is not redundant and its scalability is uncertain. The database has not been meeting business requirements in terms of reliability or query performance. After an iteration investing in improvements to these two components, the system has evolved to the following.
The Data Logger has been significantly refactored and improved. It is now redundant, to support high-availability. The Legacy Data Logger has been entirely eliminated, with its functionality subsumed by the Data Logger. Two servers have been added to the database cluster, increasing the number of servers from three to five, and, in addition, each server has more computing resources in terms of memory, CPU, and disk. There remains some uncertainty and business risk with regard to the scalability of the database in the long-term. Additional investments will need to be made to address this.
The quality of the system is much improved from the original quality view. It is clear that future investments should be made in understanding the overall scalability of the data-processing pipeline and developing strategies for scaling it as the business needs grow. The legacy Data Mapping application also needs to be reimplemented in a more suitable framework. The characterization tests developed for it earlier will support this work. We might also investigate the possibility of eliminating the endpoint for legacy clients, to reduce development and operational costs and provide a better customer experience.
Observing these improvements in quality is satisfying and can fuel pride-of-workmanship within the team. I had a teammate remark to me, "I'm really interested to see how our quality view has improved with all of our recent changes!" Rather than just knowing that we've done a good job and made significant improvements, it is nice to be able to reflect on these milestones and highlight them to others.
My initial approach to using colours to represent quality was to use different shades of red and green, to represent poor quality and good quality, respectively. This highly-contrasted approach, however, was challenging to digest visually. In addition, it can be difficult for someone who is red-green colour-blind to interpret. I switched to using sequential shades after finding this excellent website for designing a colour scheme based on the nature of the underlying data. The sequential shades approach also has the advantage of working in grayscale, if one cannot use colour.
For the components that have a logo, like an open-source database, I use the logo, rather than text, to label the component. This makes the visualization even more rich, and also easier to interpret, since it can be digested symbolically. I did not use this approach here, however, as I wanted this presentation to be technology agnostic. One might also consider using colours for the directed line-segments to communicate the quality of the protocols, interfaces, or APIs used between services, but I have not yet experimented with this approach.
The risks category is useful for considering and highlighting business risks. Given that it is only one of nine criteria, however, it can misrepresent components that pose significant business risk. For example, a component that lacks security considerations, but is otherwise in good shape in terms of the code, testing, deployment, etc., could be seriously misrepresented. In these cases, it was useful to express the risks in other criteria as well, to produce a more realistic representation of the overall quality. For a component that lacks security considerations, I would represent the code, testing, deployment, monitoring, and alerting as all being of poor quality, since none of these elements include security considerations and will require investments. This would represent the overall quality as a 2, rather than a 7.
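The arithmetic behind those two scores can be checked with a short sketch. It assumes that the 7 comes from the component failing only the security and risks criteria, which is my reading of the example; the criterion identifiers and the helper name are hypothetical.

```python
CRITERIA = (
    "code", "tests", "deployment", "monitoring", "alerting",
    "security", "high_availability", "scalability", "risks",
)


def score_after_failures(failed):
    """Score a component as the number of criteria it does not fail."""
    failed = set(failed)
    unknown = failed - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return len(CRITERIA) - len(failed)


# Scored literally: only security and the risks criterion are failed.
as_scored = score_after_failures({"security", "risks"})

# Adjusted: the security gap is also charged against the five criteria that
# would each need investment to take security into account.
adjusted = score_after_failures(
    {"security", "risks", "code", "tests", "deployment", "monitoring", "alerting"}
)

print(as_scored, adjusted)  # 7 2
```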
When presenting a quality view for discussion, it can be useful to have a supplemental table, listing each component against the evaluation criteria, with more detailed notes, so that people can refer to it and understand the categorization of each component.
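One possible way to produce such a table is sketched below. The data layout and the function name are assumptions made for illustration. The two example notes paraphrase the decisions described earlier for the Data Mapping service; criteria without an entry are reported as not yet evaluated rather than given invented values.

```python
CRITERIA = (
    "code", "tests", "deployment", "monitoring", "alerting",
    "security", "high_availability", "scalability", "risks",
)


def render_table(evaluations):
    """Render one row per component and criterion as aligned plain text.

    evaluations maps a component name to {criterion: (met, note)}.
    """
    rows = [("Component", "Criterion", "Met", "Notes")]
    for component, results in evaluations.items():
        for criterion in CRITERIA:
            met, note = results.get(criterion, (False, "not yet evaluated"))
            rows.append((component, criterion, "yes" if met else "no", note))
    # Pad the first three columns to a common width; notes are free-form.
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    lines = []
    for row in rows:
        padded = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("  ".join(padded + [row[3]]).rstrip())
    return "\n".join(lines)


# Illustrative input only; a real table would cover every component and criterion.
example = {
    "Data Mapping service": {
        "tests": (True, "characterization tests only"),
        "monitoring": (False, "not retrofitted; risk accepted"),
    },
}
print(render_table(example))
```

Keeping the notes next to each yes/no decision makes it easier to explain, during the discussion, why a component ended up with a particular shade.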
I have presented an approach for developing quality views by animating a typical architecture diagram with a graduated colour scheme, to represent software quality and show changes to the system over time. This approach offers a very dense visual presentation of data. Quality views are somewhat subjective, but I have found them effective for representing the system holistically, describing where we are making investments, highlighting risks, and demonstrating how the system is evolving. They have been useful for communicating within our team, as well as externally, with both technical and non-technical stakeholders. They have been invaluable for aligning our mental models of the system.