I am hoping to attend the 19th International Workshop on High Performance Transaction Systems (HPTS) in October 2022. I am sharing my one-page position statement and biography in the hope that it might inspire others, even if my application is unsuccessful. If you are motivated to work on these challenges, my team is hiring software engineers, systems reliability engineers, and program managers. My thanks to Percy Link for helping me improve this proposal.
The world is undergoing an urgent transition to renewable energy to address climate change, improve human health through cleaner sources of energy, and provide energy security in the face of unpredictable geopolitics, natural disasters, and other emergencies. But the cleanest forms of renewable energy, the sun and the wind, are intermittent. Periods of high solar and wind power generation do not necessarily align with periods of high demand, making energy storage a critical function to control. Moreover, wind, solar, and battery storage are massively distributed. Instead of a few central points of control (e.g., a hydroelectric or nuclear power plant), there are now millions of points of control. The transition of transportation to electric vehicles introduces millions more points of control. These distributed energy resources can help balance the grid under renewable intermittency, and even deliver superior customer experiences, if they can be flexibly and reliably controlled. These challenges must be solved with software, making software essential to our transition to renewable energy.
With millions of points of control and a mix of cloud and edge computing, the renewable energy grid is a massive, distributed computing platform with unique challenges. First, there is a tension between central control and local operation. Local control is required to respond rapidly to changing loads, coordinate among devices, and operate autonomously when isolated from the grid. Central control is required to meet systems-level objectives, like balancing supply and demand or communicating price and weather forecasts. Second, these systems are stateful and transactional. For example, message loss is unacceptable for billing electric vehicle charging or settlements for energy-market participation. In such a widely distributed system, knowledge is only eventually consistent, and there are many points of failure, introducing inevitable uncertainty among local and global controllers. Finally, these systems are critical infrastructure. They are operational technology (OT), rather than information technology (IT). If energy production, storage, or vehicle charging networks are disrupted, even for a brief period, the impact is severe. Beyond inconvenience, it could cause loss of life. Without power, nothing else works: communications, transportation, agriculture, healthcare, manufacturing. This raises unique systems reliability and cyber security challenges.
Perspectives I want to share:
- Motivate the most critical, challenging, and unsolved problems in large-scale distributed computing and the Internet of Things (IoT) through our transition to renewable energy.
- Share my experiences developing and operating cloud and edge software for the real-time monitoring and control of critical infrastructure, including at IoT scale.
- Promote the use of actor model programming for concurrency, distributed computing, and durably modelling the state of entities, devices, transactions, or workflows through digital twins.
- Reflect on the effectiveness of functional programming, immutable messaging, eventual consistency, streaming data, Reactive Streams, operationalizing machine learning, and a systems engineering approach to observability.
- Argue that asset management at scale is the hardest problem in distributed computing and IoT.
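To make the actor-model and digital-twin perspective above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any production system: the `BatteryTwin` and `Telemetry` names, the sequence-number rule, and the use of an `asyncio.Queue` as a mailbox are all hypothetical, and durable persistence of the twin's state is omitted.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Telemetry:
    """Immutable message: a state-of-charge reading from a hypothetical battery."""
    soc_percent: float
    seq: int  # sequence number assigned at the edge (an assumption of this sketch)


class BatteryTwin:
    """Hypothetical actor: owns its state and handles one message at a time."""

    def __init__(self, device_id: str) -> None:
        self.device_id = device_id
        self.soc_percent: Optional[float] = None
        self.last_seq = -1
        self._mailbox: asyncio.Queue = asyncio.Queue()

    def tell(self, msg: Optional[Telemetry]) -> None:
        """Fire-and-forget send; None asks the actor to stop."""
        self._mailbox.put_nowait(msg)

    async def run(self) -> None:
        while True:
            msg = await self._mailbox.get()
            if msg is None:
                break
            # Ignore duplicates and out-of-order readings so that replays
            # cannot move the twin's state backwards.
            if msg.seq > self.last_seq:
                self.soc_percent = msg.soc_percent
                self.last_seq = msg.seq


async def demo() -> BatteryTwin:
    twin = BatteryTwin("battery-001")
    task = asyncio.create_task(twin.run())
    # The third message is a late replay of seq 1 and should be ignored.
    for msg in (Telemetry(80.0, 1), Telemetry(79.5, 2), Telemetry(81.0, 1)):
        twin.tell(msg)
    twin.tell(None)
    await task
    return twin


twin = asyncio.run(demo())
```

Because only the actor's own loop touches its state, no locks are needed, and the immutable messages can be logged or replayed safely.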
Directions I am excited to explore:
- The Rust programming language, since it provides memory and thread safety, functional programming, and native performance. I am amazed C and C++ are still used so extensively, especially for edge computing (e.g., firmware).
- WebAssembly (wasm) and the ability to deploy small, binary code (e.g., pure, serverless functions; or moving code from cloud to edge), run it securely, and decouple business logic from infrastructure.
- Consequence-Driven Cyber-Informed Engineering (CCE), an approach to cyber security that assumes critical infrastructure will be compromised and uses consequence prioritization and engineering mitigations, rather than cyber hygiene, for securing critical infrastructure.
- Reliably and securely managing state between edge and cloud, including representing uncertainty.
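As one way to picture the last direction, representing uncertainty in state shared between edge and cloud, the following Python sketch attaches an observation time to each value and derives a confidence level from its age. The `Observed` type, the `Confidence` levels, and the 60-second threshold are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Confidence(Enum):
    FRESH = "fresh"      # observed recently enough to act on
    STALE = "stale"      # last known value, but possibly out of date
    UNKNOWN = "unknown"  # never observed


@dataclass(frozen=True)
class Observed:
    """A cloud-side view of an edge value, with the time it was observed."""
    value: Optional[float]
    observed_at: Optional[float]  # seconds since epoch, as reported by the edge

    def confidence(self, now: float, max_age_s: float = 60.0) -> Confidence:
        # max_age_s is a hypothetical freshness threshold for this sketch.
        if self.value is None or self.observed_at is None:
            return Confidence.UNKNOWN
        if now - self.observed_at <= max_age_s:
            return Confidence.FRESH
        return Confidence.STALE


reading = Observed(value=42.0, observed_at=1000.0)
never_seen = Observed(value=None, observed_at=None)
```

A central controller could then treat `STALE` or `UNKNOWN` values differently from `FRESH` ones, for example by falling back to a conservative default rather than assuming the last reported value still holds.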
I lead the cloud platforms organization for Tesla Energy, developing real-time services and critical infrastructure for power generation, battery storage, vehicle charging, and grid services. Over the past five years, I have seen these platforms grow from their infancy to become the largest and most integrated platforms for distributed, renewable energy in the world. My presentation, "Tesla Virtual Power Plant," details the architecture and technologies used in this large-scale, distributed IoT platform, including some of the unique grid services it can provide.
Before joining Tesla, I worked at OSIsoft developing real-time infrastructures for the monitoring and control of industrial applications. This included developing a distributed time-series database that could scale to millions of series and millions of transactions per second, as well as publish-subscribe services for messaging among distributed, event-based applications. This software is widely used for critical operations in the process industries, in manufacturing, and in power generation, transmission, and distribution.