Adapting to AI: Code Review

In an earlier essay, I shared my reflections on the rapid transformation of software engineering by AI. In this essay, I will explore specific challenges related to code review.

Code review has been a core part of software engineering for decades, used to refine design, improve quality, share information, document changes, and satisfy regulatory compliance. AI systems can now rapidly produce far more code than any individual or team can review, and they can work in parallel, around the clock, while we sleep. There will never be enough hours in the day for a human to even attempt to review all the code AI writes. This is a major shift, and it is disruptive.

It is inevitable that the new challenges introduced by AI authoring the majority of software will also be solved by AI.

Compliance

Most organizations use code review, including the review of software tests and documentation, as a cornerstone to meet legal requirements, audit frameworks, or standards, like ISO, IEC, NIST, SOC, HIPAA, CRA, MISRA, or your acronym of choice. This means code cannot be incorporated into the product or deployed to production without someone reviewing it—all of it. This creates a difficult situation at the moment: AI tools are helping the organization move much faster, but the business is ultimately constrained by the rate at which humans can review code.

I expect audits will eventually change from an auditor manually verifying that testing and code review were adequately performed, to simply verifying that a set of AI tools were used to audit code, configuration, and tests. I expect many of these standards and frameworks will evolve to require specific AI models as controls that will ultimately provide significantly more standardization and rigor than exists today. But for now, we remain directly accountable for software we author and review.

What can we do in the meantime? We need to work in logical increments that are still reasonable for humans to review. Pull requests with fifty thousand lines changed are simply not practical to review.[1] We can use AI to help review code, running automatically as part of the pull request. AI is very good at triaging issues and spotting inconsistencies. People can also use AI to ask questions about the code they are reviewing to understand it better. Finally, I think pairing, or working as a team, to agree on requirements and prompt the AI together to shape the end result makes it much easier to share context and ultimately review the code.

Knowledge Sharing

Another important function of code review is to share knowledge in a team. How are systems changing? What new tools or techniques are being adopted? What is being deprecated? What practices and standards are we agreeing on?

I know teams where everyone on the team is expected to review all pull requests, even ones that were merged while they were on vacation, to ensure everyone understands the code and how the software is evolving. Reviewing every pull request on large teams was always difficult, but at the rate AI can produce work, it is also becoming difficult on small teams. Again, AI is a great tool for asking questions about pull requests and explaining them, so it can help summarize and keep you up to date with changes, or help build your mental model during production incidents. It even feels less daunting to lose someone on a team without a thorough knowledge transfer, or to inherit a legacy codebase. AI is great for research and discovery of existing code, documentation, and tests—or lack thereof.

I don't know of a team that has landed on a perfect process for knowledge sharing, and I expect teams to continue experimenting as AI tools keep changing. I know of teams storing specifications and prompts in source control, in addition to code and tests. Smaller teams seem to be adapting better than larger ones. And despite the new tools, nothing seems to beat regularly meeting as a team to talk through design choices and trade-offs.[2]

Quality

Code review is used to improve software quality. It is used to ensure conventions are followed, comprehensive testing is included, edge cases have been evaluated, and alternative approaches have been considered. Agents are now very good at all these things, and they are only getting better.

Code review is often used to agree on the right abstractions. Could this code be simpler? Could it be more testable? Could it be more general and reused elsewhere? Should it be in a shared library? The current AI models sometimes explicitly satisfy constraints while missing abstractions we would consider good engineering practice. When reviewing code, we can prompt the AI to tease out these abstractions, but it makes me wonder: if AIs are going to maintain the code, not us, and the software meets the requirements in terms of cost, performance, reliability, security, and functionality, maybe we will care less about abstractions? Perhaps concrete implementations with fewer dependencies are even more straightforward? With code being faster and easier to generate, will we be less concerned with code reuse and abstraction?[3]

Generally, techniques like formal verification, model checking, undefined behavior analysis, fuzzing, property testing, and deterministic simulation testing are only used by elite software teams.[4] The average software organization has not invested in these techniques, and I'd venture the average software engineer can't articulate the value they provide. Traditionally, these techniques for improving software quality, security, and correctness required a significant investment, but AI is now lowering the cost. I have even seen AI improve the quality of documentation and testing from teams who paid less attention to these practices in the past, just because it is now easier to include documentation and testing as part of the work.[5]
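To make property testing concrete, here is a minimal stdlib-only sketch, not tied to any particular framework such as Hypothesis. The idea is to generate many random inputs from a fixed seed and check that an invariant, here an encode/decode round trip, holds for every one of them. The `encode` and `decode` functions are hypothetical examples, not from any real codebase.

```python
import random

def encode(values):
    """Encode a list of non-negative ints as a comma-separated string."""
    return ",".join(str(v) for v in values)

def decode(text):
    """Inverse of encode; the empty string decodes to the empty list."""
    if text == "":
        return []
    return [int(part) for part in text.split(",")]

def check_roundtrip_property(trials=1000, seed=42):
    # A fixed seed makes every failing input reproducible.
    rng = random.Random(seed)
    for _ in range(trials):
        values = [rng.randrange(10**6) for _ in range(rng.randrange(20))]
        # The property under test: decoding an encoding is the identity.
        assert decode(encode(values)) == values, values
    return trials

print(check_roundtrip_property())  # prints 1000
```

A property like this gives an AI (or a human) a mechanical definition of "correct" that covers far more inputs than a handful of hand-written unit tests.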

A challenge with AI is defining a specification it can use to autonomously verify its work. Rather than a traditional code review, where code arrives with unit and integration tests, there is a tremendous opportunity to guide the AI with formal verification, deterministic simulation tests, and more, to deliver code of much higher quality than most organizations have ever been capable of. It used to feel like investing in these practices would slow teams down, at least in the short term, but now it may be essential to speeding them up.

To deal with the probabilistic nature of AI, we should focus on determinism to define success. With this rigor, the AI won't just write the code you asked for—it will autonomously write the code that satisfies the formal specification. I'm very interested to see where this goes and I'm eagerly following people experimenting with these techniques.[6]
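To illustrate what focusing on determinism can look like, here is a toy deterministic simulation test, a hand-rolled sketch in the spirit of the technique rather than any real framework. All fault injection (message and ack loss) flows from a single seed, so any invariant violation replays exactly; the scenario itself, a counter server behind a lossy network, is hypothetical.

```python
import random

def simulate(seed, clients=5, increments=20, drop_rate=0.3):
    """Simulate clients incrementing a counter over a lossy network.

    Requests carry unique IDs so the server deduplicates retries,
    turning at-least-once delivery into exactly-once application.
    The seed fully determines message loss, so failures are replayable.
    """
    rng = random.Random(seed)
    counter = 0
    seen = set()
    for client in range(clients):
        for i in range(increments):
            request_id = (client, i)
            delivered = False
            while not delivered:
                if rng.random() >= drop_rate:      # request reaches server
                    if request_id not in seen:     # deduplicate retries
                        seen.add(request_id)
                        counter += 1
                    if rng.random() >= drop_rate:  # ack reaches client
                        delivered = True
                # otherwise the client times out and retries
    return counter

# Invariant: regardless of the fault schedule, every increment applies once.
for seed in range(100):
    assert simulate(seed) == 5 * 20
print("all seeds pass")
```

Because the whole run is a pure function of the seed, a failing seed is a complete, shareable bug report, which is exactly the kind of feedback loop an autonomous coding agent can use to verify its own work.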

Open Source

When I speak with maintainers of popular open-source projects, they all tell me they can no longer keep up with the rate of AI-generated contributions. They simply can't review this much code. For the foreseeable future, there may be more open-source software than ever, but the most popular projects will be maintained by a core group who hold the conceptual integrity of the project and are open to feature requests, but not outside contributions.

Security

Finally, reviewing code from a security perspective is one of the most valuable aspects of code review, especially in regulated industries or critical infrastructure. While security review by humans is not going away just yet, I have seen great value from AI tools in identifying and triaging critical vulnerabilities. Furthermore, instead of performing specialized security evaluations, like penetration testing, only occasionally, AI can be used to run these evaluations continuously. Google, Anthropic, and others are starting to develop specialized security models.[7]

I expect security review by a single agent or model will be unacceptable. Models will have limitations and there will inevitably be attacks, like prompt-injection, used to influence the outcomes. Organizations will perform security review using multiple models, from different vendors, running under a variety of configurations. I also expect some security evaluations to happen in hermetic environments with static models. This will ensure agents cannot reach out to the internet, or other compromised enterprise services, and be influenced, or socially engineered, by an attacker.[8]
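As a sketch of what aggregating reviews from multiple models might look like, the snippet below accepts only findings that at least two independently configured models agree on. The vendor names and finding identifiers are hypothetical; in practice each entry would come from a separate model API call.

```python
from collections import Counter

def consensus_findings(reviews, quorum=2):
    """Keep only findings reported by at least `quorum` reviewing models.

    Requiring independent agreement reduces single-model blind spots
    and limits the blast radius of a prompt-injected or compromised
    reviewer, at the cost of possibly dropping findings only one
    model is capable of spotting.
    """
    counts = Counter()
    for findings in reviews.values():
        counts.update(set(findings))  # one vote per model per finding
    return sorted(f for f, n in counts.items() if n >= quorum)

# Hypothetical output from three independently configured models.
reviews = {
    "vendor_a": ["sql-injection:login.py:42", "hardcoded-secret:config.py:7"],
    "vendor_b": ["sql-injection:login.py:42"],
    "vendor_c": ["hardcoded-secret:config.py:7", "open-redirect:auth.py:19"],
}
print(consensus_findings(reviews))
# → ['hardcoded-secret:config.py:7', 'sql-injection:login.py:42']
```

A real pipeline would also need to normalize findings across vendors before counting votes, and might route below-quorum findings to a human queue rather than discard them.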

Even with great AI security review, for the foreseeable future, we will want human review of important things like who is authorized to do what in the system. It will be important to build in a small number of choke points, like the authorization system, or data governance, where human review remains effective. Furthermore, defense in depth has always been important, but for systems security in a world where discovery, development, and exploitation are all accelerated by AI, it becomes by far the most important thing to review.[9]

In Review

Code review remains an important part of the software development process, but AI is rapidly changing it. For organizations bound by regulatory compliance, a human is still required to review every change that makes its way into the product or service. This becomes a bottleneck, and it won't change until the laws, regulations, and standards catch up with AI tools. However, there is also an opportunity to use AI to greatly improve the quality of software by reducing the cost of investments in methods like formal verification, model checking, and deterministic simulation testing. More regular and rigorous security reviews will also improve quality. We need to actively shape industry standards and regulation in this direction. If you work in the software industry and you are not familiar with these methods, it is time to learn them. They won't just be the way code is reviewed and quality is ensured—they will likely be how we instruct AI to generate correct software.


  1. Making smaller pull requests is not necessarily the solution, since AIs can generate small pull requests so quickly that people would still be the bottleneck. Furthermore, we want AIs to be working constantly, overnight and on the weekends. Inevitably, they will have to work in large increments. This is why AI review becomes essential. ↩︎

  2. Joran Dirk Greef told me how the TigerBeetle team sometimes does walking design reviews where everyone is on the phone, listening and holding the design in their head, while also walking. The movement and mental focus, undistracted by slides or body language, likely helps with concentration, active listening, and creativity. It reminds me of why Indi Young likes to do what she calls listening sessions over the phone so she is not distracted by facial expressions or body language. I wrote about this in An Interview as a Listening Session. ↩︎

  3. When I first started programming professionally, most of it was in the C programming language. Because C didn't support templates—or generics, as they are called in some languages—people would have to implement the same data structures, like a linked list, for each data type. This was tedious to maintain, which is why templates in C++ were such a popular feature. A recent paper entitled Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics found that AI could more effectively refactor Python code that was well-factored and modular, suggesting human-maintainable code is also superior for AI, at least with the current models. ↩︎

  4. AWS has made extensive use of automated reasoning. For example, see How automated reasoning helps Amazon S3 innovate at scale, AWS team wins best-paper award for work on automated reasoning, and An unexpected discovery: Automated reasoning often makes systems more efficient and easier to maintain. Mai-Lan Tomsen Bukovec discussed How AWS S3 is Built on The Pragmatic Engineer Podcast. TigerBeetle is well known for deterministic simulation testing. See the blog A Descent Into the Vörtex, and the talk TigerStyle! (Or How To Design Safer Systems in Less Time) by Joran Dirk Greef. For some of my thoughts on software reliability before the widespread use of AI, see my two essays On Eliminating Error in Distributed Software Systems and On Embracing Error in Distributed Software Systems. ↩︎

  5. I'm still surprised when I encounter people who think software testing—or at least integration testing or systems testing—should be someone else's job. With the importance of letting AI verify its work, it becomes even more important to have automated integration tests. For anyone still unconvinced of the value of continuous integration, I encourage you to read Dave Farley's excellent book "Modern Software Engineering". I think these practices are only more valuable in the age of AI. ↩︎

  6. Debasish Ghosh has been experimenting with both formal specifications and deterministic simulation testing in combination with agentic coding. See this tweet for more information. Listen to this podcast with Will Wilson from Antithesis (a company that provides a special hypervisor for deterministic simulation testing) for an in-depth discussion of deterministic testing techniques and how they are changing with AI: Why Testing Is Hard and How to Fix It. ↩︎

  7. Google's Big Sleep, a framework for LLM-assisted vulnerability research, found an exploitable SQLite vulnerability that 150 CPU-hours of fuzzing missed. ↩︎

  8. This will be somewhat similar to how some organizations perform reproducible software builds today to ensure their build infrastructure has not been compromised. See the S4 talk SUNBURST From The Inside and the AWS blog Establishing verifiable security: Reproducible builds and AWS Nitro Enclaves for more information. ↩︎

  9. OT networks, which include ICS, SCADA, PLCs, and related systems in manufacturing, energy, utilities, and critical infrastructure, often avoid widespread use of end-to-end encryption along with other defense-in-depth techniques common in IT environments. These environments have favored physical security, reliable operations (e.g., no certificate expiration), and the ability to inspect the unencrypted traffic. I wonder if the ease with which AI can exploit such environments will shift the focus to more IT-like defense in depth. If nothing else, I expect more of these systems to be air-gapped. See the following for a perspective from Adam Crain on securing DNP3, network segmentation, and application-level authorization: The Case Against DNP3 SAv6 and AMP. ↩︎