Recently, I read an article on LinkedIn that captured something many experienced developers have been feeling: software development is changing rapidly in the age of generative AI, but not always in ways we fully understand. One quote especially resonated with me: 

“An MIT professor called AI ‘a brand new credit card that lets us accumulate technical debt in ways we were never able to before.’ That credit card now writes 41% of the code.”

Whether that number is accurate or not, and whether the prediction that AI will generate 100% of code by the end of the decade comes true, the point is clear: a large and growing portion of modern codebases is now generated with the help of AI tools. This has undeniable benefits: developers can prototype faster, explore ideas more quickly, and automate repetitive tasks that previously consumed valuable time.

But this acceleration introduces a structural challenge: we are producing code faster than we can reasonably understand and validate it. Some years ago, AI systems became so complex and opaque that we started to lose track of how they think. Now that those same systems are starting to write our code, we risk not fully understanding how that code works, nor how well it really works.

The question is no longer whether AI-generated code should be used, but how to maintain quality and reliability as the volume of generated code keeps increasing.

And increasingly, it raises a deeper question: if we get to the point that we need AI itself to review and explain AI-generated code, who watches the watchmen?

The Review Bottleneck

In theory, modern development workflows are designed to maintain quality through code review. Pull requests are reviewed by experienced engineers before changes are merged, to check that they align with the project’s standards and overall architecture, and to mentor more junior developers in those practices. Automated tests provide additional guarantees around stability, performance, and correctness.

In practice, the scale of AI-generated code is changing the dynamics of reviews.

Several studies suggest that average code reading speed is around 1000 lines of code per hour, assuming sustained concentration. That means reviewing a 1000+ SLOC pull request may already require close to an hour of focused work under ideal conditions.

Reviews rarely happen under ideal conditions: engineers are interrupted by meetings, messages, and other responsibilities. Context switching is constant. Maintaining full concentration for extended periods is difficult. And let’s be honest: code reviews are boring. They are mostly done to help our colleagues and to decompress between more intensive, focus-demanding creative tasks, usually with a cup of coffee in hand.

AI-assisted development changes the equation further: developers are now able to produce significantly more code than before. Personally, I have experienced periods where (with properly defined specs and a carefully crafted prompt) AI assistance allowed me to produce in minutes the equivalent of code that would have taken me days or weeks to write manually. This is a tremendous productivity boost, but the result is not necessarily a reduced workload. Instead, the nature of the work shifts from writing code to reviewing and validating increasingly large volumes of generated code.

This kind of review is cognitively demanding. It becomes easy to lose focus and overlook subtle issues, especially when reviewing large blocks of code that are technically correct but not particularly expressive or insightful.

Ironically, code generated by AI often looks clean and well structured, and frequently includes tests. Coverage numbers may look better than ever (yay!), yet this apparent order can hide deeper structural issues that only become visible over time.
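As a contrived illustration (both the function and the test below are hypothetical, written for this example): a generated test can execute every line of a function, producing perfect coverage, while asserting nothing that would actually catch a bug.

```python
def parse_price(text: str) -> float:
    """Parse a price string into a float.

    Subtle bug: stripping '-' silently drops the sign on negative inputs.
    """
    return float(text.strip().lstrip("-"))


def test_parse_price():
    # This test runs every line of parse_price (100% line coverage),
    # but it only checks the return type, so the sign-dropping bug
    # on inputs like "-19.99" goes completely unnoticed.
    assert isinstance(parse_price("-19.99"), float)
```

The coverage report says 100%; the behavior is still wrong. Review has to look at what the tests assert, not just at what they execute.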

The Technical Debt Explosion

Technical debt has always been part of software development. Teams often accept short-term compromises in exchange for faster delivery with the intention of addressing them later.

In practice, technical debt is rarely repaid as systematically as planned.

AI-assisted development introduces a new dimension to this dynamic: the rate at which codebases grow is increasing significantly. When code volume increases faster than architectural understanding, complexity grows as well. There’s nobody left with a comprehensive view of the system as a whole who can identify flaws and decide what to simplify. Everything is added, nothing is removed, and the codebase grows without a clear direction. This is a recipe for a perfect storm: technical debt accumulating at an unprecedented pace.

This does not mean AI-generated code is inherently bad. In many cases it is perfectly serviceable and sometimes excellent. The challenge is not individual code fragments, but the long-term evolution of entire systems.

One recurring pattern is the tendency for generated code to favor local solutions rather than integration with existing abstractions or external libraries. Over time this can lead to duplication and fragmentation across the codebase. Without deliberate architectural oversight, the cumulative effect may be an increase in complexity and maintenance cost: the same codebase ends up with multiple implementations of similar functionality that differ subtly in behavior, and in some cases incomplete implementations that miss failure cases already handled elsewhere.
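As a sketch of how tooling could surface this kind of drift, here is a minimal and deliberately naive Python example that fingerprints functions by their AST shape, with identifiers and constants masked, so that structurally similar implementations collide. The function names here are illustrative assumptions, not a production-grade clone detector.

```python
import ast
import copy
import hashlib
from collections import defaultdict


class _Masker(ast.NodeTransformer):
    """Replace names, argument names, and constants with placeholders."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="_"), node)


def fingerprint(func: ast.FunctionDef) -> str:
    """Hash a function's masked structure so near-duplicates match."""
    masked = _Masker().visit(copy.deepcopy(func))
    return hashlib.sha1(ast.dump(masked).encode()).hexdigest()


def find_duplicates(sources: dict[str, str]) -> list[list[str]]:
    """Group functions across files that share the same masked AST."""
    groups = defaultdict(list)
    for filename, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.FunctionDef):
                groups[fingerprint(node)].append(f"{filename}:{node.name}")
    return [names for names in groups.values() if len(names) > 1]
```

Running `find_duplicates` over two files that each grew their own “retry with backoff” helper would group them together. Real clone detectors (and, plausibly, future AI reviewers) use far more sophisticated similarity measures, but the principle is the same: this is mechanical work that should not depend on a tired human noticing the resemblance.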

Maintaining long-term system coherence requires something that cannot easily be generated automatically: contextual understanding of the system as a whole. That usually comes from having worked with the system first hand, or, in especially complex cases, from having designed it from scratch. If we fully delegate development to AI, we lose both.

The Limits of Human Review

Traditionally, experienced developers provided this system-level understanding. Senior engineers and software architects accumulated deep knowledge of the codebase and guided its evolution over time. They also transferred that knowledge to new contributors and mentored junior developers.

However, as AI tools take on a larger share of code production, the role of senior developers increasingly shifts toward reviewing generated output rather than building and evolving systems directly. This shift has subtle consequences.

Reviewing code is not the same as designing it. Architectural intuition develops through direct interaction with systems: writing code, refactoring it, and understanding its behavior over time. When engineers spend most of their time reviewing large volumes of generated code, maintaining that depth becomes more difficult.

At the same time, developers themselves are becoming dependent on AI-assisted workflows. Many of us have experienced how dramatically productivity can drop when those tools are unavailable. GitHub Copilot in VS Code, for example, has become an integral part of many developers’ workflows. When it launched, it was a game-changer for many: like having a pair programmer available at all times. The chat functionality has made it even more powerful, letting us ask for explanations, suggestions, and improvements on the fly, not to mention how good it is at generating code, especially for unit tests or with the latest models like GPT-5.3 Codex Max. But this also means that when those tools are unavailable, or when we fall back on older free models like GPT-4o or GPT-5 mini, we may struggle to maintain the same level of productivity and quality.

This dependency is not necessarily negative. Powerful tools have always reshaped software development. But it does highlight how central these systems have become to everyday work. The question is not whether AI will remain part of the development process (it clearly will), but how our practices must evolve so we know who is ultimately responsible for watching over the systems we build with it.

Rethinking the Development Pipeline

If AI-assisted development continues to accelerate, traditional workflows may no longer be sufficient to maintain system quality at scale. Manual review alone cannot scale indefinitely with code volume. Increasing the number of reviewers does not fully solve the problem if each reviewer faces the same cognitive limits.

A likely next step is the development of systems that make software evolution machine-readable and traceable by design. Instead of treating development history as a collection of commits and pull requests intended primarily for human consumption, future systems may represent changes as structured events that can be analyzed automatically.

Such systems could provide:

  1. Fine-grained traceability of how code evolves over time
  2. Machine-readable records of AI architectural decisions
  3. Automated detection of structural inconsistencies
  4. Continuous monitoring of technical debt accumulation
  5. Reproducible histories of system evolution
  6. Analysis of alternative implementations and their trade-offs over time
  7. Review and maintenance of legacy code, surfacing silent bugs that survived because “it worked”
  8. Refactoring and simplification proposals based on long-term trends
  9. Documentation that captures architectural intent and rationale
  10. Constant monitoring of code quality metrics and architectural consistency across the codebase
  11. AI-assisted identification of potential issues before they become critical, by analyzing patterns in code changes and their impact on system behavior

In such environments, AI systems would not only generate code, but also help supervise and evaluate it in a consistent and auditable way.
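To make this concrete, a structured change event could be as simple as a serializable record attached to every merge. The schema below is purely illustrative: a sketch of the idea, not an existing standard or tool, and every field name is an assumption.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class ChangeEvent:
    """A machine-readable record of one code change (illustrative schema)."""

    change_id: str
    timestamp: str                  # ISO 8601, UTC
    author: str                     # human login or AI agent identifier
    generated_by_ai: bool
    model: Optional[str]            # model that produced the diff, if any
    intent: str                     # what the change was meant to accomplish
    affected_modules: list[str]
    decisions: list[str] = field(default_factory=list)  # architectural choices

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Recording a (hypothetical) AI-generated change...
event = ChangeEvent(
    change_id="c-0042",
    timestamp="2025-01-15T10:30:00+00:00",
    author="copilot-agent",
    generated_by_ai=True,
    model="example-model",
    intent="Add retry logic to the signaling client",
    affected_modules=["signaling/client.py"],
    decisions=["Reused the existing backoff helper instead of adding a new one"],
)

# ...means later tooling can query the history mechanically, e.g. to audit
# which modules accumulate AI-authored changes without human review.
history = [json.loads(event.to_json())]
ai_touched = {m for e in history if e["generated_by_ai"]
              for m in e["affected_modules"]}
```

Because every field is structured rather than buried in a commit message, questions like “which architectural decisions did which model make last quarter?” become queries instead of archaeology.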

In short, such a system would not replace human judgment, as some current trends might suggest. Instead, it would free humans to focus on higher-level creative and reasoning work, supported by better-informed decisions, while low-level automated systems handle the repetitive tasks that nobody wants to do.

The Watchmen, Redefined

Maintaining long-term development quality will demand teams who not only understand how to prompt an AI model but also how to test, reason about, and simplify what it produces.

That’s where the craft of experienced engineers remains essential: people who think carefully about how to build sustainable systems in this new AI-driven reality. At WebRTC.ventures, and among developers who share this mindset, the goal isn’t just to accelerate delivery, but to ensure that the systems we create remain understandable, reliable, and maintainable as automation scales up. In the end, you still need humans to decide what “good” looks like.
