Human on the Loop

What AI’s Deployment at Scale Tells Us About the Future of Work — and What to Do About It

Jim


The Leading Edge

The clearest signal about where AI is going comes from where it's already operating under maximum pressure: military targeting. In recent operations, AI systems have enabled strike tempos that would have been physically impossible five years ago — not by automating the decision, but by compressing the time between sensing, reasoning, and action from hours to minutes. The systems involved are not exotic military prototypes. They are built on the same frontier AI models used by office workers, doctors, and students.

This matters because militaries are not early adopters in the casual sense. They deploy at scale only after operational logic is established. What we are watching is not experimentation — it is standardization of a new decision-making architecture. That architecture is already migrating into finance, healthcare, hiring, content moderation, and every other domain where decisions are made at volume. The military application makes visible, in concentrated form, dynamics that are arriving everywhere.


The Oversight Arithmetic

The central structural fact about AI at scale is simple and almost never stated plainly: entities cannot simultaneously maximize AI-driven throughput and maintain deliberative human oversight. Above a certain decision volume, these are mutually exclusive.

The arithmetic is not complicated. Deliberative human review — the kind that catches errors, applies context, and exercises genuine judgment — runs at tens of decisions per day per reviewer. AI systems operate at thousands or millions. Any organization claiming “human oversight” of high-throughput AI decisions is either operating on sampled review (examining a fraction of decisions, not each one) or the claim is not accurate. This is not a criticism of intent. It is a mathematical constraint.
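The constraint can be made concrete with a back-of-envelope sketch. The specific numbers here (50 deliberative reviews per day per person, a million AI decisions per day) are illustrative assumptions, not measurements:

```python
# Back-of-envelope sketch of the oversight arithmetic.
# All rates are illustrative assumptions, not measurements.

REVIEWS_PER_DAY_PER_HUMAN = 50        # deliberative reviews one person can do
AI_DECISIONS_PER_DAY = 1_000_000      # throughput of a high-volume AI system

def reviewers_needed(decisions: int, capacity: int) -> int:
    """Humans required for true human-in-the-loop (one review per decision)."""
    return -(-decisions // capacity)  # ceiling division

def review_coverage(reviewers: int, decisions: int, capacity: int) -> float:
    """Fraction of decisions a fixed review team can actually examine."""
    return min(1.0, reviewers * capacity / decisions)

print(reviewers_needed(AI_DECISIONS_PER_DAY, REVIEWS_PER_DAY_PER_HUMAN))   # 20000
print(review_coverage(10, AI_DECISIONS_PER_DAY, REVIEWS_PER_DAY_PER_HUMAN))  # 0.0005
```

At these assumed rates, one-for-one review of a million daily decisions would require 20,000 full-time reviewers, and a ten-person team covers 0.05% of decisions. The exact figures move with the assumptions; the gap does not close.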

The distinction that matters is between human-in-the-loop and human-on-the-loop. Human-in-the-loop means a human must authorize each decision before execution. Human-on-the-loop means a human monitors the system and can intervene — but the system is running, decisions are executing, and the human is watching, not authorizing. Above the throughput threshold, human-on-the-loop is what you have, regardless of what the policy says.

This produces a specific failure mode: errors that are confident, fluent, and structurally plausible. A wrong output that looks right. The scale and speed of the system mean those errors accumulate before any review mechanism detects them, and by the time the pattern is visible, the downstream effects are already in place. We have seen this produce tragedies. We will see it again. The question is not whether errors occur — they will — but whether the systems and practices around AI deployment are designed with that reality in mind.


What AI Actually Is

The most common framing of AI — as automation, as a tool that replaces specific tasks — is wrong in a way that leads to wrong decisions about how to use it.

The electricity analogy is closer. Electricity was not a better steam engine. It was a general-purpose medium that changed what was possible across every domain it touched. The factories that won the electrical era were not the ones that replaced their steam engines with electric motors in the same physical layout. They were the ones that redesigned the floor plan entirely around what the new medium made possible — distributing power to individual workstations, enabling configurations that steam could never reach. The transformation was architectural, not mechanical.

AI is similarly architectural. It doesn’t replace specific functions — it changes the decision-making and reasoning infrastructure underlying all functions. Organizations trying to deploy it as a direct task substitute are making the equivalent of the steam-engine swap. They will get incremental gains and miss the transformation.

The electricity analogy has a limit worth naming, because the limit is important. Electricity’s failure modes are physically legible — the light is on or it’s off, the current kills you or it doesn’t. When it fails, you know it failed. Static corrupts data in ways you can detect. Even lightning, which is genuinely stochastic and occasionally strikes twice, produces visible results.

AI’s distinctive failure is that the output looks like it worked. The response is fluent, confident, well-structured — and wrong. This failure mode does not exist in electricity, and it requires a different set of intuitions to manage. Three other analogies capture dimensions that electricity misses.

Language is the strongest. Language is itself a general-purpose reasoning and communication medium, and it has been infrastructure-transformative multiple times — speech, writing, print, digital text. It produces confident plausible errors constantly: miscommunication looks like communication until someone with domain knowledge detects the gap. Expert users of language in a domain use it more precisely and catch failures that novices cannot see. And errors compound in chains — the telephone game is exactly what happens when stochastic outputs feed stochastic outputs. Everyone already understands this intuitively. AI is language, formalized and accelerated.

Early scientific instrumentation captures something else. Early microscopes and telescopes produced artifacts — things that appeared in the visual field that were not in the sample or the sky. Expert users learned to distinguish instrument artifact from real signal. Novices could not. The instruments extended expert reach dramatically, but required calibrated expertise to interpret correctly — without it, they produced confident false observations. The AI failure mode is precisely this: the instrument shows you something, and you need domain knowledge to know whether you are seeing signal or artifact.

The printing press captures the institutional dimension. It democratized information production at scale, could not guarantee the truth of what it printed, required new literacies to navigate, and forced institutional responses — editorial standards, peer review, eventually formal publishing norms — that were post-hoc frameworks built to accommodate capability that already existed. Those frameworks did not stop the press. They structured how it was used productively. That is what is happening with AI now.


Entropy Is Structural

There is a property of current AI architecture that is underappreciated because it is often confused with quality problems — bias, hallucination, accuracy. These are real but separate. The deeper issue is structural.

AI systems generate outputs stochastically — meaning each output carries irreducible uncertainty, not because the model is poorly built, but because stochastic sampling is the generation mechanism. This is not a bug. Remove it and you remove the capability.

The critical implication is what happens when outputs are chained — when one AI output feeds another, or when an AI-generated decision becomes the input to the next stage of a process. Uncertainty compounds multiplicatively, not additively. A system with 95% output confidence feeding a downstream system with 95% confidence does not produce 95% end-state reliability. It produces roughly 90% (0.95 × 0.95 ≈ 0.90), and that is before any data quality or model issues enter. A five-stage agentic pipeline of individually 95%-confident nodes lands near 77% (0.95⁵ ≈ 0.77): genuinely unreliable end outputs. The telephone game again: each step is plausible, the cumulative drift is substantial.
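The compounding reduces to a one-line formula: end-to-end reliability is per-stage reliability raised to the number of stages, assuming stages fail independently. A minimal sketch, with 95% as an illustrative per-stage figure:

```python
# Multiplicative compounding of per-stage reliability in a chained pipeline.
# Assumes independent failures; correlated failures can be worse.

def end_to_end_reliability(per_stage: float, stages: int) -> float:
    """Probability that every stage in the chain produces a correct output."""
    return per_stage ** stages

print(end_to_end_reliability(0.95, 2))  # ~0.9025
print(end_to_end_reliability(0.95, 5))  # ~0.774
```

Note the independence assumption: chained AI systems often share training data or model families, so real-world degradation can be worse than the formula suggests.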

This is an architectural reality of the current generation of AI systems. Changing the architecture may shift where the entropy is generated but cannot eliminate it — the stochasticity is the generation mechanism. It is not solvable by adding more model layers or more AI review steps. More model layers add more stochastic nodes. The error compounds further.

One approach that partially addresses this is adversarial agent design: a system specifically optimized to find errors in the outputs of another system, with genuine opposing incentives rather than collaborative ones. This is architecturally different from a critic model that evaluates quality. The discriminator in generative adversarial networks worked precisely because it had incentive to find failures, not assess success. The equivalent for reasoning systems is underbuilt in enterprise contexts but theoretically sound. Where it is implemented rigorously — with genuinely different training and architecture, not the same model family critiquing itself — it reduces compounding error. It does not eliminate it.
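The shape of the pattern can be shown with a toy example. Everything here is hypothetical: the "generator" is a stochastic solver with a deliberate error rate, and the checker is a perfect independent recomputation, a luxury real reasoning systems do not have. The point is the structure, not the check itself: the reviewer is rewarded only for finding failures, and nothing ships until it survives that review.

```python
# Toy illustration of adversarial review. Names and structure are hypothetical;
# the checker here is exact, which real reasoning systems cannot assume.

import random

def generator(a: int, b: int) -> int:
    """A stochastic 'solver' that is usually, but not always, right."""
    result = a + b
    if random.random() < 0.1:          # 10% chance of a plausible off-by-one error
        result += random.choice([-1, 1])
    return result

def adversarial_checker(a: int, b: int, claimed: int) -> bool:
    """Independently recomputes; returns True only when it finds a failure."""
    return claimed != a + b

def reviewed_answer(a: int, b: int, max_attempts: int = 5) -> int:
    """Only release an output that survives adversarial review."""
    for _ in range(max_attempts):
        claim = generator(a, b)
        if not adversarial_checker(a, b, claim):
            return claim               # passed adversarial review
    raise RuntimeError("no output survived review")

random.seed(0)
assert reviewed_answer(2, 3) == 5
```

The design choice worth noticing is that the checker shares nothing with the generator except the problem statement. When the reviewer is the same model family critiquing itself, its blind spots are correlated with the generator's, and the structure collapses back into a quality critic.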


Where the Real Value Is

Given the throughput-oversight constraint and the structural entropy problem, the deployment model that reliably produces high-quality output is straightforward: one expert, working with AI tools they have personally calibrated to their domain.

This is not a limitation to be engineered around. It is the correct unit of analysis for current AI capability. The expert provides what the model cannot: they know what a wrong answer looks like in their domain, they can detect confident nonsense, they understand which problem the model is actually solving versus the one they asked it to solve, and they catch the instrument artifacts that novices mistake for signal. The AI extends their reach — more literature surveyed, faster synthesis, lower-friction drafting, broader pattern search — without removing the error-correction mechanism that expertise provides.

The productivity math is direct. An expert with AI produces substantially more trustworthy output from one person. A non-expert with AI at scale produces high volume, with quality degradation that is hard to measure until it is systemic, because the errors are plausible. The compounding entropy problem means that adding more AI layers to check AI outputs does not solve this. It requires the expert at the output boundary.

This does not mean AI is only useful for experts. It means that the reliable value — the deployments where you can trust the output — currently lives at the expert-AI interface. Everything else requires structural acknowledgment that outputs need human judgment applied, and design of systems that make that judgment possible rather than nominal.

We can watch this play out in real time in software development, where AI adoption is both earlier and denser than almost any other field. The initial wave was generation — autocomplete, code suggestion, automated drafting. The field has since pivoted, visibly and measurably, toward review. AI-assisted code generation created a volume of output that exceeded human capacity to verify, and practice adapted to meet the actual constraint: not how to produce more code, but how to ensure what gets produced is trustworthy before it ships. The tools, workflows, and professional emphasis are now reorganizing around that problem. This is not a planned transition — it is the natural and inevitable correction when one methodology proves unproductive and another proves necessary. Every field will make this correction. Software is showing what it looks like when it happens fast.


What Comes Next: Practices, Codification, and Differentiation

Here is where the electricity parallel becomes most instructive — and most useful for anyone thinking about their own work.

Electricity did not arrive as a finished system with established practices. It arrived as a capability, and everything else — electrical engineering as a profession, wiring codes, safety standards, the design of electrical infrastructure — emerged afterward, built by people working out what it meant to use the medium well. Those practices then became codified, and eventually enforced. But throughout that process, the people who were developing the practices first — who understood the medium deeply and were building their own methods for using it safely and productively — consistently outperformed those waiting for the codification to tell them what to do.

The same dynamic is forming around AI now. Practices are emerging. Some will be codified into professional standards, regulatory requirements, and organizational policy. Many already are in early form. But the codified version will always lag the frontier, and the frontier belongs to practitioners who are actively working out, for their specific domain, what the medium makes possible and what it requires.

Two dimensions of this differentiation matter. The first is which tools — the specific AI configurations, prompting structures, workflow integrations, and adversarial checks that a practitioner develops for their domain. The second, and less discussed, is how they use them — the judgment about when to trust output, when to push back, what kinds of problems the model actually solves well versus where it produces confident artifact. Both dimensions are individually developed. Neither transfers cleanly through a policy document.

This means the practitioners who will define best practice in every field are developing it now, through deliberate experimentation, calibration, and honest assessment of where the instrument shows signal and where it shows artifact. That work is not exotic. It is methodical. It looks like asking, for every AI-assisted output: does this look right because it is right, or because it is fluent? Building the intuition to answer that question, in a specific domain, with specific tools — that is the practice being built. And it will differentiate performance in every knowledge-intensive field over the next decade.

The question worth sitting with is not whether AI will change your field. It will, as electricity changed every field it touched. The question is whether you are developing, deliberately, the practices that will let you use it well — or waiting for someone else to tell you what to do.


Start Here

The most direct path into this is also the most personal: take something you already know well — a subject you’ve studied, a domain you work in, a topic you follow closely — and start using AI to engage with it. Not to replace what you know, but to extend it. Ask hard questions in your area of expertise. Push on the answers. Notice where the output is sharp and where it is plausible but wrong. Notice what you had to know to tell the difference.

This is the calibration process. It cannot be shortcut, and it cannot be transferred from someone else’s experience. But it moves faster than you expect, because expertise is exactly what makes the errors visible. The friction you encounter — the moments where the tool falls short, surprises you, or requires correction — is not failure. It is information about where the boundary is. That boundary is what you are mapping.

Everyone who does this work deliberately, in the domain they know, is building something that will compound. The tools will improve. The practices built around them will not become obsolete — they will become the foundation for using better tools better. That is what it looked like to develop fluency with any general-purpose medium at the moment it arrived. This is that moment.