AI Returns and the Judgment Constraint: Evidence, Mechanism, and Structural Limits


Thesis: AI systems function as pattern-distribution mechanisms. Their economic returns are determined by the judgment requirement of the work they are applied to. Where judgment is low and patterns are well-represented in training data, returns are measurable and real. As the judgment requirement increases, returns fall toward zero and can turn negative, depending on the cost of applying that judgment. AI net return is the human production cost minus the combined cost of AI generation, verification, and expected error, all of which scale with the judgment required to apply the output productively.

Net AI Return = Human Production Cost − (AI Generation Cost + Verification Cost + Error Cost)

This boundary is structural, not a product of current model limitations that future improvements will resolve.
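
The net-return identity above can be made concrete with a short calculation. The following is a minimal Python sketch, not a calibrated model: the `TaskCosts` type, the function name, and every number are hypothetical illustrations and are not drawn from any of the studies discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCosts:
    """Per-task costs in one common unit (e.g. minutes). Hypothetical container."""
    human_production: float  # cost of a human producing the output unaided
    ai_generation: float     # cost of producing the draft with AI
    verification: float      # cost of checking the AI output
    expected_error: float    # probability-weighted cost of errors that slip through


def net_ai_return(costs: TaskCosts) -> float:
    """Net AI Return = Human Production Cost - (Generation + Verification + Error)."""
    return costs.human_production - (
        costs.ai_generation + costs.verification + costs.expected_error
    )


# Illustrative values only: same human cost, different judgment requirement.
low_judgment = TaskCosts(human_production=60, ai_generation=2, verification=10, expected_error=3)
high_judgment = TaskCosts(human_production=60, ai_generation=2, verification=50, expected_error=20)

print(net_ai_return(low_judgment))   # 45
print(net_ai_return(high_judgment))  # -12
```

In this sketch the sign flips only because verification and expected error costs rise; the human production cost and the generation cost are held fixed.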


The Evidence

The strongest available work consists of randomized controlled trials and large-scale field studies conducted between 2023 and 2026 across software development, customer support, and professional knowledge work. The following findings are drawn from peer-reviewed publications and working papers from NBER and comparable institutions.

Customer support. Brynjolfsson, Li, and Raymond studied 5,179 customer support agents at a Fortune 500 software firm through a staggered AI deployment (Quarterly Journal of Economics, 2025). Productivity, measured as issues resolved per hour, increased 14% on average. Novice and low-skilled workers gained 34%; experienced and highly skilled workers gained near zero. The mechanism: the AI encoded best practices from high-performing agents and made them accessible to workers who hadn’t had time to acquire them through normal experience accumulation.

Software development: task completion. Cui, Demirer, Jaffe, Musolff, Peng, and Salz ran three RCTs across Microsoft, Accenture, and a Fortune 100 firm with 4,867 developers using GitHub Copilot (Management Science, 2026). Completed tasks increased 26.08%. Less experienced developers showed higher adoption rates and larger gains. The same skill-inversion as Brynjolfsson: the workers who gain most are those with the most to receive from pattern distribution.

Software development: pipeline attenuation. Demirer, Musolff, and Yang tracked more than 100,000 GitHub developers matched with AI usage telemetry across three generations of tools: autocomplete, interactive agents, and autonomous agents (NBER Working Paper 35275, May 2026). Autonomous agents increased lines of code by 741% and commits by 180%. Across the production hierarchy, those gains attenuated to 50% for completed projects and 30% for actual releases. Across four major app marketplaces, new app releases rose; total app usage did not. The estimated elasticity of substitution between AI and human stages in the production chain is 0.25 — the stages are strong complements, not substitutes. Each downstream stage where human judgment re-enters discounts the upstream AI gain.
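
The attenuation can be read off the reported figures with simple arithmetic. The sketch below uses only the percentages cited in the paragraph above; the helper name is hypothetical, and dividing percentage gains measured in different units is a rough summary, not an estimate reported by the paper.

```python
# Reported gains from autonomous agents, as cited in the text (percent increases).
reported_gain_pct = {
    "lines_of_code": 741.0,
    "commits": 180.0,
    "completed_projects": 50.0,
    "releases": 30.0,
}


def surviving_share(gains: dict[str, float], upstream: str) -> dict[str, float]:
    """Express each stage's gain as a fraction of the gain at the upstream stage."""
    base = gains[upstream]
    return {stage: gain / base for stage, gain in gains.items()}


shares = surviving_share(reported_gain_pct, "lines_of_code")
for stage, share in shares.items():
    print(f"{stage:>20}: {share:.1%} of the lines-of-code gain")
```

On these figures, the proportional gain at the release stage is roughly 4% of the proportional gain in lines of code.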

Software development: experienced developers. Becker, Rush, Barnes, and Rein ran an RCT with 16 experienced open-source developers (a small sample) working on 246 tasks in mature projects where they averaged five years of prior experience, using frontier tools from February–June 2025 (arXiv:2507.09089, July 2025). AI tools increased completion time by 19%. Developers predicted a 20–24% speedup; economics and ML experts predicted 38–39%. The slowdown held across 20 analyzed properties of the setting. A February 2026 METR follow-up (metr.org/blog/2026-02-24-uplift-update) found that developers increasingly refused to submit tasks they would prefer to do with AI, making clean measurement harder. That behavior is itself consistent with the judgment-constraint mechanism: practitioners learn to route AI toward pattern-accessible work and resist doing that work without it. The original −19% finding (the slowdown above) held for the same developer pool in the follow-up (−18%, overlapping confidence interval). The result inverts the pattern from Cui et al. because the work type is different: not bounded code completion in standard enterprise tasks, but complex architectural and integration work in mature codebases where the primary input is experienced judgment about deep context.

Knowledge work: the capability boundary. Dell’Acqua, Mollick, and colleagues ran a pre-registered experiment with 758 BCG consultants on real consulting tasks (Organization Science, March 2026). The experiment was conducted in 2023 with GPT-4-class models; the jagged frontier has expanded since, but the structural finding on its invisibility to workers holds. For tasks inside the AI’s capability frontier: 12.2% more tasks completed, 25.1% faster, 40% higher quality ratings. For tasks outside the frontier: performance dropped 19 percentage points versus the no-AI control group. The critical finding: workers could not reliably identify in advance which tasks were on which side of the frontier.

Knowledge work: collaboration. Dell’Acqua, Ayoubi, Lifshitz, Sadun, Mollick, and colleagues ran a pre-registered RCT with 776 professionals at Procter & Gamble on real product innovation challenges (NBER Working Paper 33641, March 2025). An individual with AI access matched the performance of a two-person team without AI. AI also collapsed functional silos: R&D and Commercial professionals, who diverged systematically without AI, converged in their outputs with it. For top-decile performance, the combination of human team plus AI outperformed all other configurations.


The Pattern

Across all six studies, a single structural pattern appears regardless of domain.

Returns are largest where the work component is pattern-accessible: a known problem type, a well-represented solution in the training distribution, a verifiable output. Returns invert where the work requires judgment: novel context, high verification cost, domain-specific integration, edge cases poorly represented in training data.

The skill-inversion finding, in which junior and lower-skilled workers gain the most in most studies, is a direct consequence of this. AI distributes the tacit knowledge of high performers to those who haven’t had time to acquire it. This functions where that tacit knowledge is encodable from training data: scripted responses, code patterns, standard analytical frameworks. It stops functioning where the high performer’s contribution is judgment about non-standard situations.

The pipeline attenuation result quantifies the same constraint from the production side. Each stage in the pipeline where human judgment is required functions as a discounting mechanism on AI-generated throughput. The 0.25 elasticity of substitution measures the judgment tax on AI output at every downstream stage: more code produced, same code adopted, because generation was never the binding constraint.
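
To see why an elasticity of substitution well below 1 caps the payoff from upstream generation, consider a two-input constant-elasticity-of-substitution (CES) aggregate. This is a textbook form used here as a sketch. That Demirer et al. use this exact specification is an assumption, and the equal share weights, the normalization to 1.0, and the contrast value of 4.0 are all hypothetical.

```python
import math


def ces_output(ai_input: float, human_input: float, sigma: float,
               ai_share: float = 0.5) -> float:
    """Two-input CES aggregate with elasticity of substitution sigma (> 0)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if math.isclose(sigma, 1.0):
        # Cobb-Douglas limit of the CES form as sigma approaches 1.
        return ai_input ** ai_share * human_input ** (1.0 - ai_share)
    rho = (sigma - 1.0) / sigma
    total = ai_share * ai_input ** rho + (1.0 - ai_share) * human_input ** rho
    return total ** (1.0 / rho)


baseline = ces_output(1.0, 1.0, sigma=0.25)            # normalized to 1.0
boosted_ai = 1.0 + 7.41                                 # a 741% increase in the AI-stage input
complements = ces_output(boosted_ai, 1.0, sigma=0.25)   # about 1.26
substitutes = ces_output(boosted_ai, 1.0, sigma=4.0)    # hypothetical contrast case

print(f"sigma=0.25: output x{complements / baseline:.2f}")
print(f"sigma=4.00: output x{substitutes / baseline:.2f}")
```

With equal shares and an elasticity of 0.25, output in this sketch can never exceed 2^(1/3), about 1.26 times baseline, while the human-stage input stays fixed, however large the AI-stage input becomes. The human stage is the binding input.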

The P&G top-decile finding and the P&G individual-plus-AI finding are the same result seen from two directions. AI expands the generative and synthesis capacity of a single worker to approximate a team. But reaching the top of the output distribution still requires the evaluative judgment that the team provided, which is why human team plus AI dominates every other configuration for breakthrough output.


The Mechanism: Pattern Distribution

AI systems are trained on past human outputs. What they learn is the statistical structure of those outputs — patterns, associations, response sequences, solution templates — across a very wide distribution of domains. When a worker faces a problem well-represented in that distribution, the AI can retrieve, synthesize, and present a relevant pattern faster than the worker could construct it from scratch.

This is what the tacit knowledge distribution finding describes in practice. Senior customer service agents accumulate, over years, an effective inventory of responses to common problem types. The LLM has seen the statistical structure of millions of such interactions. It makes the senior agent’s pattern inventory accessible to the junior agent without the junior agent having to acquire it through experience. The productivity gain accrues to the junior agent; the senior agent was already operating at the knowledge level the AI makes available.

The same mechanism explains the Becker et al. slowdown. For experienced open-source developers on mature projects, the AI has no relevant deep pattern to distribute. The task requires context about the specific codebase, specific design decisions, specific technical debt — context that doesn’t exist in the training distribution. The AI generates statistically plausible code; the developer must validate it against deep knowledge the AI doesn’t possess. The verification cost exceeds the drafting time saved.


The Judgment Boundary

Judgment, as the term applies here, refers to the capacity to assess novel context, assign appropriate weight to incommensurable considerations, and produce reliable decisions in situations not well-represented in any training distribution.

Every study converges on this as the residual. Brynjolfsson’s senior support agents contributed judgment about edge cases the AI couldn’t model; their productivity was unmoved. Demirer’s pipeline discounts AI-generated code at every human review, integration, and deployment stage. Becker’s experienced developers slowed down because their work was primarily judgment about complex context, and the AI’s outputs about that context were plausible-sounding but often incorrect. Dell’Acqua’s consultants degraded on outside-frontier tasks: tasks where the expected result required judgment the AI was systematically miscalibrated to provide, and where workers deferred to it anyway.

Call $R$ the net return, $C_h$ the human production cost AI replaces, $C_v$ the verification cost, and $C_e$ the expected error cost; $J$ denotes judgment requirement throughout.
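
For reference, the net-return identity from the thesis can be written in these symbols. This is a reconstruction under stated assumptions, not a formula quoted from any of the cited papers. The symbol $C_g$ (AI generation cost) is introduced here because the list above does not name one, and writing the cost terms as non-decreasing functions of $J$ is one reading of the claim that they scale with the judgment required.

```latex
R(J) = C_h - \bigl[\, C_g(J) + C_v(J) + C_e(J) \,\bigr],
\qquad
\frac{dC_g}{dJ} \ge 0,\quad
\frac{dC_v}{dJ} \ge 0,\quad
\frac{dC_e}{dJ} \ge 0 .
```

Under these assumptions $R$ is non-increasing in $J$, and it turns negative once $C_g(J) + C_v(J) + C_e(J) > C_h$.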

The model maps directly to each finding. In Brynjolfsson, senior workers have low $C_h$ and high $C_v$ (their expertise constitutes the verification capacity), yielding $R \approx 0$; junior workers have high $C_h$ and low $C_v$ on bounded, verifiable tasks, yielding large $R$. In Demirer, each human stage in the production pipeline adds a compounding $C_v + C_e$ term; the 0.25 elasticity of substitution reflects an O-ring structure where each high-$J$ stage compounds the attenuation of upstream AI gains. In Becker, experienced developers face low $C_h$, high $C_v$ requiring full expertise to validate mature-codebase output, and elevated $C_e$ where AI-generated code is plausible but incorrect, producing $R < 0$. In Dell’Acqua’s outside-frontier condition, $C_e$ rises sharply rather than gradually at the frontier boundary, because the model is systematically miscalibrated there, not merely noisy.

The BCG result — that workers couldn’t identify in advance which tasks were inside or outside the frontier — makes this operationally important. The frontier isn’t labeled. Workers who apply AI evenly across pattern-accessible and judgment-required tasks will see the gains from the former and the losses from the latter. What the evidence supports is that the capacity to identify which is which — to direct AI to the components where pattern distribution helps and retain judgment where it doesn’t — is itself an exercise of judgment. This makes judgment the meta-skill that governs the sign and magnitude of AI returns at any skill level.
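
The routing argument can be illustrated with a small expected-value calculation. This is a sketch under loud assumptions: the function names, the 60/40 task split, the per-task gain and loss, and the single symmetric `routing_accuracy` parameter are all hypothetical and are not estimates from the BCG study.

```python
def return_if_ai_everywhere(share_inside: float, gain_inside: float,
                            loss_outside: float) -> float:
    """Expected net return per task when AI is applied to every task."""
    return share_inside * gain_inside - (1.0 - share_inside) * loss_outside


def return_if_routed(share_inside: float, gain_inside: float,
                     loss_outside: float, routing_accuracy: float) -> float:
    """Expected net return per task when AI is used only on tasks judged inside the frontier.

    routing_accuracy is the probability a task is classified correctly (assumed
    equal for both task types). Tasks done without AI earn the baseline return, zero.
    """
    captured_gain = share_inside * routing_accuracy * gain_inside
    incurred_loss = (1.0 - share_inside) * (1.0 - routing_accuracy) * loss_outside
    return captured_gain - incurred_loss


# Hypothetical parameters: 60% of tasks inside the frontier, gain 10, loss 20 per task.
params = dict(share_inside=0.6, gain_inside=10.0, loss_outside=20.0)

everywhere = return_if_ai_everywhere(**params)
for accuracy in (0.5, 0.7, 0.9, 1.0):
    routed = return_if_routed(**params, routing_accuracy=accuracy)
    print(f"routing accuracy {accuracy:.0%}: routed {routed:+.2f} vs everywhere {everywhere:+.2f}")
```

With these made-up numbers, indiscriminate use is net negative, and routing turns positive somewhere between 50% and 70% accuracy. The break-even point depends entirely on the assumed parameters; the sketch only shows that the sign of the return depends on how well the worker sorts tasks.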


Why the Boundary Is Structural

Two constraints make this boundary durable rather than a function of current model capability (see note 1).

The operational space problem. General judgment across the full range of professional contexts requires modeling a world of open-ended complexity — domain knowledge interacting with current events, client history, organizational dynamics, regulatory context, and information that doesn’t exist in any training set. The relevant patterns for a given professional’s specific situation change continuously with their practice. Modeling this exhaustively is computationally and economically unbounded. Useful AI operates in closed or semi-closed problem spaces where the relevant patterns are well-defined and training coverage is adequate. The moment the problem space opens into the full complexity of a professional’s active operational context, the economics of comprehensive pattern coverage collapse.

The dynamic integration problem. Human practitioners continuously update their operational model from experience. Every new case, project, or domain encounter revises the practitioner’s internal model of how the domain works. This is continuous cognitive integration of new information into an existing representational structure — not retrieval from a static training distribution. A language model updates at training time, not inference time. It processes inputs but doesn’t revise its world model in response to them. Replicating this dynamic integration for general judgment requires a complete closed-loop system — sensors, persistent world model, continuous update — which is economically viable only in heavily engineered, domain-specific contexts (autonomous systems, precision industrial control) where the integration cost is justified by the value of the specific capability. The general judgment capacity of a practitioner operating across a complex, dynamic, multi-domain practice is not economically reproducible by this route.

Both constraints are economic before they are technical. The pattern-accessible portion of professional work — the part AI demonstrably accelerates — is large enough to produce real productivity gains. The judgment-requiring portion is what remains when pattern distribution reaches its limit. Improving models expands the former; it doesn’t dissolve the latter.


What the Evidence Shows

The findings are consistent with a single organizing principle: AI returns are bounded by the judgment requirement of the work, at every skill level and in every domain studied.

Where workflow components can be decomposed and the pattern-accessible portions directed toward AI, it functions as a high-bandwidth pattern retrieval and synthesis layer. The output of that layer is only as good as the judgment applied in directing it and evaluating its results. Top-decile performance in the P&G study required human team plus AI: AI expanded the idea generation and cross-domain synthesis space; experienced humans applied the selection judgment that AI couldn’t replicate.

The economic implication is that AI shifts the value of judgment upward rather than reducing it. The returns from pattern distribution now accrue broadly. The premium on judgment — the capacity to navigate novel context, correctly evaluate AI output, and determine when patterns don’t apply — becomes the differentiating input in every domain where AI has penetrated. The practitioner’s ability to identify and direct the pattern-accessible components of their work is itself the skill that determines how much value AI returns.


References

  1. Brynjolfsson, E., Li, D., and Raymond, L. R. “Generative AI at Work.” Quarterly Journal of Economics 140(2): 889–942 (2025). https://www.nber.org/papers/w31161

  2. Cui, K. Z., Demirer, M., Jaffe, S., Musolff, L., Peng, S., and Salz, T. “The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers.” Management Science (2026). https://pubsonline.informs.org/doi/10.1287/mnsc.2025.00535

  3. Demirer, M., Musolff, L., and Yang, L. “Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools.” NBER Working Paper 35275 (May 2026). https://www.nber.org/papers/w35275

  4. Becker, J., Rush, N., Barnes, E., and Rein, D. “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” arXiv:2507.09089 (July 2025). https://arxiv.org/abs/2507.09089

  5. Dell’Acqua, F., McFowland, E., Mollick, E., Lifshitz, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F., and Lakhani, K. “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality.” Organization Science (March 2026). https://pubsonline.informs.org/doi/10.1287/orsc.2025.21838

  6. Dell’Acqua, F., Ayoubi, C., Lifshitz, H., Sadun, R., Mollick, E., Mollick, L., Han, Y., Goldman, J., Nair, H., Taub, S., and Lakhani, K. “The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise.” NBER Working Paper 33641 (March 2025). https://www.nber.org/papers/w33641


  Note 1. These structural constraints are grounded in economic and cognitive science frameworks developed elsewhere. The O-ring complementary-stages logic underlying the pipeline attenuation result originates in Kremer (1993); its application to AI task chains is developed formally in Acemoglu (2024), NBER w32487 and cited directly in Demirer et al. (2026). The dynamic integration problem draws on the embodied cognition literature; see Wilson (2002) for a concise treatment of embodied cognition’s core claims, and Varela, Thompson, and Rosch, The Embodied Mind (MIT Press, 1991) for the foundational account. The economic non-viability of general sensor-integrated judgment replication is implied by the cost structure of closed-loop autonomous systems; military and precision industrial deployments represent the primary contexts where integration cost is economically justified.