Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New
The DX Engineering Enablement newsletter published new data this month: 40 companies tracked from November 2024 through February 2026, AI usage up 65% across that cohort, PR throughput up 9.97%. Meanwhile the Fortune piece on the Solow productivity paradox dropped in February, citing NBER research covering 6,000 executives across four countries. The picture is consistent: adoption is near-universal, measured impact is modest, claimed impact is not.
Changelog
| Date | Summary |
|---|---|
| 11 Mar 2026 | Initial publication. |
GitHub Copilot makes developers 55% more productive.
That number is real. It comes from a controlled experiment: GitHub published it, researchers ran the trials, and the methodology is documented. Nobody invented it.
Here’s another number: 9.97%. That’s the actual increase in PR throughput DX measured across 40 companies over 15 months, as AI usage within those organisations rose by 65%.
Both numbers are true. They do not measure the same thing. And the distance between them is where a lot of expensive decisions are being made right now.
What the Studies Actually Measured
The GitHub Copilot study measured how long it took developers to complete a specific coding challenge. One task, one session, unfamiliar codebase, controlled conditions. Developers using Copilot finished 55% faster. That is a real finding. It is also a finding about a single task, selected to showcase what the tool does well.
This is not unusual. Vendor-commissioned productivity studies routinely measure the thing the tool is designed to accelerate, under conditions that maximise that effect. The number isn’t false – it’s narrow.
The DX methodology is different. They track delivery metrics – PR throughput, deployment frequency, change failure rate – longitudinally, before and after AI adoption, across real engineering teams shipping real software. The signal they’re looking for isn’t “how fast can a developer complete this task?” It’s “is the team shipping more software, more reliably, than it was before?”
The answer, across 135,000+ developers at 435 companies: 91% now use AI coding tools. Average self-reported time savings: 3.6 hours per week. Actual measured improvement in delivery throughput: roughly 10%.
Saving three and a half hours a week per developer should compound into significant throughput gains. It isn’t. The DX team’s explanation is blunt: writing code was never the bottleneck.
Planning, alignment, scoping, code review, debugging, handoffs – these are where engineering time actually goes. Code generation represents something like 25-35% of the total software development lifecycle. A 55% speedup on 30% of the job gives you roughly 15% improvement in the best case, before accounting for downstream effects. In practice, the gains are being partially absorbed – and in some cases reversed – by what happens after the code is written.
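The arithmetic is an Amdahl’s-law-style bound. A minimal sketch, using the figures from this section (55% task-level speedup, 25-35% lifecycle share); the function name and loop values are illustrative, not from any of the studies:

```python
def whole_job_time_saved(accelerated_fraction: float, task_time_saved: float) -> float:
    """Amdahl's-law-style bound: fraction of total time saved on a job when
    only `accelerated_fraction` of it gets `task_time_saved` faster (both in [0, 1])."""
    remaining = (1 - accelerated_fraction) + accelerated_fraction * (1 - task_time_saved)
    return 1 - remaining

# A 55% task-level speedup applied to 25-35% of the lifecycle:
# roughly 14-19% overall, consistent with the ~15% best case above.
for share in (0.25, 0.30, 0.35):
    print(f"{share:.0%} of lifecycle -> {whole_job_time_saved(share, 0.55):.1%} saved overall")
```

Note the bound is a ceiling: it assumes the unaccelerated 65-75% of the work is untouched, whereas the downstream effects described below can push it lower still.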
The Perception Problem
In July 2025, METR published a randomised controlled study that should be required reading for anyone making AI tooling decisions.
Sixteen experienced open-source developers. 246 real tasks. Proper experimental design. The study set out to measure how much faster AI made them.
Result: on tasks where developers were allowed to use AI tools, they took 19% longer to complete the work than on tasks where they weren’t.
The developers themselves estimated they were 20% faster.
That 39-point gap between perception and reality is the number worth sitting with. Not because it proves AI tools are harmful – the study covers a specific population (experienced developers on complex open-source tasks) and shouldn’t be extrapolated carelessly. But because it tells you something important about the nature of the productivity signal you’re getting from your team.
Developers feel faster. The first draft arrives sooner. The boilerplate is done. There’s less friction, less of the blank-page problem, more of a feeling of flow. That feeling is genuine. It corresponds to something real in the experience of using these tools.
It does not reliably correspond to output.
This matters for how you interpret self-reported productivity surveys, which are the primary measurement instrument most organisations are using. If developers systematically overestimate their own speed gains, survey data will consistently read better than delivery metrics. And if leadership is making decisions based on the survey data while the delivery metrics are sitting somewhere else, that’s not a small calibration error.
For more on the cognitive overhead side of this – the cost of overseeing AI-generated work, and the 4-hour ceiling pattern – I’ve covered that separately. The short version: sustained AI-assisted work creates its own form of cognitive load that self-reported time savings don’t capture.
The Solow Echo
In 1987, Robert Solow wrote: “You can see the computer age everywhere except in the productivity statistics.”
Personal computers were ubiquitous. Software was transforming how companies worked. And measured productivity growth had slowed, not accelerated.
Fortune ran a piece in February 2026 pointing out that the same argument now applies to AI. Apollo chief economist Torsten Slok: “AI is everywhere except in the incoming macroeconomic data.” An NBER study of 6,000 executives across the US, UK, Germany, and Australia found that nearly 90% reported AI had no measurable impact on employment or productivity over the previous three years. The executives who use AI average 1.5 hours per week. 374 companies in the S&P 500 mentioned AI positively in earnings calls. The broader productivity statistics don’t show it.
MIT economist Daron Acemoglu – Nobel laureate – modelled a 0.5% productivity increase across the economy over the next decade from AI. His framing: better than zero, but disappointing relative to the promises.
The Solow paradox eventually resolved. Computers did drive massive productivity gains – but with a 10-15 year lag while organisations learned how to restructure around the technology. The gains came not from automating existing processes but from redesigning them. The bottlenecks moved, workflows changed, and eventually the productivity statistics caught up.
The question for AI is whether we’re in the early part of that same curve, or whether the analogy breaks down. There are reasons to think it might: software development iteration cycles are faster than they were in 1987, AI capabilities are improving faster than PC capabilities improved in the late 1980s, and some domains are already showing genuine productivity gains at scale.
But “we might be in the early part of the Solow curve” is not the same claim as “AI delivers 2-3x productivity gains.” The former is a reasonable bet about the future. The latter is a claim about the present that the longitudinal data doesn’t support.
Where the Gains Are Real and Where They Aren’t
This isn’t a dismissal of AI productivity gains. It’s a calibration.
Gains that appear in the data:
- Boilerplate and scaffolding: real time savings, measurable
- Test generation: meaningful reduction in a task developers routinely defer
- Documentation and code explanation: clearer gains here than in core coding
- Junior developers on well-defined tasks: genuine acceleration, consistent across multiple studies
- Writing (non-code): stronger gains than code, possibly because prose has clearer acceptance criteria than software
Where gains are overstated or absent:
- Senior engineers on novel, complex problems: the METR finding suggests net negative on tasks requiring deep domain expertise
- System design and architecture: no evidence of meaningful gains at any seniority level
- Code review quality: potentially declining – more code to review, AI-generated code harder to evaluate, review becoming a bottleneck that grows faster than generation speeds up
- Team-level delivery velocity: ~10% longitudinally, not 50-80%
The plausibility vs correctness distinction is relevant here. AI-generated code is often plausible-looking in a way that passes quick review and fails in production. The cognitive cost of distinguishing plausible from correct scales with task complexity – which is part of why senior engineers on hard problems are the group most likely to be net-slower, not net-faster.
Faros AI measured this effect across 10,000+ developers on 1,255 teams. Teams with high AI adoption completed 21% more tasks and merged 98% more PRs. PR size grew 154%. Review time went up 91%. Bug rate up 9%. Organisational DORA metrics: flat. Cursor’s response to this finding was to acquire Graphite, a code review startup – which is a more honest statement about where the constraint actually lives than anything in their marketing.
What This Means for Engineering Leadership
If you’re evaluating AI tooling investments, adoption mandates, or – more seriously – headcount reductions justified by AI productivity claims, here’s the framework:
Ask what the study measured. Single-task benchmarks and longitudinal delivery metrics are not the same thing. A vendor showing you task completion speed is not showing you engineering throughput. Ask what happened to deployment frequency, change failure rate, and cycle time in the six months after adoption. If they don’t have that data, they’re showing you the input, not the output.
Separate self-reported from measured. Surveys are a leading indicator. The METR finding suggests they can be a systematically misleading one. Survey data is useful for understanding friction and experience; it’s not a reliable proxy for output. If your primary measurement instrument is developer satisfaction surveys, you don’t have productivity data – you have sentiment data.
Track the downstream effects. PR throughput going up is not the same as delivery velocity going up. If code generation accelerates but code review becomes the new bottleneck, you’ve moved the constraint, not removed it. Measure the full cycle, not just the step that improved.
Be sceptical of linear scaling assumptions. The logic “we saved 3.6 hours per developer per week, multiply by headcount, that’s X person-years of capacity” does not hold in systems with non-coding bottlenecks. If code review, planning, and alignment are the constraints, adding coding capacity doesn’t add delivery capacity.
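The non-linearity is just a min() over pipeline stages. A toy sketch of why added coding capacity doesn’t move delivery when another stage is the constraint; every capacity number here is hypothetical, not drawn from the studies above:

```python
def delivery_throughput(stage_capacity_prs_per_week: dict[str, float]) -> float:
    """Toy model: a delivery pipeline ships at the rate of its slowest stage."""
    return min(stage_capacity_prs_per_week.values())

# Hypothetical team where review, not coding, is the constraint.
before = delivery_throughput({"coding": 10, "review": 8, "deploy": 12})
after = delivery_throughput({"coding": 15, "review": 8, "deploy": 12})  # coding 50% faster

print(before, after)  # prints "8 8": delivery unchanged until review capacity moves
```

The sketch also shows why the linear-scaling arithmetic fails in the other direction: raising the review number, not the coding number, is what changes the output.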
Think about who captures the gains. The data suggests junior developers on well-defined tasks capture more of the productivity benefit than senior engineers on complex problems. An organisation that measures average productivity gains and then makes headcount decisions based on that average may be cutting the people most effective at the work AI is worst at.
The ManpowerGroup 2026 Global Talent Barometer surveyed 14,000 workers across 19 countries and found regular AI use up 13% in 2025, while confidence in AI’s utility was down 18%. Adoption is high. Scepticism is growing. The gap between the two is where the honest conversation about productivity measurement needs to happen.
The productivity gains from AI tools are real. The 10% longitudinal figure from DX is not evidence that AI is useless – it’s evidence that the gains are smaller, more specific, and more unevenly distributed than the headline numbers suggest.
The 55% figure from the Copilot study is not a lie. It is a measurement of a particular thing, under particular conditions, for a particular population. It belongs in the same category as every other benchmark: useful context, not a deployment prediction.
The problem is that executives are making irreversible decisions – about headcount, about team structure, about investment – using the 55% number in contexts where the 10% number is the relevant one. The measurement gap isn’t a technical footnote. It’s where the risk lives.
Engineering discipline means measuring before deciding. That applies here.
References
- DX Engineering Enablement Newsletter (2026). AI productivity gains are 10%, not 10x. https://newsletter.getdx.com/p/ai-productivity-gains-are-10-not
- DX (2025). AI-assisted engineering: Q4 impact report. https://getdx.com/blog/ai-assisted-engineering-q4-impact-report-2025/
- METR (2025). Measuring the impact of early-2025 AI on experienced open-source developer productivity. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- Dubach, P. (2026). 93% of developers use AI coding tools. Productivity hasn’t moved. https://philippdubach.com/posts/93-of-developers-use-ai-coding-tools.-productivity-hasnt-moved./
- Fortune (2026). Thousands of executives aren’t seeing AI productivity boom, reminding economists of IT-era paradox. https://fortune.com/2026/02/17/ai-productivity-paradox-ceo-study-robert-solow-information-technology-age/
- Solow, R. (1987). We’d better watch out. New York Times Book Review.
- Acemoglu, D. (2024). The Simple Macroeconomics of AI. MIT / NBER. https://economics.mit.edu/sites/default/files/2024-04/The%20Simple%20Macroeconomics%20of%20AI.pdf
- Faros AI (2025). The AI productivity paradox. https://www.faros.ai/ai-productivity-paradox