Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New: 7 March 2026
Quieter day – nothing today that materially shifts the thesis.
Changelog
| Date | Summary |
|---|---|
| 7 Mar 2026 | Inaugural edition. |
A Rust rewrite of SQLite. 576,000 lines across 625 files. Parser, planner, VDBE bytecode engine, B-tree, pager, WAL – all the right module names, all the right architecture. It compiles. It passes its tests. It reads and writes the correct SQLite file format.
Primary key lookup on 100 rows:
- SQLite: 0.09ms
- LLM rewrite: 1,815.43ms
That’s 20,171x slower.
The bug is one function, four lines:
```rust
fn is_rowid_ref(col_ref: &ColumnRef) -> bool {
    // Matches the literal rowid names only – it never consults the is_ipk flag,
    // so a named INTEGER PRIMARY KEY column falls through to a full table scan.
    let name = col_ref.column.to_ascii_lowercase();
    name == "rowid" || name == "_rowid_" || name == "oid"
}
```
SQLite converts INTEGER PRIMARY KEY columns to rowid aliases. A WHERE id = 5 query resolves to a direct B-tree seek – O(log n). The Rust rewrite has a correct B-tree. The ColumnInfo struct even has is_ipk: true set correctly. But is_rowid_ref() never checks it. The query planner never calls the B-tree seek for named columns. Every WHERE id = N becomes a full table scan. At 100 rows with 100 lookups, that’s 10,000 row comparisons instead of roughly 700 B-tree steps.
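The fix is as small as the bug. A sketch, with hypothetical struct shapes – the rewrite keeps is_ipk on a separate ColumnInfo struct, but it is inlined onto ColumnRef here so the example stands alone:

```rust
// Hypothetical shape for illustration: in the rewrite, is_ipk lives on
// ColumnInfo; it is folded into ColumnRef here to keep the sketch self-contained.
struct ColumnRef {
    column: String,
    is_ipk: bool, // set by the resolver for INTEGER PRIMARY KEY columns
}

// Fixed: a named INTEGER PRIMARY KEY column is a rowid alias, so it must take
// the B-tree seek path – not just the literal names "rowid"/"_rowid_"/"oid".
fn is_rowid_ref(col_ref: &ColumnRef) -> bool {
    if col_ref.is_ipk {
        return true;
    }
    let name = col_ref.column.to_ascii_lowercase();
    name == "rowid" || name == "_rowid_" || name == "oid"
}

fn main() {
    let id = ColumnRef { column: "id".to_string(), is_ipk: true };
    let email = ColumnRef { column: "email".to_string(), is_ipk: false };
    assert!(is_rowid_ref(&id));     // now resolves to a B-tree seek
    assert!(!is_rowid_ref(&email)); // still a scan, correctly
    println!("ok");
}
```

Four lines of check, one missing branch – and 20,000x of performance hanging on it.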
Every signal said this code worked. Only benchmarking revealed the lie.
This is not a story about one developer’s mistake. It’s a story about what LLMs are optimised to produce – and why your acceptance criteria are the only thing standing between you and 576,000 lines of plausible code that doesn’t do what you need.
What Plausibility Optimisation Actually Is
LLMs are trained on human-generated code. The training signal rewards output that matches what humans produce and what humans approve of. What do humans approve of? Code that compiles. Code with sensible naming. Code that follows established architectural patterns. Code that passes the tests provided.
None of those are correctness. They’re plausibility.
The distinction matters because plausibility is cheap and correctness is hard. A plausible B-tree implementation uses the right data structure names, implements binary search descent through nodes, and passes unit tests on small datasets. A correct B-tree implementation for a database also knows which query paths should use it, connects named primary key columns to rowid aliases, and handles the is_ipk case that only shows up in production workloads.
The plausible version is everything in the training data. The correct version contains knowledge that lives in 26 years of profiling real SQLite workloads – commit messages, performance bug reports, and the single line in where.c that Richard Hipp wrote when someone noticed that named primary key columns weren’t hitting the B-tree search path.
The model optimised for plausibility because that’s what training rewards. It produced architecture that looks right, naming that looks right, and tests that pass – without the semantic knowledge that makes a database engine correct at scale.
This isn’t a capability gap that will close with the next model release. The Mercury benchmark (NeurIPS 2024) measured this empirically: leading code LLMs achieve around 65% on correctness benchmarks but under 50% when efficiency is also required. The gap between “does it work” and “does it work correctly” is structural. The model was trained to produce output that satisfies the prompt, not output that satisfies the constraints the prompt didn’t specify.
RLHF makes this worse. Anthropic’s sycophancy research (ICLR 2024) showed that models trained on human preference data learn to reward agreement over correctness – when a response matched what the evaluator expected, it was more likely to be rated highly. Applied to code: if the prompt was “implement a query planner”, and the output looks like a query planner, the model and the evaluator are both satisfied. Neither has the information to know that the most common query pattern is going through codegen_select_full_scan().
Why the Tests Pass
Here’s the deeper problem. The LLM wrote the tests.
Tests written by the same model that wrote the code test for the same things the model knew to test for. The model didn’t know about SQLite’s rowid optimisation, so it didn’t write a benchmark for primary key lookup latency. The tests confirmed that the code does what the LLM thought it should do – not what a database engine actually needs to do at scale.
This is a closed loop. The model generates a plausible implementation, generates plausible tests for that implementation, and the tests pass because both were generated from the same set of assumptions. The test suite is not evidence of correctness. It’s evidence that the model’s internal model of the system is consistent with itself.
SQLite’s test suite is 590 times larger than the library itself. It has 100% branch coverage and 100% MC/DC – the standard required for Level A aviation software. That test suite exists because decades of real usage revealed edge cases and performance invariants that no one could have specified up front. The reimplementation’s tests pass because they only test what the model knew to test.
The same dynamic applies when you ask the model to review its own output. Ask Claude or GPT-4 to audit the code it generated and it will tell you the architecture is sound, the module boundaries clean, and the error handling thorough. It won’t tell you that every query does a full table scan – not because it’s hiding something, but because the same training signal that made it generate that code makes it evaluate that code as plausible. You’re not getting an independent review. You’re getting the same plausibility engine applied to the same output.
The METR randomised controlled trial (July 2025) made this concrete in a different way: 16 experienced open-source developers using AI were 19% slower, not faster. After the measured slowdown had already occurred, they still believed AI had sped them up by 20%. Subjective confidence is not a reliable signal. The model generates code confidently. Engineers interpret confident output as correct output. The metrics say otherwise.
The Acceptance Criteria Fix
The escape hatch is to define what “correct” means before you generate a single line of code.
Not “write a SQLite engine in Rust.” That prompt produces plausibility. The model has no information about what correct means for your use case, so it optimises for what correct looks like based on training data.
Instead: “Write a SQLite engine in Rust where a primary key lookup on 100 rows completes in under 1ms, measured against this benchmark, with this test harness.”
Notice what that criterion requires. You can’t write it without knowing that SQLite’s INTEGER PRIMARY KEY columns become rowid aliases. You can’t write it without knowing that O(log n) is the expected complexity for primary key lookups. You have to understand the performance invariant to specify it.
That’s the point.
The acceptance criterion forces the human to specify the correctness constraint that the model can’t infer. The model can’t know what you need unless you tell it. The only reliable way to tell it is in machine-executable form – a benchmark, a property-based test, a specific latency target – not prose.
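What machine-executable looks like in practice – a sketch of a latency criterion as a runnable harness. The lookup here is a placeholder (binary search over a sorted vector standing in for the engine under test) so the harness is self-contained; the real version would call into the implementation being accepted:

```rust
use std::time::Instant;

// Placeholder for the engine under test: a primary key lookup over 100 rows.
// A correct engine does a B-tree seek here; binary search stands in for it.
fn lookup(rows: &[(i64, String)], key: i64) -> Option<&str> {
    rows.binary_search_by_key(&key, |&(k, _)| k)
        .ok()
        .map(|i| rows[i].1.as_str())
}

fn main() {
    let rows: Vec<(i64, String)> = (0..100).map(|i| (i, format!("row{i}"))).collect();

    let start = Instant::now();
    for key in 0..100 {
        assert!(lookup(&rows, key).is_some());
    }
    let elapsed = start.elapsed();

    // The acceptance criterion, encoded: 100 lookups at under 1ms each.
    // A full-table-scan implementation fails this loudly instead of silently.
    assert!(elapsed.as_millis() < 100, "criterion failed: {elapsed:?}");
    println!("100 lookups in {elapsed:?}");
}
```

The criterion is no longer prose the model can satisfy plausibly. It is an assertion the output either passes or fails.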
This generalises beyond database engines. Any time you ask an LLM to write code without specifying observable correctness criteria, it optimises for plausibility. The output will look right. The tests will pass. And you won’t find out about the full table scan until someone runs a benchmark or hits a production load.
The fix isn’t to use the model less. It’s to use it differently. The model is excellent at turning a well-specified problem into working code. It’s unreliable at defining what “working” means. That’s human work. Domain knowledge lives in a person who has operated the system, profiled the workload, and knows which edge cases matter. The model can’t generate that knowledge from documentation and Stack Overflow answers.
Practical Patterns
Write the acceptance criteria before you write the prompt. Not after. If you can’t specify what correct looks like before you generate the code, you don’t have enough domain knowledge to verify what the model produces. That’s a signal to go acquire the knowledge first.
Frame the ask around observable outcomes. Not “build a caching layer” – “build a caching layer where cache hit latency is under 5ms at the 99th percentile, measured against this load profile, with no lock contention above 100 concurrent requests.” Concrete, measurable, machine-verifiable. If you can’t write that criterion, you don’t know enough about the system to hand it to an LLM.
Write the benchmark or property-based test first. Commit it. Then generate the implementation. The test is the specification. If the model’s output doesn’t pass it, you have a concrete failure to reason about – not a vague feeling that something might be wrong.
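A property-based test of this shape can be sketched with the standard library alone: whatever path the planner picks, the indexed lookup must agree with a naive full scan on every key. Both functions are placeholders for the engine's real paths, and a small LCG keeps key generation deterministic and dependency-free:

```rust
use std::collections::BTreeMap;

// Oracle: the slow, obviously-correct path.
fn full_scan(rows: &[(i64, u32)], key: i64) -> Option<u32> {
    rows.iter().find(|&&(k, _)| k == key).map(|&(_, v)| v)
}

// Path under test: placeholder for the engine's indexed lookup.
fn indexed_lookup(sorted: &[(i64, u32)], key: i64) -> Option<u32> {
    sorted
        .binary_search_by_key(&key, |&(k, _)| k)
        .ok()
        .map(|i| sorted[i].1)
}

fn main() {
    // Deterministic pseudo-random keys via a small LCG (no external crates).
    let mut state: u64 = 42;
    let mut next = || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((state >> 33) % 1000) as i64
    };

    // BTreeMap dedupes keys and yields them sorted, as an index would.
    let mut map = BTreeMap::new();
    for v in 0..200u32 {
        map.insert(next(), v);
    }
    let sorted: Vec<(i64, u32)> = map.into_iter().collect();

    // The property: both paths agree on hits and misses alike.
    for _ in 0..1000 {
        let key = next();
        assert_eq!(full_scan(&sorted, key), indexed_lookup(&sorted, key));
    }
    println!("property held for 1000 random keys");
}
```

This test says nothing about how the lookup is implemented – only that it must be indistinguishable from the oracle. That is exactly the kind of constraint a model cannot satisfy by plausibility.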
Red-team the output before you ship it. Not “does this look right?” – “describe all the ways this implementation could fail under production load.” Ask it to identify the paths it might have handled wrong. Ask it to explain why it chose each major algorithmic decision. If it can’t explain why it chose a full table scan over a B-tree seek, that’s a gap.
Never use the same model to audit what it generated. The plausibility optimisation applies to evaluation as much as generation. Get a second pass from a different model, or do the review yourself against the acceptance criteria you wrote up front.
The bigger the codebase, the more critical this is. 576,000 lines of plausible code is a lot of places to hide a missing rowid optimisation. At that scale, you cannot review every decision. The only protection is correctness criteria that cover the invariants that matter – before the code exists, not after.
In practice with Claude Code or Copilot: open with the benchmark, not the feature description. Something like:
```
Before writing any code: here is the acceptance test this implementation must pass.

[benchmark / test harness]

Now implement the feature such that this test passes.
```
The model has the constraint up front. It can’t optimise for plausibility alone because plausibility that fails the benchmark is explicitly wrong. You’ve broken the closed loop.
The model doesn’t know what it doesn’t know. The is_ipk check that makes SQLite correct exists because Richard Hipp measured a real workload and noticed the gap. That knowledge isn’t in the documentation. It’s in commit history, in performance bug reports, in the accumulated experience of running a trillion deployed databases. No model trained on documentation will infer it.
You know what your system needs to do. You know the load profiles, the edge cases, the performance invariants that matter for your specific use case. The model knows how to generate code that looks like it satisfies a description.
That’s the division of labour that makes this work. Write the acceptance criteria. Make them machine-executable. Commit them before the first line of implementation exists.
The vibes are not enough. Define what correct means. Then measure.