Commissioned, Curated and Published by Russ. Researched and written with AI.


What’s New

Informal Systems published a detailed post-mortem on the Malachite Fast Tendermint refactor in November 2025. The timeline (one week, not months) and the methodology (executable specs as the validation layer between English and implementation) are the most concrete proof points I’ve seen for formal methods having a practical role in AI-assisted engineering. Nothing this week has changed that assessment.


Changelog

| Date | Summary |
| --- | --- |
| 12 Mar 2026 | Initial publication. |

You’ve stared at a large AI-generated diff wondering if it’s actually right. Not a line here and there – hundreds of lines, touching core logic, with a test suite that passes. Everything looks fine. That’s the problem.

This isn’t a generation problem. LLMs are impressively good at generating code. The problem is validation. The code looks plausible. The tests pass. And you have no reliable mechanism to verify that “plausible and passing” is the same as “correct.”

One team has a concrete answer to this. It involves formal specifications, model-based testing, and a workflow that reduced a months-long refactor to a week. Here’s what they did and why it works.

What Informal Systems Built

Informal Systems is a distributed systems company with a formal verification focus. They built Quint – an executable specification language designed to sit between natural language and code.

That positioning is the key idea. English specifications are ambiguous and unverifiable. You can write a detailed English description of a protocol and still have two engineers come away with different mental models of how it behaves. Code is precise, but it’s hard to reason about at the system level: the implementation details obscure the protocol logic.

Quint occupies the middle ground. It’s abstract enough to reason about system behaviour without implementation noise. And unlike English, it’s executable – you can run it, explore it, and check properties against it mechanically. The tooling includes a simulator, a model checker, and a REPL. You can ask questions like “can I reach a state where X happens?” and get a definitive answer, not a guess.

This is the same category of tool as TLA+ (Leslie Lamport’s specification language based on the Temporal Logic of Actions), Alloy, or PlusCal (a pseudocode-style front end to TLA+). If you’ve used any of those, Quint will feel familiar – it’s more ergonomic and comes with tighter tooling integration. If you haven’t, the important thing to understand is what the category does: it lets you describe system behaviour at a level above code, and then verify that behaviour mechanically.
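To make “verify behaviour mechanically” concrete, here’s a toy Python analogue of what the category provides – a tiny abstract state machine plus an exhaustive reachability check. This is illustrative only: it is not Quint syntax and not the Malachite spec, just the underlying idea that questions about abstract behaviour can get definitive answers.

```python
# Toy "executable spec": an abstract state machine checked by exhaustive search.
from collections import deque

# Hypothetical protocol state: (round, decided). Not a real protocol.
INIT = (0, False)

def next_states(state):
    rnd, decided = state
    if decided:
        return []                       # terminal: no steps after deciding
    steps = []
    if rnd < 3:
        steps.append((rnd + 1, False))  # timeout: advance to the next round
    steps.append((rnd, True))           # quorum reached: decide in this round
    return steps

def reachable(predicate):
    """Mechanically answer 'can I reach a state where X happens?'"""
    seen, queue = {INIT}, deque([INIT])
    while queue:
        state = queue.popleft()
        if predicate(state):
            return state                # a definitive witness, not a guess
        for nxt in next_states(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None                         # definitively unreachable

print(reachable(lambda s: s[1]))        # a deciding state is reachable
print(reachable(lambda s: s[0] > 3))    # the round never exceeds 3
```

The payoff is the `None` case: with English prose you can only argue that a state is unreachable; with an executable model the claim is checked over every reachable state.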

Model-based testing closes the loop. Once you have a validated spec, you run the same scenarios through both the spec and the implementation, and verify they produce identical outcomes. Confidence established in the spec transfers to the code – not by hope, but by mechanical comparison.

The Experiment: Fast Tendermint in a Week

Malachite is a production-grade BFT consensus engine implementing the Tendermint protocol. It’s used by Circle – the company behind USDC – for Arc, their blockchain platform. This is not a demo project. It’s consensus logic for a production financial system.

The change: switch from classical Tendermint (which requires 3F+1 nodes to tolerate F Byzantine nodes) to Fast Tendermint (which requires 5F+1 but achieves better performance through one fewer communication step). The traditional estimate for a change of this scope was a couple of months. They did it in a week.
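The resilience trade-off is simple arithmetic, using only the figures above:

```python
# Minimum cluster size needed to tolerate F Byzantine nodes under each variant.
def min_nodes_classical(f: int) -> int:
    return 3 * f + 1   # classical Tendermint

def min_nodes_fast(f: int) -> int:
    return 5 * f + 1   # Fast Tendermint: more nodes, one fewer message step

for f in (1, 2, 3):
    print(f"F={f}: classical needs {min_nodes_classical(f)}, "
          f"fast needs {min_nodes_fast(f)}")
# F=1: classical needs 4, fast needs 6
```

So to tolerate a single Byzantine node, Fast Tendermint needs six validators where classical Tendermint needs four – the price paid for the shorter communication path.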

Manu, one of Informal Systems’ protocol designers, did the actual protocol design: the research, the English description of Fast Tendermint, the draft proofs. That work is not outsourced to AI. What changed was everything downstream of the design.

Two things already existed that made this possible: a Quint specification for the original Tendermint, and Manu’s English description of the new variant. The experiment was about what you can do with those two artefacts when you add AI and Quint’s tooling to the workflow.

The Four-Step Workflow

Step 1: Spec change. Feed the AI the existing Quint spec and the English protocol description of the new variant. AI translates the English changes into a modified Quint spec. After each iteration, run quint parse and quint typecheck – structural validation before behavioural validation. This is a tight feedback loop: catch syntax and type errors immediately, don’t let them accumulate.
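The feedback loop can be sketched as a small gate that runs the structural checks named above after each AI iteration. The `quint parse` and `quint typecheck` commands are the Quint CLI’s own; the wrapper itself is purely illustrative.

```python
# Hedged sketch of the Step 1 loop: run structural checks on the spec after
# each AI edit, before any behavioural validation is attempted.
import subprocess

def structurally_valid(spec_path: str) -> bool:
    for cmd in (["quint", "parse", spec_path],
                ["quint", "typecheck", spec_path]):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            print("quint CLI not found on PATH")
            return False
        if result.returncode != 0:
            print(f"{cmd[1]} failed:\n{result.stderr}")
            return False
    return True

# Gate behavioural work on structural validity:
# if structurally_valid("fast_tendermint.qnt"): ...explore the spec...
```

Catching a type error here costs seconds; letting it surface during model checking or code generation costs a full round trip.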

Step 2: Spec validation. This is where human time goes. The spec is structurally sound; now verify it’s behaviourally correct. Query it interactively: “Can this re-broadcast mechanism trigger?” “When does the alternative decision path activate?” “If I model too many Byzantine nodes, does disagreement become possible?” AI assists with tool invocations, but domain expertise drives the questions. You’re building confidence through exploration – checking that every flow is reachable, that properties hold, that your assumptions are accurate.
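The shape of those interactive queries is worth seeing. The following is a toy Python sketch – not the real Malachite spec, and the “re-broadcast” mechanism is invented for illustration – of asking whether a behaviour can occur and getting back a witness trace showing how:

```python
# Toy exploration loop: search for a witness trace to a target behaviour.
def steps(state):
    # Hypothetical states: (round, votes_seen). Re-broadcast is modelled as
    # becoming possible once enough votes arrive without a decision.
    rnd, votes = state
    out = []
    if votes < 3:
        out.append(("receive_vote", (rnd, votes + 1)))
    if votes >= 2:
        out.append(("rebroadcast", (rnd + 1, 0)))
    return out

def find_trace(predicate, init=(0, 0), depth=6):
    """Depth-first search for a trace to a state satisfying the predicate."""
    stack, seen = [(init, [])], set()
    while stack:
        state, trace = stack.pop()
        if predicate(state, trace):
            return trace                  # concrete evidence, not intuition
        if state in seen or len(trace) >= depth:
            continue
        seen.add(state)
        for label, nxt in steps(state):
            stack.append((nxt, trace + [label]))
    return None

# "Can the re-broadcast mechanism trigger?" -> yes, and here is how:
print(find_trace(lambda s, t: t and t[-1] == "rebroadcast"))
```

A `None` answer is just as valuable: it tells you a flow you believed possible is in fact dead, which is exactly the kind of assumption-check this phase exists for.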

On the Fast Tendermint spec, this phase found two bugs in Manu’s English description in one afternoon. Small bugs, easy to fix at the spec level. The kind of bugs you would have discovered three weeks into an implementation, after significant context-switching cost.

Step 3: Code change. With a validated spec in hand, generate the implementation. The critical difference from standard AI-assisted coding: AI gets the old Quint spec, the new Quint spec, and the diff between them as its input – not an ambiguous English description. Formal models rather than prose. The implementation still requires expertise (Malachite has architectural complexity that isn’t in the abstract spec: storage decisions, optimisation layers, component boundaries), but the protocol logic is precisely defined.

Step 4: Code validation. AI generates glue code that connects the spec scenarios from Step 2 to the implementation. The same scenarios that validated the spec are replayed against the code. If the spec and the code produce identical outcomes, the validation confidence transfers. This becomes a test suite in CI – not tests that check what the developer thought to test, but tests derived from the formal model of correct behaviour.
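The mechanics of that replay can be sketched in miniature. Everything below is a toy – an invented spec, an invented implementation – but it shows the structure: a pure spec transition function, an implementation with richer internals, and a conformance check comparing the two on an abstract view of the state.

```python
# Hedged sketch of model-based testing: replay spec scenarios against the code.

# Abstract "spec": a pure transition function over abstract state.
def spec_step(state, event):
    rnd, decided = state
    if event == "timeout":
        return (rnd + 1, decided)
    if event == "quorum":
        return (rnd, True)
    return state

# "Implementation": richer internals, but its abstract view must match the spec.
class Consensus:
    def __init__(self):
        self.round = 0
        self.decided = False
        self.log = []                      # implementation detail, not in spec
    def handle(self, event):
        self.log.append(event)
        if event == "timeout":
            self.round += 1
        elif event == "quorum":
            self.decided = True
    def abstract_state(self):
        return (self.round, self.decided)

def conforms(scenario):
    """Replay one scenario through spec and implementation; compare each step."""
    spec_state, impl = (0, False), Consensus()
    for event in scenario:
        spec_state = spec_step(spec_state, event)
        impl.handle(event)
        if impl.abstract_state() != spec_state:
            return False                   # divergence caught mechanically
    return True

print(conforms(["timeout", "timeout", "quorum"]))
```

In the real workflow the scenarios come from the validated spec exploration, not from a developer’s imagination – that is what makes the transferred confidence meaningful.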

The Key Reframe: AI as Translator

There’s a specific reason this workflow works, and it’s worth being explicit about it.

LLMs are good at translation between representations. They translate English into Quint, Quint into code, test scenarios into glue code. These are hard tasks – the models are genuinely impressive at them. What LLMs are not good at is design: reasoning about protocol correctness, understanding Byzantine fault tolerance guarantees, knowing whether a proposed mechanism actually achieves the property it’s supposed to achieve.

The workflow is built around this asymmetry. Humans do design (Manu) and validation (the spec exploration phase). AI does translation (English to spec, spec change to implementation). The spec becomes the boundary: it’s the thing humans validate, and the thing AI translates from. Because the spec is validated before code generation starts, the translation target is correct. AI can still get the translation wrong, but model-based testing catches those errors mechanically.

This is a significant reframe of what “AI coding” means. The productivity gain isn’t AI generating architecture – it’s AI handling the translation layers between levels of abstraction efficiently, while humans retain control of the design decisions that determine correctness.

I’ve written before about the validation gap in AI-assisted development and the difference between plausibility and correctness in AI outputs. This workflow is the most concrete answer I’ve seen to both problems: it introduces a layer that is both human-readable (abstract enough to reason about) and mechanically verifiable (executable enough to check). The spec is the thing that makes validation tractable.

Model-Based Testing: Why This Scales

The reason code review doesn’t scale to AI-generated code isn’t that reviewers are careless. It’s that the volume and complexity of AI-generated diffs exceed what code review was designed for. You can review a hundred-line change carefully. A five-hundred-line change touching consensus logic is a different problem.

Model-based testing provides a validation mechanism that doesn’t degrade with diff size. The spec is your model of correct behaviour. The test suite derived from the spec checks that the implementation matches that model. If the spec is validated and the code matches the spec, you have confidence – not because you reviewed every line, but because the comparison is mechanical.

Informal Systems is also working on trace validation: taking execution traces from real environments (testnets, production) and verifying that those traces are compliant with the spec. This closes the remaining gap. You’re not only checking that the code can behave like the spec; you’re checking that it actually does.
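The core of trace validation is small enough to sketch. Again a toy model, not the real tooling: given a trace of abstract states recorded from a running node, check that every observed step is a transition the spec allows.

```python
# Illustrative trace validation: is every recorded step a legal spec transition?
def allowed_next(state):
    # Toy spec: from (round, decided), a node may time out or decide.
    rnd, decided = state
    if decided:
        return set()
    return {(rnd + 1, False), (rnd, True)}

def trace_compliant(trace):
    """True iff each consecutive pair of states is a legal spec transition."""
    return all(b in allowed_next(a) for a, b in zip(trace, trace[1:]))

good = [(0, False), (1, False), (1, True)]
bad  = [(0, False), (2, False)]            # skipped a round: not in the spec
print(trace_compliant(good), trace_compliant(bad))
```

The non-trivial engineering in practice is mapping concrete node state onto the spec’s abstract state; once that mapping exists, the compliance check itself is mechanical.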

The spec also serves as a debugging reference. When AI is debugging test failures, it can read the spec and rule out hypotheses that contradict it. “The spec says we don’t broadcast X during Y – that’s by design, not a bug” is a concrete stop on a debugging path that would otherwise involve second-guessing every assumption. With an English spec, that confidence isn’t available. With a validated Quint spec that’s been thoroughly explored, it is.

When This Approach Makes Sense

Writing the first Quint spec for a system is a few days of work for a complex protocol. That’s a real cost, and it’s worth being honest about when the ROI is there.

The case is strongest for: distributed systems (consensus, CRDTs, state machines), protocol implementations where correctness is a hard requirement, security-critical code, and anything that changes regularly over a long lifetime. Malachite is at the high end – BFT consensus for a production financial system. But the pattern scales down. Any system where you’ve said “I’m not sure this is actually correct, and I’m not sure how to verify it” is a candidate.

The case is weaker for CRUD services, straightforward APIs, and business logic where correctness means “it does what the ticket says.” For those, good tests and clear acceptance criteria are sufficient. Formal specs are overhead when the correctness requirements are simple and testable directly.

The separation of concerns in AI agent pipelines applies here too: formal specs are most valuable at the boundaries where correctness requirements are hard and verification is non-obvious. For everything else, they’re probably not worth the investment.

The Failure Mode We’ve All Hit

The failure mode isn’t “AI wrote bad code.” It’s “AI wrote code we couldn’t validate.” Those are different problems with different solutions.

Informal Systems’ result – a complex BFT consensus change, traditionally two months, done in a week with high confidence – is a proof point worth taking seriously. The mechanism is specific and replicable: executable specifications as the validation layer between English design and code implementation, with model-based testing bridging spec confidence to implementation confidence.

Formal methods have a reputation for being academic and impractical. That reputation is partly deserved – the tooling has historically been difficult and the learning curve steep. But the emergence of LLMs changes the economics. If AI can handle the translation work (English to spec, spec to code), and humans can focus on the validation work (which is where the expertise actually matters), the upfront cost of writing a formal spec starts to look very different.

Not every codebase needs this. But for the systems where correctness is non-negotiable, executable specifications are no longer a research curiosity. They’re a practical answer to the validation problem – and a week versus months is the kind of proof point that earns a second look.