This post covers Mistral’s Leanstral release, March 2026.


What’s New This Week

Mistral released Leanstral on 16 March 2026 – the first open-source AI agent built specifically for Lean 4 formal verification. Alongside it, they published FLTEval, a new benchmark grounded in real PRs to the Fermat’s Last Theorem formalisation project rather than isolated competition math. The cost numbers in the benchmark comparison are the story: Leanstral at pass@2 outscores Claude Sonnet on FLTEval at $36 vs. $549.


Changelog

| Date | Summary |
| --- | --- |
| 16 Mar 2026 | Initial publication covering the Leanstral release and what it means for engineering teams. |

The bottleneck in AI-assisted engineering has moved. For the past two years, the constraint was model capability – could the AI write code that actually worked? That problem is largely solved. The new constraint is review. AI coding tools can generate code faster than any team can read, understand, and verify it. In high-stakes domains – payment logic, cryptographic implementations, authentication systems – that review bottleneck is both a velocity problem and a risk problem. Humans miss things. Always have.

Mistral’s Leanstral is a direct attempt to shift that bottleneck again. Not by making models faster, but by changing what “verified” means.

Proving vs. Testing

Most software quality assurance is evidence-based. A test suite says: “this code produced correct outputs for these inputs, under these conditions.” That’s valuable. It’s also fundamentally limited. You can only test the cases you thought to test.

Formal verification is a different category of claim. A formal proof says: “this code is correct for all valid inputs, by mathematical necessity.” Not “it passed the test suite.” Not “I’ve reviewed it and it looks right.” The specification is a precise logical statement. If the proof compiles, the code satisfies the spec. There are no edge cases the proof doesn’t cover – because the proof covers the entire input space.
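To make the contrast concrete, here is a minimal Lean 4 sketch (the `clamp` function and the `clamp_le` theorem are illustrative, not from the Leanstral release): a `#eval` checks one point in the input space, while the theorem covers every input at once.

```lean
-- A toy clamping function.
def clamp (lo hi x : Nat) : Nat := max lo (min hi x)

-- A test exercises one point in the input space:
#eval clamp 0 100 250   -- 100

-- A proof covers the entire input space: for every lo, hi, x with
-- lo ≤ hi, the result never exceeds hi. If this compiles, there is
-- no edge case the claim doesn't cover.
theorem clamp_le (lo hi x : Nat) (h : lo ≤ hi) : clamp lo hi x ≤ hi := by
  unfold clamp
  omega
```

The `#eval` tells you about one input; the compiled theorem is the "for all valid inputs, by mathematical necessity" claim the paragraph above describes.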

This is why formal verification is standard in specific high-stakes domains: cryptographic protocol implementations, safety-critical control systems, compilers. In those contexts, “it looked fine in review” is not an acceptable quality gate. The cost of a missed edge case is too high.

The historical objection to formal verification in mainstream software engineering has never been “this doesn’t work.” It’s been “this is too expensive.” Writing formal specifications requires specialist expertise. Running verification tooling requires infrastructure. The whole thing takes longer than shipping. That calculus is what Leanstral is attempting to change.

What Lean 4 Is

Lean 4 is a proof assistant and dependently typed programming language. The proof assistant part is what matters here: Lean 4 can express precise logical specifications and verify that code satisfies them. It’s the same tool being used for the large-scale formalisation of Fermat’s Last Theorem – one of the most complex proofs in mathematics – and for verifying properties of Rust programs.

The fact that Lean 4 is both a programming language and a proof assistant matters for the AI integration story. You’re not working with a separate verification layer bolted onto existing code. You write the specification, you write (or generate) the implementation, and Lean 4 tells you whether the implementation satisfies the spec. The feedback is immediate and binary: it compiles or it doesn’t.
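A minimal sketch of that single-file loop (the names `AbsDiffSpec` and `absDiff` are invented for illustration): the specification, the implementation, and the theorem connecting them live together, and the compiler delivers the binary verdict.

```lean
-- The specification: a precise statement of what any
-- absolute-difference function must satisfy.
def AbsDiffSpec (f : Nat → Nat → Nat) : Prop :=
  ∀ a b, f a b = f b a ∧ f a a = 0

-- The implementation (this is the part that could be AI-generated).
def absDiff (a b : Nat) : Nat := max a b - min a b

-- The verification: if this theorem compiles, the implementation
-- satisfies the spec for all inputs. No separate tooling layer.
theorem absDiff_meets_spec : AbsDiffSpec absDiff := by
  unfold AbsDiffSpec absDiff
  intro a b
  constructor <;> omega
```

If the implementation were changed to something that violates the spec, the theorem would simply fail to compile: that failure, with its precise error, is the feedback signal the agent iterates on.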

That binary feedback is exactly what makes Lean 4 tractable as an environment for an AI agent. The model can iterate on a proof, get precise error feedback from the Lean language server, and converge on a correct solution without human intervention. This is what Leanstral is built to do.

Leanstral: The Model

Leanstral-120B-A6B is a mixture-of-experts model with 120 billion total parameters and 6 billion active parameters per forward pass. The sparse architecture is deliberate: it makes the model efficient enough to run with parallel inference strategies (pass@N) at costs that compete with much larger dense models. It’s Apache 2.0 licensed, available as weights for self-hosting, through a free API endpoint, and integrated directly into Mistral’s Vibe agent mode.

The MCP integration is worth noting. Leanstral was specifically trained with lean-lsp-mcp – the Lean Language Server MCP – which means it can interface directly with Lean’s language server for real-time proof feedback during generation. This isn’t a model that generates Lean code and hopes it works. It can query the language server mid-proof and adjust.

FLTEval: A Benchmark That Matters

Alongside Leanstral, Mistral published FLTEval. Most AI math benchmarks test isolated competition problems – self-contained proofs with known solutions. That’s fine for measuring mathematical reasoning, but it doesn’t tell you much about how a model performs in real proof engineering work.

FLTEval tests something different: completing formal proofs and correctly defining new mathematical concepts in actual PRs to the FLT (Fermat’s Last Theorem) formalisation project. This is a living, complex proof repository. The model has to understand the surrounding context, work within an existing proof structure, and produce contributions that compile cleanly against the repository state.

The cost and performance comparison against Claude models is striking:

| Model | Cost ($) | FLTEval Score |
| --- | --- | --- |
| Claude Haiku | 184 | 23.0 |
| Claude Sonnet 4.6 | 549 | 23.7 |
| Claude Opus 4.6 | 1,650 | 39.6 |
| Leanstral (single pass) | 18 | 21.9 |
| Leanstral pass@2 | 36 | 26.3 |
| Leanstral pass@4 | 72 | 29.3 |
| Leanstral pass@8 | 145 | 31.0 |
| Leanstral pass@16 | 290 | 31.9 |

At pass@2 ($36), Leanstral scores 26.3 against Sonnet’s 23.7 at $549. That’s 15x cheaper for a higher score on a task the model was specifically built for. Opus leads at 39.6, but at $1,650 – nearly 6x Leanstral’s pass@16 cost of $290, which scores 31.9.

The cost differential matters because formal verification has historically been priced out of most engineering contexts. At $549 per run for a frontier model that scores 23.7, you wouldn’t use this routinely. At $36 for something that beats it, the calculus shifts.

The Vibe Coding Problem

Vibe coding – generating code from natural language descriptions without deeply understanding the output – is increasingly common. Cursor, Copilot, Claude Code, Codex: the toolchain encourages fast generation and iterative refinement. The tacit assumption is that a human with sufficient expertise can evaluate the output. In practice, that assumption is under pressure. As code generation velocity increases, the review bandwidth doesn’t scale with it.

Mistral frames Leanstral as the answer to this: “Instead of debugging machine-generated logic, humans dictate what they want.” The specification is the thing a human writes and understands. The implementation is what the agent proves. You don’t need to read the proof – you need to trust that it compiles.

This is a significant shift in the mental model of code review. Instead of reviewing implementation, you review specification. If the spec is correct and the proof compiles, the implementation is correct by construction. The human never needs to understand how the code works – only what it’s supposed to do.

For teams already thinking about AI acceptance criteria and quality gates, this maps cleanly: the spec is the acceptance criterion, and the compiled proof is the passing gate. It’s complementary to the formal specs approach in the LLM era – you’re writing the spec anyway; now you can verify the output against it automatically. And it addresses the agent trust and pipeline hardening question directly: instead of trusting the agent’s output, you verify it.

What Changes for Engineering Teams

The practical applications cluster around high-stakes logic where current review processes are bottlenecks and single points of failure:

Payment and financial logic. Authorisation rules, idempotency guarantees, rounding behaviour: these are exactly the kind of properties that are hard to test exhaustively and catastrophic to get wrong. A formal spec that an AI agent then proves the implementation against is a stronger guarantee than a test suite.

Cryptographic implementations. Formal verification of cryptographic properties is already standard in security-critical open-source projects. Lowering the cost of generating those proofs makes it tractable for more teams.

Compiler and interpreter correctness. Lean 4 already has a track record here. AI-assisted proof generation extends that to teams without deep formal methods expertise on staff.

Auditing AI-generated code. Instead of reading AI-generated implementations to check them, you write the spec and let Leanstral prove them. The audit becomes specification review rather than code review.
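As a toy illustration of the idempotency case (the `Payment` structure and `capture` function here are hypothetical, far simpler than real payment state): the guarantee that a retried operation cannot double-apply can be stated and proved once, for every possible state.

```lean
-- Hypothetical toy model of payment state.
structure Payment where
  captured : Bool
  amount   : Nat

-- Capturing a payment marks it captured; the amount is unchanged.
def capture (p : Payment) : Payment := { p with captured := true }

-- Idempotency guarantee: capturing twice is identical to capturing
-- once, for every possible payment state. A retry can never
-- double-apply the operation.
theorem capture_idem (p : Payment) : capture (capture p) = capture p := rfl
```

A test suite would check this for a handful of constructed payments; the theorem rules it out for all of them.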

MCP integration is the practical enabler: coding agents can call Lean’s language server directly, which means verification can be part of the generation loop rather than a separate post-hoc step.

What Doesn’t Change Yet

Honest caveats are warranted.

Most production codebases are not formally specified. Converting existing code to a formally verified state is a significant upfront investment. Leanstral doesn’t help with that – it helps with new code written against a spec. The adoption path is greenfield or high-stakes components, not bulk migration.

Writing good Lean 4 specifications still requires expertise. The spec is the thing a human has to understand and vouch for. If the spec is wrong, the proof proves the wrong thing. Garbage in, garbage out – just with more mathematical rigour. Teams need to either develop that expertise or hire it.
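A toy example of that failure mode (`WeakSortSpec` is invented for illustration): a spec that under-constrains the behaviour can be satisfied by a wrong implementation, and Lean will happily compile the proof.

```lean
-- A spec that *sounds* like "sorts the list" but only pins the length.
def WeakSortSpec (sort : List Nat → List Nat) : Prop :=
  ∀ xs, (sort xs).length = xs.length

-- The identity function satisfies it. The proof is fine; the spec is
-- wrong. Verification is only as strong as the specification.
theorem id_meets_weak_spec : WeakSortSpec id :=
  fun _ => rfl
```

The proof compiling tells you the implementation matches the spec, not that the spec captures what you actually wanted. That judgment stays with the human.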

FLTEval is a strong benchmark, but formal mathematics is more constrained than typical production software. The types of properties you can express as Lean 4 specifications are not unlimited. Complex system interactions, performance guarantees, distributed consistency properties – these are harder to specify and verify than functional correctness of pure functions.

And Leanstral is not a replacement for general code review. It’s a complement for the subset of code where formal verification is both feasible and valuable.

The Shift That Matters

Leanstral doesn’t make AI-generated code safe by default. Writing a bad specification and proving it is no better than testing with inadequate coverage. The model can fail on sufficiently complex proofs. Lean 4 expertise remains a constraint.

What Leanstral changes is the economics. Formal verification has been technically possible for decades. It’s been economically impractical for most teams. At $36 for a result that beats frontier models on real proof engineering work, the cost objection to formal verification in high-stakes components starts to look weaker. Not gone – but weaker.

The bottleneck in AI-assisted engineering has moved from generation to review. Leanstral is a bet that it can move again – from review to specification. If that bet lands, the meaningful human contribution to high-stakes code becomes writing precise descriptions of what code should do, not reading implementations to check whether it does.

That’s a different job. It’s also, arguably, the more valuable one.