This post discusses an AI system that automates part of the Linux kernel review process. All statistics are from the Sashiko project’s own measurements and Roman Gushchin’s public announcement.


What’s New

Sashiko was publicly announced by Roman Gushchin on 18 March 2026. The project has been transferred to the Linux Foundation for hosting, with Google continuing to fund the token budget and infrastructure. This is the initial publication covering the announcement.


Changelog

Date          Summary
18 Mar 2026   Initial publication covering Sashiko’s public launch and nine-stage review pipeline.

Every patch submitted to the Linux kernel mailing list is now being read by an AI agent before a human maintainer sees it.

Sashiko – built by Roman Gushchin of Google’s Linux kernel team – is an agentic code review system that monitors the linux-kernel mailing list and runs every incoming patch through a nine-stage review pipeline. It’s open-source, now hosted under the Linux Foundation, and funded by Google. In testing against 1,000 confirmed upstream bugs (commits carrying “Fixes:” tags), it caught 53% of them. Every single one of those bugs had already cleared the full human review pipeline. They were submitted, reviewed, merged, and only later identified as bugs. Sashiko would have flagged more than half of them before they landed.

That’s the number worth sitting with.

The scale problem

The Linux kernel receives thousands of patches per day from hundreds of contributors across the world. Maintainers – the engineers with the subsystem expertise to evaluate whether a patch is correct, safe, and consistent with the kernel’s conventions – are not a scalable resource. There is no way to hire your way out of this. The knowledge required to review a memory management patch, or a driver for a specific piece of hardware, or a change to the scheduler, takes years to accumulate and is specific to individual subsystems.

The review bottleneck is real and documented. Patches sit in the queue. Some get reviewed quickly; others wait. Some get merged with issues that are only found later, when something breaks in production on hardware that nobody tested, or when a security researcher looks at the right code path.

Sashiko doesn’t try to replace that expert judgment. It augments it – running a pass over every patch before it reaches a maintainer, flagging the categories of issues that are most likely to slip through under time pressure.

What Sashiko does

The name comes from 刺し子 – a Japanese textile technique, literally “little stabs”, used to reinforce fabric at points of wear. The metaphor is deliberate: not replacing the fabric, reinforcing it.

The review pipeline has nine stages, each designed to mimic a different specialist reviewer:

Stage 1 looks at the big picture: does the patch make architectural sense? Does it introduce UAPI breakages or conceptual problems before you’ve even looked at the implementation?

Stage 2 checks whether the code actually does what the commit message claims. Missing pieces, undocumented side-effects, API contract violations.

Stage 3 traces execution flow: logic errors, missing return-value checks, unhandled error paths, off-by-one errors in bounds handling.

Stage 4 focuses on resource management: memory leaks, use-after-free, double frees, object lifecycle issues across queues, timers, and workqueues.

Stage 5 investigates concurrency: deadlocks, RCU rule violations, thread-safety, race conditions.

Stage 6 runs a security audit: buffer overflows, out-of-bounds reads and writes, TOCTOU races, information leaks from uninitialized memory.

Stage 7 applies a hardware engineer’s lens: register accesses, DMA mapping, memory barriers, state machine correctness – the things that only matter in driver code but matter a lot when they’re wrong.

Stage 8 is where the pipeline earns its keep on false positives. It consolidates findings from stages 1 through 7, deduplicates overlapping concerns, and attempts to logically prove or disprove each finding before it surfaces. This is what keeps the false positive rate in check.

Stage 9 generates the output: a polite, standard, inline-commented email reply in LKML format.
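The orchestration shape of a pipeline like this can be sketched in Rust, the language Sashiko is written in. Everything below is illustrative, not Sashiko's actual code: in the real system each stage is an LLM pass with a specialised prompt, while here each stage is a trivial keyword stand-in so the structure is visible on its own.

```rust
// Illustrative sketch of a staged review pipeline. The `Stage` trait,
// `Finding` type, and keyword-matching stages are all hypothetical
// stand-ins for Sashiko's LLM-backed stages.

#[derive(Debug, Clone, PartialEq)]
struct Finding {
    stage: &'static str,
    line: usize,
    message: String,
}

trait Stage {
    fn name(&self) -> &'static str;
    fn review(&self, patch: &str) -> Vec<Finding>;
}

// Stand-in for Stage 4 (resource management): flag allocation sites
// as a placeholder for real leak analysis.
struct ResourceStage;

impl Stage for ResourceStage {
    fn name(&self) -> &'static str { "resource" }
    fn review(&self, patch: &str) -> Vec<Finding> {
        patch.lines().enumerate()
            .filter(|(_, l)| l.contains("kmalloc"))
            .map(|(i, _)| Finding {
                stage: self.name(),
                line: i + 1,
                message: "allocation: check for a matching kfree on every path".into(),
            })
            .collect()
    }
}

// Stand-in for Stage 5 (concurrency).
struct ConcurrencyStage;

impl Stage for ConcurrencyStage {
    fn name(&self) -> &'static str { "concurrency" }
    fn review(&self, patch: &str) -> Vec<Finding> {
        patch.lines().enumerate()
            .filter(|(_, l)| l.contains("spin_lock"))
            .map(|(i, _)| Finding {
                stage: self.name(),
                line: i + 1,
                message: "lock taken: verify every return path unlocks".into(),
            })
            .collect()
    }
}

// Stages run independently; a consolidation pass (Stage 8 in Sashiko)
// then merges duplicate findings that point at the same issue.
fn run_pipeline(stages: &[Box<dyn Stage>], patch: &str) -> Vec<Finding> {
    let mut findings: Vec<Finding> =
        stages.iter().flat_map(|s| s.review(patch)).collect();
    findings.sort_by_key(|f| f.line);
    findings.dedup_by(|a, b| a.line == b.line && a.message == b.message);
    findings
}
```

The point of the shape is that stages stay independent and composable; adding a tenth perspective means adding one more `Box<dyn Stage>`, not touching the others.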

The system is self-contained – no external agentic CLI tools – and currently supports Gemini (primary, tested with Gemini 3.1 Pro) and Claude. Ingestion monitors lore.kernel.org for new submissions. There’s a web interface at sashiko.dev. It’s written in Rust.

What “53% of bugs human reviewers missed” actually means

The framing matters here. Gushchin’s statement – “100% of these issues were missed by human reviewers” – is not a criticism of kernel maintainers. It’s a description of what the benchmark is measuring.

The test corpus is specifically bugs that made it into the main tree: real bugs, confirmed by the fact that someone later submitted a fix with a “Fixes:” tag pointing back to the original commit. These are not theoretical issues. They passed the full human review process – submitted, reviewed by knowledgeable engineers, merged by a maintainer – and were still wrong.
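The kernel convention for these trailers is a line of the form `Fixes: <abbreviated-sha> ("subject")` in the fixing commit's message. A corpus like this can be assembled by scanning git history for such trailers and pairing each fix with the commit it points at. The parser below is a sketch of that extraction step, not Sashiko's actual tooling; the example commit message is illustrative.

```rust
// Extract the abbreviated SHA from each `Fixes:` trailer in a commit
// message, e.g.:
//
//     Fixes: 54a611b60590 ("Maple Tree: add new data structure")
//
// Sketch only; real corpus-building would also resolve each SHA
// against git history to recover the original (buggy) commit.

fn fixes_targets(commit_message: &str) -> Vec<String> {
    commit_message
        .lines()
        .filter_map(|line| {
            let rest = line.trim().strip_prefix("Fixes:")?;
            // The SHA is the first whitespace-separated token after the tag.
            let sha = rest.split_whitespace().next()?;
            // Sanity check: abbreviated SHAs are hex, conventionally 12+ chars.
            if sha.len() >= 12 && sha.chars().all(|c| c.is_ascii_hexdigit()) {
                Some(sha.to_string())
            } else {
                None
            }
        })
        .collect()
}
```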

Sashiko caught 53.6% of them retroactively, running against commits that had already landed.

The claim is not “Sashiko catches 53% of all bugs in the kernel.” It’s narrower and more useful than that: of bugs that passed human review and were later confirmed as bugs, Sashiko would have flagged more than half before they were merged. That’s a meaningful signal. The baseline it’s beating is not zero – it’s the full existing review process.

False positives and workflow overhead

The false positive rate is under 20%, and Gushchin notes that the majority of that is in a “gray zone” – borderline issues, debatable style concerns, things where a reasonable reviewer might disagree rather than clear false alarms.

In practice, this means maintainers processing Sashiko’s output will find that roughly four out of five flagged issues are worth looking at. The fifth requires a judgment call that resolves in favour of the existing code. That’s overhead, but it’s bounded overhead with a clear upside.

The question for any team adopting this pattern is whether the workflow cost of processing false positives is worth the genuine catches. At 80%+ signal, the answer is almost certainly yes – provided the volume of flags per patch is manageable and the output format is useful. Sashiko’s Stage 8 verification pass, which attempts to disprove findings before surfacing them, is specifically designed to keep that ratio healthy.
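To make the overhead concrete, the arithmetic is simple. Only the sub-20% false positive rate comes from the announcement; the flag volume below is a hypothetical example.

```rust
// Back-of-envelope reviewer workload implied by a false positive
// rate: split an expected flag volume into (actionable, judgment-call)
// counts. Illustrative only; the 0.20 rate is the announced upper
// bound, the volume is made up.

fn expected_split(flags: f64, fp_rate: f64) -> (f64, f64) {
    (flags * (1.0 - fp_rate), flags * fp_rate)
}
```

At 100 flags and a 20% rate, that is 80 findings worth acting on against 20 judgment calls, which is the "bounded overhead with a clear upside" trade described above.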

Why the Linux kernel is the right proving ground

The kernel is a deliberately difficult test case. It has real-time constraints, multiple memory topologies (SMP, NUMA), complex and sometimes undocumented locking hierarchies, architecture-specific code paths, hardware driver constraints, a strict no-regressions policy, and decades of accumulated conventions. The gap between “compiles cleanly” and “correct kernel code” is enormous.

Most of that accumulated knowledge lives in the heads of maintainers and in git history, not in any formal specification. There is no complete, machine-readable description of all the locking invariants, all the valid DMA patterns, all the subsystem-specific conventions a patch has to respect.

If agentic code review works at the kernel level – catching memory safety issues, concurrency bugs, and architectural problems in code that experts regularly miss – it works in most enterprise codebases. The kernel is the hardest case. Proving the technique here is a meaningful data point.

What the template means for other codebases

The architectural pattern Sashiko uses is replicable. Decompose the review problem into specialised perspectives. Give each stage a focused prompt. Aggregate the findings. Run a synthesis stage that attempts to disprove weak findings before they surface. Generate output in the format reviewers actually use.
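The synthesis step – attempt to disprove each finding before it surfaces – is the piece of the template most worth copying, because it is what keeps the false positive rate bounded. A minimal sketch of the control flow, with all names hypothetical: in Sashiko the verifier is another LLM pass, while here it is an injected closure so the pattern stands alone.

```rust
// Sketch of the "disprove before surfacing" pattern. Raw findings
// from the specialised stages are re-examined by a verifier; only
// findings the verifier fails to disprove reach the reviewer.

#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    message: String,
    confidence: f64, // verifier's belief the finding is real, 0.0..=1.0
}

/// Re-score each raw finding with `verify` and keep only those at or
/// above `threshold`. In a real system `verify` would be an LLM call
/// prompted to argue against the finding.
fn synthesize<F>(raw: Vec<Candidate>, verify: F, threshold: f64) -> Vec<Candidate>
where
    F: Fn(&Candidate) -> f64,
{
    raw.into_iter()
        .map(|mut c| {
            c.confidence = verify(&c);
            c
        })
        .filter(|c| c.confidence >= threshold)
        .collect()
}
```

The design choice to make this a separate stage, rather than asking each specialist stage to self-censor, is what lets the early stages stay aggressive: recall first, precision restored at the end.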

This is not exotic. It’s the same multi-stage agent pipeline approach that’s emerging across the industry for complex reasoning tasks – the same intuition that says one general-purpose pass misses things that a sequence of focused passes would catch.

The formal verification work happening in parallel is chasing stronger guarantees, but at higher cost and narrower applicability. Code review augmentation is the lower-friction path: it works on existing codebases, requires no formal specifications, and operates in the same workflow reviewers already use.

Any team with a large enough codebase that review bottlenecks are real – and that’s most engineering organisations above a certain size – now has a working template. The subsystem-specific prompts matter (Sashiko uses kernel-specific context developed by Chris Mason, alongside generic prompts), but the structure is portable. The same logic applies to LLM-defined acceptance criteria for code: the value comes from systematically covering ground that human attention misses under time pressure.

The actual claim

Sashiko is not trying to replace kernel maintainers. It’s covering the ground the review process couldn’t cover at current patch volume, using a model that has read more kernel code than any individual reviewer and can apply that knowledge in parallel across every patch, every day, without fatigue.

The 53% number is not a ceiling. It’s a starting point, running against unfiltered commits with a model that will be superseded. The false positive rate will improve. The subsystem-specific prompts will get sharper.

The more interesting question is what review quality looks like in two years, when the model has seen another two years of kernel history, when the prompts have been refined against real feedback from maintainers, and when the verification stage has been tuned against a larger sample of false positives. The baseline it’s already beating is not nothing. What comes next is the part worth watching.