Machine Translation Just Covered 1,600 Languages. Your Localisation Stack Is About to Get Simpler.
Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New This Week
Meta published the Omnilingual MT paper on March 17, 2026 (arXiv: 2603.16309). It’s now picking up traction on Hacker News. The paper describes the first machine translation system benchmarked across more than 1,600 languages, led by Marta R. Costa-Jussà. The companion BOUQuET evaluation benchmark is live on HuggingFace at facebook/bouquet.
Changelog
| Date | Summary |
|---|---|
| 21 Mar 2026 | Initial publication covering Meta’s Omnilingual MT paper and its implications for globalisation engineering. |
The number that gets attention is 1,600. Meta’s new Omnilingual MT system covers more than 1,600 languages for machine translation. Their previous work – NLLB (No Language Left Behind), released in 2022 – covered around 200. That’s an 8x jump in language coverage.
The product framing writes itself: more languages, better access, digital inclusion. That’s all true. But that’s not the interesting engineering story.
The interesting story is: how do you build and evaluate a translation system for languages where almost no labelled data exists?
The NLLB Baseline and Its Limits
NLLB was already a significant achievement. Before it, most MT systems were trained on high-resource language pairs where parallel corpora are abundant – English-French, English-Spanish, English-Mandarin. Languages with smaller digital footprints were either unsupported or handled badly.
NLLB expanded coverage to around 200 languages by investing heavily in data collection for low-resource languages. Quality improved considerably. But 200 is still a small fraction of the world’s approximately 7,000 living languages.
LLMs didn’t solve this. Large language models improved translation quality considerably for languages they’d seen in pre-training. But they didn’t expand language coverage in any systematic way. If your language wasn’t in the training data, LLMs weren’t going to help you. Language coverage and translation quality turn out to be separate problems.
The Evaluation Problem
Going from 200 to 1,600 languages surfaces a harder problem than training: how do you know if the translation is any good?
For high-resource languages, automatic evaluation is straightforward. You have reference translations. You run BLEU or chrF, compare against human judgements, iterate. The feedback loop is tight.
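To make the tight feedback loop concrete, here is a minimal, simplified sketch of a character n-gram F-score in the spirit of chrF. The real metric (as implemented in tools like sacreBLEU) handles whitespace, corpus-level aggregation, and scoring conventions differently; this stripped-down version just shows why the loop is cheap when a reference translation exists.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (as chrF does by default)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average char n-gram precision/recall, F-beta combined."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # strings too short for this n
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily, matching chrF's beta = 2
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

The whole loop is a function call against a reference string, which is exactly what vanishes for a language with no reference corpus.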
For a language with minimal digital text, you often have no reference corpus at all. You can’t run standard automatic metrics. Human evaluation doesn’t scale to 1,600 languages. And the translators who can actually evaluate quality for rare languages are not abundant.
Meta’s answer is BOUQuET – the Benchmark and Open initiative for Universal Quality Evaluation in Translation. It’s a human-created evaluation set, built specifically for low-resource languages and designed to evolve toward full omnilinguality over time. The key design decision: it’s dynamically updated. The benchmark grows as coverage grows. It’s available on HuggingFace (facebook/bouquet) and includes Met-BOUQuET, a variant for automatic evaluation.
The framing in the paper is “omnilinguality by design” – the idea that building systems capable of handling every language should be a design constraint from the start, not a retrofit. BOUQuET is the artefact that makes that constraint measurable.
What This Means for Globalisation Engineering
If you work on a product that ships to multiple markets, you have a localisation stack. At minimum: string extraction, a translation management system (TMS), possibly professional translators for high-stakes content, and a process for keeping translations in sync with product changes.
The economics of this stack are dominated by language coverage decisions. You support the big markets. You deprioritise or drop the small ones. The calculus is simple: professional translation per word times word count times languages is expensive, and the ROI for a language spoken by 500,000 people rarely pencils out.
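The arithmetic behind that calculus is blunt. A back-of-envelope version, with illustrative (made-up) rates:

```python
def human_translation_cost(rate_per_word: float, words: int, languages: int) -> float:
    """Professional translation cost: per-word rate x word count x language count."""
    return rate_per_word * words * languages

# Illustrative numbers: $0.12/word, a 50,000-word product surface, 25 languages.
cost = human_translation_cost(0.12, 50_000, 25)
print(f"${cost:,.0f}")  # prints $150,000
```

Every additional language adds a full copy of that per-language cost, which is why the long tail gets cut first.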
MT has been eating into that calculus for years. For major languages, MT quality is now good enough for a lot of use cases – product UI, help documentation, in-app copy. The cost per word for MT is effectively zero.
What the Omnilingual MT work does is extend that dynamic into the long tail. If you have users in a language you’ve never supported because translation was cost-prohibitive, that barrier is getting lower. The 1,600-language figure includes a lot of languages where your product might have real users who are currently getting an English-only experience.
The practical implication isn’t that you should immediately ship your product in 1,600 languages. It’s that the decision to support a new language is increasingly a product decision, not a cost decision. The question shifts from “can we afford this?” to “should we prioritise this?”
The Quality Caveat
MT quality at the frontier of supported languages is not the same as MT quality for English-French. The Omnilingual MT system represents the first serious attempt to even benchmark across this range; the benchmark itself is designed to evolve as coverage improves.
For product teams: this matters. Using MT output for user-facing content in low-resource languages without human review carries real risk. The evaluation infrastructure is new, and for many of these languages there are limited feedback mechanisms to detect when translations are wrong or culturally off.
The responsible path for any team using MT in production is the same as it’s always been: build feedback mechanisms, monitor for quality signals, and invest in human review for high-stakes content. What’s changed is that this path now exists for languages where it previously didn’t.
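One way to operationalise that path is a routing gate in front of publication. This is a hypothetical sketch, not anything from the paper: the `confidence` field, the threshold, and the language codes are all assumptions standing in for whatever quality signals your MT provider exposes.

```python
from dataclasses import dataclass

@dataclass
class Translation:
    text: str
    language: str      # BCP 47-style code
    confidence: float  # assumed model-reported quality score in [0, 1]

# Illustrative set of languages where evaluation infrastructure is thin,
# so everything goes through human review regardless of score.
LOW_RESOURCE = {"gn", "bm", "quz"}

def route(t: Translation, high_stakes: bool, threshold: float = 0.8) -> str:
    """Decide whether MT output ships directly or is held for human review."""
    if high_stakes or t.language in LOW_RESOURCE or t.confidence < threshold:
        return "human_review"
    return "publish"
```

The design choice worth copying is the asymmetry: low-resource languages default to review even when the model is confident, because the feedback mechanisms that would catch a bad translation downstream don't exist yet.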
The Broader Signal
The NLLB-to-Omnilingual jump required not just more training data but a rethink of how quality is measured at scale. The contribution isn’t just the model. It’s the evaluation infrastructure – BOUQuET – that makes it possible to track improvement systematically as coverage expands.
That’s a pattern worth watching. Benchmarks and evaluation frameworks aren’t just academic artefacts. They’re the mechanism by which a research direction becomes an engineering discipline. When you have a reliable way to measure quality across 1,600 languages, you have a foundation to build on.
The globalisation engineering problem is getting solved underneath you. If your product has a language barrier today, that barrier has a shorter shelf life than it did a year ago.
Sources: arXiv 2603.16309 – Slator coverage – HN thread