Today's frontier models write essays, draft code, and answer legal questions across many languages. But fluency is not mastery. L26 tests a more basic question: can a model reproduce the complete closed set of units a language is built from?
Leaderboard · Blind condition
In the blind Bantu tracks, models are given the syllable-building grammar but not the calibrated onset inventory — they must recover the operating alphabet from what they internalized in training. Every score is reported out of 26.
| # | Model | License | English | Pinyin base | Pinyin toned | Bemba | Kinyarwanda | Luvale | Bantu avg | Grade |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 (OpenAI · US) | prop | 26.0 | 25.6 | 25.4 | 19.6 | 20.9 | 16.6 | 19.0 | FAIL |
| 2 | Gemini 3 (Google · US) | prop | 26.0 | 25.7 | 24.1 | 18.1 | 20.6 | 15.3 | 18.0 | FAIL |
| 3 | Claude Opus 4.8 (Anthropic · US) | prop | 26.0 | 23.9 | 25.4 | 19.6 | 17.5 | 15.3 | 17.5 | FAIL |
| 4 | Qwen 3.7 (Alibaba · China) | open | 26.0 | 25.6 | 23.2 | 17.9 | 18.8 | 13.8 | 16.8 | FAIL |
| 5 | Claude Sonnet 5 (Anthropic · US) | prop | 26.0 | 25.6 | 21.2 | 13.8 | 19.3 | 13.3 | 15.5 | FAIL |
| 6 | DeepSeek V4 (DeepSeek · China) | open | 26.0 | 25.4 | 20.4 | 12.3 | 14.8 | 14.0 | 13.7 | FAIL |
| 7 | Nemotron 3 Ultra (NVIDIA · US) | open | 26.0 | 24.3 | 24.3 | 5.5 | 16.5 | 14.9 | 12.3 | FAIL |
| 8 | MiniMax M3 (MiniMax · China) | open | 26.0 | 25.4 | — | 10.8 | 9.3 | 15.8 | 12.0 | FAIL |
| 9 | GLM-5.2 (Z.ai / Zhipu · China) | open | 26.0 | 25.2 | — | 5.2 | 17.4 | — | 11.3 | FAIL |
| 10 | Mistral Large 3 (Mistral AI · France) | prop | 26.0 | 24.7 | 23.9 | 7.6 | 9.0 | 8.9 | 8.5 | FAIL |
| 11 | Command A+ (Cohere · Canada) | prop | 26.0 | 16.5 | 9.0 | 8.5 | 10.1 | 0.1 | 6.2 | FAIL |
| 12 | Grok 4.3 (xAI · US) | prop | 26.0 | 25.4 | 24.3 | 0.5 | 14.2 | 0.5 | 5.1 | FAIL |
| 13 | GPT-OSS 120B (OpenAI · US) | open | 26.0 | 24.6 | 0.0 | 6.2 | 0.0 | 0.1 | 2.1 | FAIL |
Mean Bantu-blind L26 by group. Whether models are split by license or by country of origin, neither side clears the bar.
The gap is universal — a property of the training corpus, not the flag.
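The Bantu avg column in the leaderboard can be reproduced from the three per-language scores. The sketch below is a minimal Python illustration. It assumes the column is the plain mean over the Bantu tracks that have a score, rounded to one decimal, and that a "—" cell means no score was recorded. That rule is inferred from the published rows and is not stated by the benchmark. The function name `bantu_avg` and the use of `None` for "—" are illustrative choices.

```python
from statistics import mean
from typing import Optional, Sequence


def bantu_avg(scores: Sequence[Optional[float]]) -> float:
    """Mean of the available Bantu-track scores, rounded to one decimal.

    Assumption: missing cells ("—" in the table) are skipped rather than
    counted as zero. This matches the published rows but is not documented.
    """
    available = [s for s in scores if s is not None]
    if not available:
        raise ValueError("no Bantu scores recorded")
    return round(mean(available), 1)


# Rows copied from the leaderboard: (Bemba, Kinyarwanda, Luvale)
rows = {
    "GPT-5.5": (19.6, 20.9, 16.6),
    "GLM-5.2": (5.2, 17.4, None),  # Luvale is "—" in the table
    "GPT-OSS 120B": (6.2, 0.0, 0.1),
}

for model, scores in rows.items():
    print(model, bantu_avg(scores))
```

Counting the missing Luvale cell as zero would give GLM-5.2 a lower average than the 11.3 shown in the table, which is why the sketch skips missing cells.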
What L26 measures
A closed set is a finite inventory with a fixed boundary. The task is not to generate plausible units; it is to reproduce the exact set. L26 evaluates six operating alphabets, and reports every one on the same 26-point scale.
| Track | Operating alphabet | Inventory size | Status |
|---|---|---|---|
| English | Standard English Alphabet | 26 letters | declared |
| Mandarin Pinyin | Base syllables | 412 syllables | partly declared |
| Mandarin Pinyin | Toned syllables | 1,642 toned syllables | partly declared |
| Bemba | Full Syllable Inventory | 480 syllables | undeclared |
| Kinyarwanda | Full Syllable Inventory | 490 syllables | undeclared |
| Luvale | Full Syllable Inventory | 245 syllables | undeclared |
The 26 does not mean every language has 26 units. It means every result is translated into Standard-English-Alphabet-equivalent terms — so an unfamiliar failure becomes immediately legible. It answers one question: how much of A–Z would this failure be equivalent to?
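Read this way, a score converts directly into a letter-equivalent shortfall. The sketch below assumes the shortfall is simply 26 minus the score. The helper name `letters_missed` is hypothetical and not part of the benchmark.

```python
ALPHABET_SIZE = 26  # the Standard English Alphabet, A-Z


def letters_missed(l26_score: float) -> float:
    """Letter-equivalent shortfall for a score on the 26-point scale."""
    if not 0.0 <= l26_score <= ALPHABET_SIZE:
        raise ValueError("an L26 score must lie between 0 and 26")
    return ALPHABET_SIZE - l26_score


for score in (26.0, 20.0, 13.0):
    print(f"{score:.1f}/26 -> {letters_missed(score):.1f} letters missed")
```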
The core finding
The failure is not random. It follows one clear boundary:
The Standard English Alphabet is declared everywhere — books, classrooms, charts, songs, primers. Mandarin Pinyin is partly declared; syllable tables exist, so models perform strongly. Bantu Full Syllable Inventories have generally never existed as public, calibrated, machine-usable closed standards — so when asked to recover them, models fail.
Every model tested masters English. Every model tested fails the Bantu Full-Syllable-Inventory mastery bar in the blind condition, proprietary and open, US and non-US alike. The leaderboard does not say models know nothing. It says something more precise: no model reaches closed-set mastery on the undeclared Bantu operating alphabets. Even the strongest single Bantu result, 20.9 on Kinyarwanda, is equivalent to missing about five letters of A–Z.
Why this matters for AGI & ASI claims
A model that fails a reproducible A–Z test has not failed a niche benchmark — it has failed a basic closed-set mastery test. L26 extends that same standard across languages.
If a system claims broad language intelligence, it should be able to recover the operating alphabet of a language it claims to know. If it cannot, its fluency is not the same as mastery. This does not mean the model is useless or weak; it means the model has a foundation-layer gap. That gap is measurable, and L26 measures it in the simplest possible terms: how many alphabet-equivalent units did the model fail to recover?
If we would not call a model AGI after failing A–Z, why should we ignore equivalent failures in other languages?
Blind vs scaffolded
Blind: the model is given the syllable-building grammar but not the calibrated inventory. It must recover the operating alphabet itself. Every model fails the mastery bar here.
Scaffolded: the model is given the calibrated Full Syllable Inventory and asked to use it. Models recover sharply, and some frontier models reach perfect scaffolded scores on inventories they failed blind.
This shows the gap is not permanent. The models often have the mechanical ability to work with the inventory. What they lack is the declared standard.
Methodology
Each model output is compared against a ground-truth inventory. The score is:
L26 = 26 × recall × precision
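A minimal Python sketch of this formula follows. The page does not define recall and precision, so the sketch assumes the standard set-based definitions: recall is the share of the ground-truth inventory that the model produced, and precision is the share of the model's output that belongs to the ground truth. It also assumes the output is de-duplicated before scoring. The function name `l26_score` and the toy syllables are illustrative only.

```python
from typing import Iterable


def l26_score(predicted: Iterable[str], truth: Iterable[str]) -> float:
    """26 x recall x precision for a predicted inventory against ground truth."""
    predicted_set, truth_set = set(predicted), set(truth)
    if not truth_set:
        raise ValueError("ground-truth inventory must not be empty")
    if not predicted_set:
        return 0.0  # nothing produced: recall is 0, so the score is 0
    hits = len(predicted_set & truth_set)
    recall = hits / len(truth_set)         # share of the true inventory recovered
    precision = hits / len(predicted_set)  # share of the output that is valid
    return 26 * recall * precision


# Toy inventory for illustration only, not a real Bantu syllable inventory.
truth = {"ba", "be", "bi", "bo", "bu", "ka", "ke", "ki", "ko", "ku"}
output = {"ba", "be", "bi", "bo", "bu", "ka", "ke", "ki", "xa", "xe"}  # 8 valid, 2 invented

print(round(l26_score(output, truth), 2))
```

Under these definitions, missing units and invented units both pull the score down. In the toy case, recovering 8 of 10 units while adding 2 invalid ones gives 26 × 0.8 × 0.8 = 16.64, and only an exact reproduction of the inventory scores 26.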
The missing infrastructure
A calibrated Full Syllable Inventory gives a language a declared operating alphabet — it tells models, evaluators, speech systems, dictionaries, and alignment pipelines what the valid units are.
For English, that infrastructure already exists as A–Z. For Bantu languages, BantuNomics has built the equivalent at the syllable level: native-curated, standardized, versioned Full Syllable Inventories. L26 shows why that work matters. Without the FSI, models guess. With it, models can be measured, scaffolded, trained, and corrected against a standard.
For English, frontier models pass. For undeclared Bantu Full Syllable Inventories, they fail. A score of 20/26 is a six-letter-equivalent failure. 13/26 is half an alphabet. 25/26 is still not mastery.
If that standard applies to English, it must apply to every language.