Introducing L26 · The Operating-Alphabet Benchmark

AGI should not fail foundation-level alphabet mastery.

Today's frontier models write essays, draft code, and answer legal questions across many languages. But fluency is not mastery. L26 tests a more basic question: can a model reproduce the complete closed set of units a language is built from?

13 models tested
152 scored cells
6 operating alphabets
0 pass the blind Bantu bar

Leaderboard · Blind condition

Every model masters English. None masters an undeclared alphabet.

In the blind Bantu tracks, models are given the syllable-building grammar but not the calibrated onset inventory — they must recover the operating alphabet from what they internalized in training. Every score is reported out of 26.

L26 · blind · Alphabet Score /26
# Model Lic. English Pinyin base Pinyin toned Bemba Kinyarwanda Luvale Bantu avg Grade
1 GPT-5.5 (OpenAI · US) prop 26.0 25.6 25.4 19.6 20.9 16.6 19.0 FAIL
2 Gemini 3 (Google · US) prop 26.0 25.7 24.1 18.1 20.6 15.3 18.0 FAIL
3 Claude Opus 4.8 (Anthropic · US) prop 26.0 23.9 25.4 19.6 17.5 15.3 17.5 FAIL
4 Qwen 3.7 (Alibaba · China) open 26.0 25.6 23.2 17.9 18.8 13.8 16.8 FAIL
5 Claude Sonnet 5 (Anthropic · US) prop 26.0 25.6 21.2 13.8 19.3 13.3 15.5 FAIL
6 DeepSeek V4 (DeepSeek · China) open 26.0 25.4 20.4 12.3 14.8 14.0 13.7 FAIL
7 Nemotron 3 Ultra (NVIDIA · US) open 26.0 24.3 24.3 5.5 16.5 14.9 12.3 FAIL
8 MiniMax M3 (MiniMax · China) open 26.0 25.4 10.8 9.3 15.8 12.0 FAIL
9 GLM-5.2 (Z.ai / Zhipu · China) open 26.0 25.2 5.2 17.4 11.3 FAIL
10 Mistral Large 3 (Mistral AI · France) prop 26.0 24.7 23.9 7.6 9.0 8.9 8.5 FAIL
11 Command A+ (Cohere · Canada) prop 26.0 16.5 9.0 8.5 10.1 0.1 6.2 FAIL
12 Grok 4.3 (xAI · US) prop 26.0 25.4 24.3 0.5 14.2 0.5 5.1 FAIL
13 GPT-OSS 120B (OpenAI · US) open 26.0 24.6 0.0 6.2 0.0 0.1 2.1 FAIL
13 models · 152 scored cells · deterministic scoring, no LLM judge · precision-docked · L26 = 26 × recall × precision. Blind = grammar given, inventory withheld.

Open vs proprietary

Open weights: 11.4
Proprietary: 12.8

Mean Bantu-blind L26. Neither side clears the bar.

US vs non-US

United States: 12.8
Outside the US: 11.4

The gap is universal — a property of the training corpus, not the flag.
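The group figures above can be reproduced from the leaderboard's Bantu avg column. The sketch below assumes each group figure is an unweighted mean of the published (already rounded) per-model Bantu averages; the row layout and the helper name `group_mean` are illustrative, not part of the benchmark's own tooling.

```python
from statistics import mean

# (model, license, region, Bantu avg) copied from the blind leaderboard.
ROWS = [
    ("GPT-5.5", "prop", "US", 19.0),
    ("Gemini 3", "prop", "US", 18.0),
    ("Claude Opus 4.8", "prop", "US", 17.5),
    ("Qwen 3.7", "open", "non-US", 16.8),
    ("Claude Sonnet 5", "prop", "US", 15.5),
    ("DeepSeek V4", "open", "non-US", 13.7),
    ("Nemotron 3 Ultra", "open", "US", 12.3),
    ("MiniMax M3", "open", "non-US", 12.0),
    ("GLM-5.2", "open", "non-US", 11.3),
    ("Mistral Large 3", "prop", "non-US", 8.5),
    ("Command A+", "prop", "non-US", 6.2),
    ("Grok 4.3", "prop", "US", 5.1),
    ("GPT-OSS 120B", "open", "US", 2.1),
]


def group_mean(rows, field, value):
    """Unweighted mean of Bantu avg over rows whose license or region matches."""
    column = {"license": 1, "region": 2}[field]
    return round(mean(row[3] for row in rows if row[column] == value), 1)


print("open:", group_mean(ROWS, "license", "open"))
print("proprietary:", group_mean(ROWS, "license", "prop"))
print("US:", group_mean(ROWS, "region", "US"))
print("non-US:", group_mean(ROWS, "region", "non-US"))
```

Under that assumption the four means match the published 11.4 and 12.8 figures. Six of the seven US models are among the seven proprietary or top-ranked entries, so the two splits overlap heavily and should not be read as independent evidence.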

What L26 measures

Closed-set mastery — reproduce the exact inventory, nothing more.

A closed set is a finite inventory with a fixed boundary. The task is not to generate plausible units; it is to reproduce the exact set. L26 evaluates six operating alphabets, and reports every one on the same 26-point scale.

Track · Operating alphabet · Inventory size · Status
English · Standard English Alphabet · 26 letters · declared
Mandarin Pinyin · Base syllables · 412 syllables · partly declared
Mandarin Pinyin · Toned syllables · 1,642 toned syllables · partly declared
Bemba · Full Syllable Inventory · 480 syllables · undeclared
Kinyarwanda · Full Syllable Inventory · 490 syllables · undeclared
Luvale · Full Syllable Inventory · 245 syllables · undeclared

The 26 does not mean every language has 26 units. It means every result is translated into Standard-English-Alphabet-equivalent terms — so an unfamiliar failure becomes immediately legible. It answers one question: how much of A–Z would this failure be equivalent to?

26 / 26 — mastery
25 / 26 — one-letter error
20 / 26 — six letters lost
13 / 26 — half the alphabet
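The reading guide above is a subtraction from 26. As a minimal sketch (the helper name `letters_lost` is hypothetical, and rounding to one decimal is an assumption chosen to match the leaderboard's precision):

```python
def letters_lost(score, scale=26.0):
    """Convert an L26 score into the number of A-Z-equivalent letters lost."""
    if not 0.0 <= score <= scale:
        raise ValueError(f"score must be between 0 and {scale}")
    return round(scale - score, 1)


# The four reference points from the reading guide.
for score in (26.0, 25.0, 20.0, 13.0):
    print(f"{score:>4} / 26 -> {letters_lost(score)} letters lost")

# The highest single Bantu cell on the blind leaderboard (GPT-5.5, Kinyarwanda).
print(letters_lost(20.9))
```

Applied to the leaderboard, the highest single Bantu cell, 20.9, reads as roughly five letters of A–Z lost.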

The core finding

The declaredness cliff.

The failure is not random. It follows one clear boundary:

Declared inventories are recovered.
Undeclared inventories are not.

The Standard English Alphabet is declared everywhere: books, classrooms, charts, songs, primers. Mandarin Pinyin is partly declared; syllable tables exist, and most models score 24 or higher on base syllables, with more variation on toned syllables. Bantu Full Syllable Inventories have generally not existed as public, calibrated, machine-usable closed standards, so when asked to recover them, models fail.

Every model tested masters English. Every model tested fails the Bantu Full-Syllable-Inventory mastery bar in the blind condition, proprietary and open, US and non-US alike. The leaderboard does not say models know nothing. It says something more precise: no model reaches closed-set mastery on the undeclared Bantu operating alphabets. Even the strongest single Bantu result, 20.9 out of 26, is equivalent to missing about five letters of A–Z.

Why this matters for AGI & ASI claims

The alphabet is not advanced knowledge. It is foundation knowledge.

A model that fails a reproducible A–Z test has not failed a niche benchmark — it has failed a basic closed-set mastery test. L26 extends that same standard across languages.

If a system claims broad language intelligence, it should be able to recover the operating alphabet of a language it claims to know. If it cannot, its fluency is not the same as mastery. This does not mean the model is useless or weak; it means the model has a foundation-layer gap. That gap is measurable, and L26 measures it in the simplest possible terms: how many alphabet-equivalent units did the model fail to recover?

If we would not call a model AGI after failing A–Z, why should we ignore equivalent failures in other languages?

Blind vs scaffolded

The gap is not permanent — it is a missing standard.

Blind — the real test

The model is given the syllable-building grammar but not the calibrated inventory. It must recover the operating alphabet itself. Every model fails the mastery bar here.

Scaffolded — hand over the FSI

The model is given the calibrated Full Syllable Inventory and asked to use it. Models recover sharply — some frontier models reach perfect scaffolded scores on inventories they failed blind.

This proves the gap is not permanent. The models often have the mechanical ability to work with the inventory. What they lack is the declared standard.

Methodology

Deterministic. No LLM judge. No credit for plausible-but-invalid units.

Each model output is compared against a ground-truth inventory. The score is:

L26 = 26 × recall × precision
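
The formula can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code. It assumes the standard set-based definitions (recall is the share of ground-truth units the model produced, precision is the share of produced units that are in the ground truth), exact string matching, and duplicates collapsed before scoring; the function name `l26_score` is hypothetical.

```python
import string


def l26_score(predicted, truth):
    """Return 26 * recall * precision for a predicted inventory against a ground-truth set."""
    truth_set = set(truth)
    predicted_set = set(predicted)  # assumption: duplicate units are collapsed
    if not truth_set:
        raise ValueError("ground-truth inventory must not be empty")
    if not predicted_set:
        return 0.0  # nothing produced: recall is 0, so the score is 0
    hits = len(predicted_set & truth_set)
    recall = hits / len(truth_set)
    precision = hits / len(predicted_set)
    return 26 * recall * precision


alphabet = list(string.ascii_lowercase)

perfect = l26_score(alphabet, alphabet)             # every letter, nothing extra
one_missing = l26_score(alphabet[:-1], alphabet)    # "z" omitted: recall 25/26
one_extra = l26_score(alphabet + ["ñ"], alphabet)   # one invalid unit: precision 26/27

print(perfect, round(one_missing, 2), round(one_extra, 2))
```

Because recall and precision multiply, a model cannot raise its score by emitting extra plausible-looking units: each invalid unit lowers precision, which is how a rule of no credit for plausible-but-invalid units would play out under these assumptions.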

The missing infrastructure

The benchmark is the measurement. The FSI is the infrastructure.

A calibrated Full Syllable Inventory gives a language a declared operating alphabet — it tells models, evaluators, speech systems, dictionaries, and alignment pipelines what the valid units are.

For English, that infrastructure already exists as A–Z. For Bantu languages, BantuNomics has built the equivalent at the syllable level: native-curated, standardized, versioned Full Syllable Inventories. L26 shows why that work matters. Without the FSI, models guess. With it, models can be measured, scaffolded, trained, and corrected against a standard.

AGI should not fail foundation-level alphabet mastery.

For English, frontier models pass. For undeclared Bantu Full Syllable Inventories, they fail. A score of 20/26 is a six-letter-equivalent failure. 13/26 is half an alphabet. 25/26 is still not mastery.

If that standard applies to English, it must apply to every language.