L26 — The Operating-Alphabet Benchmark

Leaderboard · Blind condition

Every model masters English. None masters an undeclared alphabet.

In the blind Bantu tracks, models are given the syllable-building grammar but not the calibrated onset inventory — they must recover the operating alphabet from what they internalized in training. Every score is reported out of 26.

L26 · blind · Alphabet Score /26


#	Model	Developer · Country	Lic.	English	Pinyin base	Pinyin toned	Bemba	Kinyarwanda	Luvale	Bantu avg	Grade
1	GPT-5.5	OpenAI · US	prop	26.0	25.6	25.4	19.6	20.9	16.6	19.0	FAIL
2	Gemini 3	Google · US	prop	26.0	25.7	24.1	18.1	20.6	15.3	18.0	FAIL
3	Claude Opus 4.8	Anthropic · US	prop	26.0	23.9	25.4	19.6	17.5	15.3	17.5	FAIL
4	Qwen 3.7	Alibaba · China	open	26.0	25.6	23.2	17.9	18.8	13.8	16.8	FAIL
5	Claude Sonnet 5	Anthropic · US	prop	26.0	25.6	21.2	13.8	19.3	13.3	15.5	FAIL
6	DeepSeek V4	DeepSeek · China	open	26.0	25.4	20.4	12.3	14.8	14.0	13.7	FAIL
7	Nemotron 3 Ultra	NVIDIA · US	open	26.0	24.3	24.3	5.5	16.5	14.9	12.3	FAIL
8	MiniMax M3	MiniMax · China	open	26.0	25.4	—	10.8	9.3	15.8	12.0	FAIL
9	GLM-5.2	Z.ai / Zhipu · China	open	26.0	25.2	—	5.2	17.4	—	11.3	FAIL
10	Mistral Large 3	Mistral AI · France	prop	26.0	24.7	23.9	7.6	9.0	8.9	8.5	FAIL
11	Command A+	Cohere · Canada	prop	26.0	16.5	9.0	8.5	10.1	0.1	6.2	FAIL
12	Grok 4.3	xAI · US	prop	26.0	25.4	24.3	0.5	14.2	0.5	5.1	FAIL
13	GPT-OSS 120B	OpenAI · US	open	26.0	24.6	0.0	6.2	0.0	0.1	2.1	FAIL

13 models · 152 scored cells · deterministic scoring, no LLM judge · precision-docked · L26 = 26 × recall × precision. Blind = grammar given, inventory withheld.
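The Bantu avg column appears to be the mean of whichever Bantu cells were scored for a model, with cells shown as a dash skipped rather than counted as zero. That reading is inferred from the rows above, not from a documented rule. A minimal Python sketch under that assumption (`bantu_avg` is an illustrative name, not part of the benchmark):

```python
from typing import Optional, Sequence


def bantu_avg(cells: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the scored Bantu cells, rounded to one decimal.

    Assumption: unscored cells (None) are skipped, not treated as 0.
    """
    scored = [c for c in cells if c is not None]
    if not scored:
        return None  # no Bantu cell was scored, so there is nothing to average
    return round(sum(scored) / len(scored), 1)


# Bemba, Kinyarwanda, Luvale cells copied from three leaderboard rows.
rows = {
    "GPT-5.5": [19.6, 20.9, 16.6],
    "GLM-5.2": [5.2, 17.4, None],  # Luvale cell is unscored
    "GPT-OSS 120B": [6.2, 0.0, 0.1],
}

for model, cells in rows.items():
    print(model, bantu_avg(cells))
```

For these three rows the sketch reproduces the published Bantu avg values of 19.0, 11.3, and 2.1.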

Open vs proprietary

Open weights: 11.4

Proprietary: 12.8

Mean Bantu-blind L26. Neither side clears the bar.

US vs non-US

United States: 12.8

Outside the US: 11.4

The shortfall is universal: a property of the training corpus, not the flag.
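Both group comparisons can be rechecked from the leaderboard itself. The sketch below assumes each group figure is the unweighted mean of the per-model Bantu avg column; the page may instead aggregate raw cells, which could differ in the last digit. The `MODELS` table and `group_mean` helper are illustrative names.

```python
from statistics import mean

# (license, developer is US-based, Bantu avg) copied from the leaderboard table.
MODELS = {
    "GPT-5.5": ("prop", True, 19.0),
    "Gemini 3": ("prop", True, 18.0),
    "Claude Opus 4.8": ("prop", True, 17.5),
    "Qwen 3.7": ("open", False, 16.8),
    "Claude Sonnet 5": ("prop", True, 15.5),
    "DeepSeek V4": ("open", False, 13.7),
    "Nemotron 3 Ultra": ("open", True, 12.3),
    "MiniMax M3": ("open", False, 12.0),
    "GLM-5.2": ("open", False, 11.3),
    "Mistral Large 3": ("prop", False, 8.5),
    "Command A+": ("prop", False, 6.2),
    "Grok 4.3": ("prop", True, 5.1),
    "GPT-OSS 120B": ("open", True, 2.1),
}


def group_mean(keep) -> float:
    """Unweighted mean Bantu avg over models for which keep(license, is_us) is true."""
    values = [avg for lic, is_us, avg in MODELS.values() if keep(lic, is_us)]
    return round(mean(values), 1)


print("Open weights  ", group_mean(lambda lic, us: lic == "open"))
print("Proprietary   ", group_mean(lambda lic, us: lic == "prop"))
print("United States ", group_mean(lambda lic, us: us))
print("Outside the US", group_mean(lambda lic, us: not us))
```

Under that assumption the four means come out at 11.4, 12.8, 12.8, and 11.4, matching the figures shown for each split.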

What L26 measures

Closed-set mastery — reproduce the exact inventory, nothing more.

A closed set is a finite inventory with a fixed boundary. The task is not to generate plausible units; it is to reproduce the exact set. L26 evaluates six operating alphabets, and reports every one on the same 26-point scale.

Track	Operating alphabet	Inventory size	Status
English	Standard English Alphabet	26 letters	declared
Mandarin Pinyin	Base syllables	412 syllables	partly declared
Mandarin Pinyin	Toned syllables	1,642 toned syllables	partly declared
Bemba	Full Syllable Inventory	480 syllables	undeclared
Kinyarwanda	Full Syllable Inventory	490 syllables	undeclared
Luvale	Full Syllable Inventory	245 syllables	undeclared

The 26 does not mean every language has 26 units. It means every result is translated into Standard-English-Alphabet-equivalent terms — so an unfamiliar failure becomes immediately legible. It answers one question: how much of A–Z would this failure be equivalent to?

26 / 26 — mastery

25 / 26 — one-letter error

20 / 26 — six letters lost

13 / 26 — half the alphabet
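The scale above reads directly as "letters lost": subtract the score from 26. The sketch below makes that conversion explicit and also estimates how many native units one letter-equivalent stands for in a given inventory. The second helper assumes perfect precision, so that only omissions cost points; both function names are illustrative.

```python
def letters_lost(score: float) -> float:
    """Alphabet-equivalent shortfall for an L26 score, rounded to one decimal."""
    if not 0.0 <= score <= 26.0:
        raise ValueError("L26 scores lie between 0 and 26")
    return round(26.0 - score, 1)


def units_per_letter(inventory_size: int) -> float:
    """Native units that make up one letter-equivalent.

    Assumption: precision is 1.0, so score = 26 * recall and each missing
    unit costs 26 / inventory_size points.
    """
    return inventory_size / 26


for score in (26.0, 25.0, 20.0, 13.0):
    print(f"{score:g} / 26 -> {letters_lost(score):g} letters lost")

# Bemba's inventory is listed at 480 syllables.
print(f"Bemba: about {units_per_letter(480):.1f} syllables per letter-equivalent")
```

So on a 480-syllable inventory, losing one letter-equivalent corresponds to missing roughly 18 syllables when nothing invalid is added.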

The core finding

The declaredness cliff.

The failure is not random. It follows one clear boundary:

Declared inventories are recovered.
Undeclared inventories are not.

The Standard English Alphabet is declared everywhere — books, classrooms, charts, songs, primers. Mandarin Pinyin is partly declared; syllable tables exist, so models perform strongly. Bantu Full Syllable Inventories have generally never existed as public, calibrated, machine-usable closed standards — so when asked to recover them, models fail.

Every model tested masters English. Every model tested fails the Bantu Full-Syllable-Inventory mastery bar in the blind condition, proprietary and open, US and non-US alike. The leaderboard does not say models know nothing. It says something more precise: no model reaches closed-set mastery on the undeclared Bantu operating alphabets. Even the strongest single Bantu result (20.9/26, GPT-5.5 on Kinyarwanda) is equivalent to missing about five letters of A–Z.

Why this matters for AGI & ASI claims

The alphabet is not advanced knowledge. It is foundation knowledge.

A model that fails a reproducible A–Z test has not failed a niche benchmark — it has failed a basic closed-set mastery test. L26 extends that same standard across languages.

If a system claims broad language intelligence, it should be able to recover the operating alphabet of a language it claims to know. If it cannot, its fluency is not the same as mastery. This does not mean the model is useless or unpowerful — it means the model has a foundation-layer gap. That gap is measurable, and L26 measures it in the simplest possible terms: how many alphabet-equivalent units did the model fail to recover?

If we would not call a model AGI after failing A–Z, why should we ignore equivalent failures in other languages?

Blind vs scaffolded

The gap is not permanent — it is a missing standard.

Blind — the real test

The model is given the syllable-building grammar but not the calibrated inventory. It must recover the operating alphabet itself. Every model fails the mastery bar here.

Scaffolded — hand over the FSI

The model is given the calibrated Full Syllable Inventory and asked to use it. Models recover sharply — some frontier models reach perfect scaffolded scores on inventories they failed blind.

This shows the gap is not permanent. The models often have the mechanical ability to work with the inventory. What they lack is the declared standard.

Methodology

Deterministic. No LLM judge. No credit for plausible-but-invalid units.

Each model output is compared against a ground-truth inventory. The score is:

L26 = 26 × recall × precision

Recall: how much of the true inventory the model recovered.
Precision: how much of the model's output was valid.
Both errors cost points: a model is penalized for omissions (valid units missing) and for inventions (invalid units added).
Pass is binary: PASS = 100% recall and 100% precision. Anything less is not mastery — the same standard we apply to A–Z.
Zero contamination. Ground-truth inventories are held out; the blind condition never returns the answer key to the model.
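The scoring rule above can be sketched in a few lines of Python. This is an illustration of the stated formula, not the benchmark's own code. It assumes units are compared as exact strings, duplicates in the output are collapsed, and an empty output gets precision 0. The names `l26_score` and `L26Result` and the five-unit toy inventory are made up for the example.

```python
from typing import Iterable, NamedTuple


class L26Result(NamedTuple):
    recall: float
    precision: float
    score: float
    passed: bool


def l26_score(truth: Iterable[str], output: Iterable[str]) -> L26Result:
    """L26 = 26 * recall * precision over sets of units (exact string match)."""
    truth_set, output_set = set(truth), set(output)
    if not truth_set:
        raise ValueError("ground-truth inventory must be non-empty")
    hits = len(truth_set & output_set)  # valid units the model recovered
    recall = hits / len(truth_set)
    # Assumption: an empty output has precision 0, since nothing valid was produced.
    precision = hits / len(output_set) if output_set else 0.0
    score = 26 * recall * precision
    # PASS requires 100% recall and 100% precision, i.e. the two sets are identical.
    passed = output_set == truth_set
    return L26Result(recall, precision, score, passed)


# Toy inventory, not a real Full Syllable Inventory.
truth = {"ka", "ke", "ki", "ko", "ku"}
output = {"ka", "ke", "ki", "ko", "kx"}  # one omission ("ku"), one invention ("kx")

result = l26_score(truth, output)
print(f"recall={result.recall:.2f} precision={result.precision:.2f} "
      f"L26={result.score:.2f} pass={result.passed}")
```

In the toy case, one omission and one invention each cut a factor to 0.8, so the score is 26 × 0.8 × 0.8 = 16.64. Multiplying recall by precision means an invented unit is docked as surely as a missing one.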

The missing infrastructure

The benchmark is the measurement. The FSI is the infrastructure.

A calibrated Full Syllable Inventory gives a language a declared operating alphabet — it tells models, evaluators, speech systems, dictionaries, and alignment pipelines what the valid units are.

For English, that infrastructure already exists as A–Z. For Bantu languages, BantuNomics has built the equivalent at the syllable level: native-curated, standardized, versioned Full Syllable Inventories. L26 shows why that work matters. Without the FSI, models guess. With it, models can be measured, scaffolded, trained, and corrected against a standard.

AGI should not fail foundation-level alphabet mastery.

For English, frontier models pass. For undeclared Bantu Full Syllable Inventories, they fail. A score of 20/26 is a six-letter-equivalent failure. 13/26 is half an alphabet. 25/26 is still not mastery.

If that standard applies to English, it must apply to every language.
