NATIVE TURKISH BPE TOKENIZER
Toward a 'Proper' Turkish Language-Native LLM: Phase 1 — Tokenization
1. THE PROBLEM: THE LANGUAGE TAX
When Turkish text is processed by any major LLM, it passes through a tokenizer designed for English. Turkish's agglutinative structure — where meaning is packed into suffixes — is alien to these tokenizers.
Same sentence, different cost
| Tokenizer | Vocab Size | Tokens | Ratio |
|---|---|---|---|
| Turkish 64K v3 (this work) | 64,000 | 9 | 1.0x |
| Kumru-2B | 50,176 | 9 | 1.0x |
| GPT-4o (o200k) | 200,019 | 12 | 1.3x |
| GPT-4 (cl100k) | 100,277 | 17 | 1.9x |
Test sentence: “Türkiye Cumhuriyeti'nin başkenti Ankara'dır.”
This means every Turkish API call can cost roughly twice as many tokens. Context windows hold proportionally less Turkish text, and training runs process fewer sentences per batch. The tax compounds with length.
2. RELATED WORK
Several Turkish language models and tokenizers exist. Hamza (Acikgoz, “Bridging the Bosphorus”) provides Turkish LLMs from 124M to 1.3B parameters, including models adapted from GPT-2 and Mistral; the Hamza tokenizer uses vocabulary size 50,257 — identical to GPT-2 — and is not optimized for Turkish morphology. TabiBERT (Boğaziçi Üniversitesi TabiLab) is a ModernBERT-based encoder trained on 1T tokens for Turkish NLP; vocabulary size 50,176. Kumru-2B uses a 50,176-vocabulary BPE tokenizer. LlamaTurk (METU NLP) adapts LLaMA with a 28K BPE tokenizer trained on Turkish OSCAR.
A consistent pattern is that existing Turkish decoder LLMs converge on ~50K vocabulary: 50,257 (GPT-2 size) when the base model is GPT-2, and 50,176 for others. This appears to follow from adaptation of English-base tokenizers rather than from systematic vocabulary-size experiments for Turkish. No prior work is known to the author that reports (1) the GPT-2 pre-tokenization regex breaking Turkish apostrophe suffixes, (2) vocabulary saturation versus data saturation in tokenizer training, or (3) systematic comparison of 16K–64K vocabulary on the same corpus.
3. NOVEL FINDING: GPT-2 REGEX BREAKS TURKISH
During development, it was discovered that the GPT-2 pre-tokenization regex — the same pattern used by GPT-4, GPT-4o, Llama 3, and Mistral — contains English contraction patterns ('s|'t|'re|'ve|'m|'ll|'d) that actively damage Turkish tokenization. The 'd alternative (English contraction "I'd") matches the start of the Turkish -dA suffix family — one of the most common suffix families in the language (locative, ablative). The same problem occurs with 's (conditional suffix), 't, and 'm.
How GPT-4 tokenizes Turkish apostrophe suffixes
| Turkish Text | GPT-4 Tokenization | Problem |
|---|---|---|
| Ankara'dır | ["Ankara", "'d", "ır"] | 'd steals the d from "dır" |
| İstanbul'da | ["İstanbul", "'d", "a"] | 'd steals the d from "da" |
| Ali'den | ["Ali", "'d", "en"] | 'd steals the d from "den" |
The fix: cleaned Turkish regex
| Turkish Text | Corrected Tokenization | Result |
|---|---|---|
| Ankara'dır | ["Ankara", "'", "dır"] | Suffix stays intact |
| İstanbul'da | ["İstanbul", "'", "da"] | Suffix stays intact |
| Ali'den | ["Ali", "'", "den"] | Suffix stays intact |
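The contraction clash can be reproduced with a simplified stand-in for the two pre-tokenizers. The patterns below use `\w` in place of the original `\p{L}`/`\p{N}` classes, so this is an illustrative sketch of the splitting behavior, not the exact GPT-2 regex:

```python
import re

# Simplified stand-in for the GPT-2 pre-tokenizer: the English contraction
# alternatives ('s|'t|'re|'ve|'m|'ll|'d) are tried before the word pattern.
GPT2_STYLE = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?[^\s\w]+|\s+")

# Cleaned Turkish variant: the apostrophe is split off on its own,
# so the suffix that follows it stays intact.
TURKISH_STYLE = re.compile(r"'| ?\w+| ?[^\s\w]+|\s+")

print(GPT2_STYLE.findall("Ankara'dır"))     # ['Ankara', "'d", 'ır']
print(TURKISH_STYLE.findall("Ankara'dır"))  # ['Ankara', "'", 'dır']
```

Because regex alternation is tried left to right, 'd wins before the word pattern ever sees "dır"; removing the contraction alternatives is enough to keep the suffix whole.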
4. ARCHITECTURE DECISIONS
| Component | Choice | Rationale |
|---|---|---|
| Algorithm | Byte-level BPE | Industry standard (GPT-4, Llama 3, Mistral) |
| Normalization | NFC Unicode | Unifies composed/decomposed forms of ç, ş, ğ, ö, ü, İ |
| Pre-tokenization | Custom Turkish regex + ByteLevel | GPT-2 style with English contractions removed |
| Byte-level config | Internal regex disabled | Prevents re-application of problematic patterns |
| Special tokens | Llama-3 style (7 tokens) | Future instruction-tuning compatibility |
| Min frequency | 2 | Filters typos/noise without losing rare morphemes |
| Library | HuggingFace tokenizers (Rust) | Production-grade, fast training |
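The NFC choice in the table above can be checked with the standard library: composed and decomposed forms of the same Turkish letter are distinct strings until normalization unifies them (a minimal illustration, not project code):

```python
import unicodedata

composed = "çalışma"          # 'ç' as a single code point (U+00E7)
decomposed = "c\u0327alışma"  # 'c' followed by COMBINING CEDILLA (U+0327)

# Without normalization, byte-level BPE would treat these as different words.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```

Running NFC before pre-tokenization guarantees that every ç, ş, ğ, ö, ü, İ reaches the BPE model in exactly one byte sequence.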
Special tokens
| Token | ID | Purpose |
|---|---|---|
| <\|begin_of_text\|> | 0 | Start of document/sequence |
| <\|end_of_text\|> | 1 | End of document/sequence |
| <\|pad\|> | 2 | Padding for batch processing |
| <\|unk\|> | 3 | Unknown (safety fallback, rarely triggered) |
| <\|start_header_id\|> | 4 | Instruction tuning: role header start |
| <\|end_header_id\|> | 5 | Instruction tuning: role header end |
| <\|eot_id\|> | 6 | Instruction tuning: end of turn |
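A sketch of how these decisions combine in the HuggingFace tokenizers API. The split regex here is a placeholder for the cleaned Turkish pattern (the real one lives in train_tokenizer.py), and the tiny inline corpus stands in for the 22 GB training set:

```python
from tokenizers import Tokenizer, Regex, normalizers, pre_tokenizers, decoders
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

SPECIALS = ["<|begin_of_text|>", "<|end_of_text|>", "<|pad|>", "<|unk|>",
            "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]

# Placeholder split pattern: the apostrophe is isolated on its own
# instead of being fused into English contraction pieces like 'd.
TURKISH_SPLIT = r"'|\p{L}+|\p{N}+| ?[^\s\p{L}\p{N}]+|\s+"

tok = Tokenizer(BPE(unk_token="<|unk|>"))
tok.normalizer = normalizers.NFC()  # unify composed/decomposed ç, ş, ğ, ö, ü
tok.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(Regex(TURKISH_SPLIT), behavior="isolated"),
    # use_regex=False disables ByteLevel's internal GPT-2 pattern,
    # so the problematic contractions are never re-applied.
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])
tok.decoder = decoders.ByteLevel()

trainer = BpeTrainer(vocab_size=64_000, min_frequency=2,
                     special_tokens=SPECIALS,
                     initial_alphabet=pre_tokenizers.ByteLevel.alphabet())
corpus = ["Ankara'dır.", "İstanbul'da yaşıyorum.", "Ali'den geldi."] * 2
tok.train_from_iterator(corpus, trainer)

# The apostrophe boundary holds: a "'d" merge can never form.
assert "'d" not in tok.encode("Ankara'dır").tokens
```

Because merges only happen inside pre-token boundaries, isolating the apostrophe structurally rules out the GPT-2 failure mode rather than merely discouraging it.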
5. TRAINING CORPUS: 3 ITERATIONS
The tokenizer was trained through three iterative rounds, each adding new data domains. This process revealed critical insights about the relationship between corpus diversity and tokenizer quality.
v1: Foundation (1.7 GB, 14 files)
| Domain | Source | Size |
|---|---|---|
| General Knowledge | Wikipedia TR (520K articles) | 866 MB |
| Code | Python corpus | 569 MB |
| Reasoning | Math problems, RAG, Chain-of-thought | 221 MB |
| Literary | TED talks, classic literature, poems, songs, folk, idioms | 46 MB |
| Vocabulary | TDK dictionary (full + simplified) | 15 MB |
v2: Quality Boost (10 GB, 16 files) — curated literary & academic data added
| Domain (NEW) | Source | Size |
|---|---|---|
| Cultural/Literary Web | BellaTurca ÖzenliDerlem (1.4M curated docs) | 4.4 GB |
| Academic/Thesis | BellaTurca AkademikDerlem (668K papers) | 3.5 GB |
v3: Domain Coverage (22 GB, 27 files) — 7 new specialized domains
| Domain (NEW) | Source | Size |
|---|---|---|
| News/Journalism | 1.8M news articles + summarization corpus | 4.5 GB |
| Legal/Law | 700K court decisions + Constitutional Court rulings | 3.7 GB |
| Instructions | 2.5M instruction-answer pairs | 3.7 GB |
| Financial | KAP announcements, capital markets (256K docs) | 425 MB |
| Education | Education QA + MMLU exam questions (8 subjects) | 91 MB |
| Medical | Medical reasoning + hospital articles | 108 MB |
6. DATA & VOCABULARY SCALING EXPERIMENTS
Two systematic experiments were conducted: (1) scaling training data from 1.7GB to 22GB at fixed 48K vocabulary, and (2) scaling vocabulary from 48K to 64K on the full 22GB corpus. The combination revealed a critical insight about the interaction between data volume and vocabulary capacity.
Experiment A: Data scaling at 48K vocabulary
| Sentence | 48k_v1 | 48k_v2 | 48k_v3 | Kumru |
|---|---|---|---|---|
| Merhaba dünya, nasılsın? | 6 | 6 | 6 | 8 |
| Evlerdekilere söyleyin, yarın geliyoruz. | 11 | 9 | 9 | 12 |
| Çekoslovakyalılaştıramadıklarımızdan mısınız? | 12 | 9 | 10 | 13 |
| Dün akşam arkadaşlarımla buluştuk... | 15 | 10 | 10 | 15 |
| Spinoza'nın töz ontolojisi... | 33 | 29 | 32 | 30 |
| Sanığın mahkumiyet kararına... (legal) | 12 | 11 | 8 | 12 |
| Anayasa Mahkemesi başvuruyu... (legal) | 10 | 9 | 7 | 11 |
| Hastanın ameliyat sonrası... (medical) | 10 | 8 | 7 | 8 |
| TOTAL (21 sentences) | 261 | 235 | 233 | 267 |
Totals above use a truncated sentence set. Section 7 reports the same tokenizers on full sentences (192 / 199 / 224).
v1→v2 (1.7GB → 10GB): +10.0% improvement. v2→v3 (10GB → 22GB): +0.9% improvement — apparent diminishing returns.
Experiment B: Vocabulary scaling — the breakthrough
The near-zero improvement from v2→v3 at 48K initially suggested data saturation. However, training a 64K tokenizer on the same v3 corpus revealed a fundamentally different result:
| Tokenizer | Data | Total Tokens | vs Kumru |
|---|---|---|---|
| 48k_v1 | 1.7 GB | 261 | +2.2% |
| 48k_v2 | 10 GB | 235 | +12.0% |
| 48k_v3 | 22 GB | 233 | +12.7% |
| 64k_v1 | 1.7 GB | 247 | +7.5% |
| 64k_v3 | 22 GB | 222 | +16.9% |
| Kumru (50k) | ~500 GB | 267 | baseline (truncated set) |
The "diminishing returns" observed at 48K were not caused by redundant data — they were caused by a full vocabulary. At 48,000 merge slots, the tokenizer had no room left to encode new domain-specific patterns from the legal, medical, and financial data added in v3.
When the same 22GB corpus was used to train a 64K tokenizer, the extra 16,000 vocabulary slots absorbed the domain vocabulary that 48K had to discard, producing a 10.1% improvement (64k_v1→64k_v3) on the same data that only yielded 0.9% at 48K.
Implication: Vocabulary size and training data must be scaled together. Adding data without vocabulary capacity, or vocabulary without data diversity, both produce diminishing returns. The optimal tokenizer requires both sufficient vocabulary slots and sufficiently diverse training data to fill them.
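The percentages quoted above follow directly from the table totals; a quick arithmetic check:

```python
# Totals from Experiments A and B (truncated 21-sentence set).
totals = {"48k_v1": 261, "48k_v2": 235, "48k_v3": 233,
          "64k_v1": 247, "64k_v3": 222, "kumru": 267}

def improvement(old, new):
    """Percent fewer tokens when moving from `old` to `new`."""
    return round(100 * (totals[old] - totals[new]) / totals[old], 1)

print(improvement("48k_v1", "48k_v2"))  # 10.0 (data scaling pays off at first...)
print(improvement("48k_v2", "48k_v3"))  # 0.9  (...then stalls at a full 48K vocab)
print(improvement("64k_v1", "64k_v3"))  # 10.1 (the same data jump pays off at 64K)
print(improvement("kumru", "64k_v3"))   # 16.9 (final gap vs the 50K baseline)
```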
7. HEAD-TO-HEAD: 64K v3 vs TURKISH & ENGLISH TOKENIZERS
Comparison across tokenizers on 21 test sentences covering daily speech, formal language, agglutination, code, and six specialized domains. Turkish tokenizers (this work, Kumru, TabiBERT, Hamza) were evaluated on the same full-sentence set; GPT-4/GPT-4o use different tokenizers and are included for reference.
| Test Sentence | 64k v3 | 48k v3 | Kumru | TabiBERT | Hamza | GPT-4o | GPT-4 |
|---|---|---|---|---|---|---|---|
| Merhaba dünya, nasılsın? | 6 | 6 | 7 | 7 | 14 | 9 | 11 |
| Türkiye Cumhuriyeti'nin başkenti Ankara'dır. | 9 | 9 | 8 | 8 | 21 | 12 | 17 |
| Evlerdekilere söyleyin, yarın geliyoruz. | 8 | 9 | 11 | 11 | 21 | 12 | 18 |
| Çekoslovakyalılaştıramadıklarımızdan mısınız? | 9 | 10 | 12 | 12 | 29 | 19 | 21 |
| Görüşebileceğimizi umuyorum. | 5 | 6 | 6 | 6 | 15 | 11 | 14 |
| Dün akşam arkadaşlarımla buluştuk. | 5 | 5 | 9 | 9 | 20 | 20 | 25 |
| Edebiyatımızın en önemli eserlerinden... | 15 | 16 | 16 | 16 | 42 | 27 | 40 |
| Osmanlı İmparatorluğu'nun son... | 12 | 12 | 11 | 11 | 47 | 28 | 43 |
| Spinoza'nın töz ontolojisi... | 17 | 17 | 16 | 16 | 33 | 37 | 53 |
| def __init__(self, value): | 8 | 8 | 11 | 11 | 9 | 8 | 8 |
| for i in range(len(dataset)): | 9 | 9 | 13 | 13 | 12 | 7 | 7 |
| Makine öğrenmesi algoritmalarının... | 10 | 10 | 11 | 11 | 36 | 20 | 33 |
| Büyükşehir belediyesi toplu taşıma... | 8 | 9 | 8 | 8 | 28 | 17 | 28 |
| İstanbul'dan Ankara'ya tren... | 11 | 11 | 11 | 11 | 19 | 14 | 16 |
| 2024 yılında Türkiye'nin nüfusu... | 11 | 11 | 15 | 15 | 32 | 15 | 26 |
| Sanığın mahkumiyet kararına... (legal) | 7 | 7 | 11 | 11 | 26 | 17 | 24 |
| Anayasa Mahkemesi başvuruyu... (legal) | 6 | 7 | 10 | 10 | 23 | 16 | 20 |
| Hastanın ameliyat sonrası... (medical) | 7 | 7 | 7 | 7 | 30 | 15 | 26 |
| Şirketin halka arz sürecinde... (finance) | 11 | 11 | 11 | 11 | 36 | 20 | 30 |
| Fotosentez sırasında... (science) | 11 | 11 | 12 | 12 | 36 | 29 | 38 |
| Cumhurbaşkanlığı Sözcüsü basın... | 7 | 8 | 8 | 8 | 39 | 18 | 29 |
| TOTAL (21 sentences) | 192 | 199 | 224 | 224 | 568 | 371 | 527 |
Totals from benchmark_tokenizers.py on 21 full sentences. Hamza uses GPT-2 tokenizer (50,257 vocab); Kumru and TabiBERT use ~50K BPE.
Observation: Kumru and TabiBERT produce identical token counts on every sentence in this benchmark (same vocabulary size 50,176; same total 224). Exact agreement across all 21 sentences is uncommon for independently trained BPE tokenizers. The finding is reported here without further interpretation.
Extended benchmark: 104 sentences (21 core + 83 hard/edgy)
The same tokenizers were run on an extended set: the 21 core sentences above plus 83 “hard” sentences
(long agglutination, legal/medical/financial phrasing, colloquial/slang, numbers and dates, code snippets,
punctuation and abbreviations, loanwords, case/diacritic edge cases). All counts from benchmark_tokenizers.py.
| Tokenizer | Total tokens (104 sent) | vs best |
|---|---|---|
| 64k v3 | 1,041 | baseline (best) |
| 48k v3 | 1,073 | +3.1% |
| 32k v2 | 1,163 | +11.7% |
| 16k v1 | 1,359 | +30.5% |
| Kumru | 1,198 | +15.1% |
| TabiBERT | 1,198 | +15.1% |
| Hamza | 2,451 | +135.4% |
64K remains best on the extended set; Kumru and TabiBERT again match each other (1,198).
The hard set includes e.g. Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine,
legal (HMK 353, tahkim), medical (pankreatikoduodenektomi, kardiyovasküler), financial (BIST 100, SPK),
slang (N'olcak, bişey), code (return {'key': value}), and loanwords (Startup'lar, API endpoint'i).
8. DOMAIN-SPECIFIC ANALYSIS
Domain-targeted training data produces measurable improvements in specialized vocabulary handling. Below are token-level comparisons for each new domain.
Legal Turkish
| Tokenizer | Tokens | "Anayasa Mahkemesi başvuruyu oybirliğiyle reddetti." |
|---|---|---|
| 64k v3 | 6 | ["Anayasa", "Mahkemesi", "başvuruyu", "oybirliğiyle", "reddetti", "."] |
| Kumru | 10 | ["Anayasa", "Mahkemesi", "başvur", "uyu", "oy", "bir", "liğiyle", "reddet", "ti", "."] |
| TabiBERT | 10 | (same as Kumru) |
| Hamza | 23 | (GPT-2 tokenizer) |
| GPT-4 | 20 | (fragmented into sub-word pieces) |
başvuruyu (the application) and oybirliğiyle (unanimously) are each single tokens
in 64K. Kumru and TabiBERT fragment the first into 2 pieces and the second into 3 pieces. reddetti (rejected) is also
a single token — Kumru/TabiBERT need 2 (reddet | ti). Result: 6 vs 10 (Kumru/TabiBERT),
6 vs 23 (Hamza).
Medical Turkish
| Tokenizer | Tokens | "Hastanın ameliyat sonrası komplikasyon riski değerlendirilmelidir." |
|---|---|---|
| 64k v3 | 7 | ["Hastanın", "ameliyat", "sonrası", "komplikasyon", "riski", "değerlendirilmelidir", "."] |
| Kumru | 7 | (same tokenization) |
| TabiBERT | 7 | (same as Kumru) |
| Hamza | 30 | (GPT-2 tokenizer) |
| GPT-4 | 26 | (fragmented into sub-word pieces) |
Hastanın (of the patient) is a single token. değerlendirilmelidir (must be evaluated) —
a 6-morpheme suffix chain — is also a single token. Kumru/TabiBERT tie at 7; Hamza 30; GPT-4 26.
Financial Turkish
| Tokenizer | Tokens | "Şirketin halka arz sürecinde sermaye piyasası kurulu onayı gerekmektedir." |
|---|---|---|
| 64k v3 | 11 | ["Şirket", "in", "halka", "arz", "sürecinde", "sermaye", "piyasası", "kurulu", "onayı", "gerekmektedir", "."] |
| Kumru | 11 | (same tokenization) |
| TabiBERT | 11 | (same as Kumru) |
| Hamza | 36 | (GPT-2 tokenizer) |
| GPT-4 | 30 | (fragmented into sub-word pieces) |
News/Journalism Turkish
| Tokenizer | Tokens | "Cumhurbaşkanlığı Sözcüsü basın toplantısında açıklamalarda bulundu." |
|---|---|---|
| 64k v3 | 7 | ["Cumhurbaşkanlığı", "Sözcüsü", "basın", "toplantısında", "açıklamalarda", "bulundu", "."] |
| Kumru | 8 | ["Cumhurbaşkanlığı", "Sözc", "üsü", "basın", "toplantısında", "açıklamalarda", "bulundu", "."] |
| TabiBERT | 8 | (same as Kumru) |
| Hamza | 39 | (GPT-2 tokenizer) |
| GPT-4 | 29 | (fragmented into sub-word pieces) |
Cumhurbaşkanlığı (Presidency) and Sözcüsü (Spokesperson) are each
single tokens in 64K. Kumru and TabiBERT split Sözcüsü into 2 pieces; Hamza 39 tokens.
The extra vocabulary capacity allows 64K to capture these high-frequency institutional terms as atomic units.
9. MORPHOLOGICAL ANALYSIS
The tokenizer learned Turkish morphology from pure statistics — no linguistic rules were programmed. BPE naturally discovered morpheme-like boundaries through frequency analysis of 22GB of text.
Verb morphology (learned, not coded)
| Word | Tokens | Morphological Interpretation |
|---|---|---|
| geliyorum | gel | iyorum | stem + present continuous 1st person |
| geldim | gel | dim | stem + past tense 1st person |
| gelecek | gelecek | single token (very common word) |
| gelmiş | gelmiş | single token (common evidential) |
| geliyoruz | geliyoruz | single token (common 1st person plural) |
Noun case suffixes
| Word | Tokens | Count |
|---|---|---|
| ev (house) | ev | 1 |
| evde (in the house) | evde | 1 |
| evden (from the house) | evden | 1 |
| eve (to the house) | eve | 1 |
| evin (of the house) | evin | 1 |
| evler (houses) | evler | 1 |
Six different grammatical forms of "ev" — all encoded as single tokens.
Suffix chain handling
| Word | Tokens | Count |
|---|---|---|
| değerlendirilmelidir | değerlendirilmelidir | 1 |
| larımızdan (from our ...s) | larımızdan | 1 |
| gidebilirsiniz (you can go) | gidebilirsiniz | 1 |
| oybirliğiyle (unanimously) | oybirliğiyle | 1 |
10. DIACRITIC ROBUSTNESS
Turkish users sometimes type without diacritics (c instead of ç, s instead of ş, i instead of ı). The tokenizer handles both forms, but correct Turkish is significantly more token-efficient — by design.
| Correct Turkish | Tokens | Without Diacritics | Tokens | Cost |
|---|---|---|---|---|
| şehir | 1 | sehir | 3 | +2 |
| büyükşehir | 2 | buyuksehir | 6 | +4 |
| Türkiye | 1 | Turkiye | 2 | +1 |
| öğrenci | 1 | ogrenci | 3 | +2 |
| günaydın | 2 | gunaydin | 3 | +1 |
11. CONTEXT WINDOW: THE COMPOUNDING ADVANTAGE
Tokenizer efficiency is not a fixed saving — it is a multiplier on context length. The longer the context window, the more the advantage compounds. This has direct architectural implications for the target model.
Effective context capacity
At each context length, the 64K tokenizer holds significantly more Turkish text than competitors could fit in the same number of token slots:
| Context Length | This Work (64K) | Kumru Equivalent | GPT-4 Equivalent | Extra Text Capacity |
|---|---|---|---|---|
| 2,048 tokens | 2,048 | ~2,387 | ~5,627 | +339 vs Kumru |
| 4,096 tokens | 4,096 | ~4,773 | ~11,253 | +677 vs Kumru |
| 32,768 tokens | 32,768 | ~38,187 | ~90,027 | +5,419 vs Kumru |
| 128,000 tokens | 128,000 | ~149,333 | ~351,667 | +21,333 vs Kumru |
“Kumru Equivalent” = how many Kumru tokens would be needed to hold the same amount of Turkish text. Calculated from the efficiency gaps measured in Section 7.
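The equivalents are simple ratio arithmetic over the Section 7 totals (192 for this work, 224 for Kumru, 527 for GPT-4); a sketch, with small differences from some table rows due to rounding:

```python
# Token totals over the 21 full benchmark sentences (Section 7).
TOTALS = {"64k_v3": 192, "kumru": 224, "gpt4": 527}

def equivalent(ctx_tokens, other):
    """How many `other` tokens hold the same Turkish text as ctx_tokens of 64K v3."""
    return round(ctx_tokens * TOTALS[other] / TOTALS["64k_v3"])

for ctx in (2048, 4096, 32768, 128000):
    print(ctx, equivalent(ctx, "kumru"), equivalent(ctx, "gpt4"))
```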
Architectural implication: small model, massive context
The efficiency advantage fundamentally changes the optimal model architecture for Turkish. Two strategies were considered:
| Strategy | Parameters | Context | Turkish Text Capacity | Trainability |
|---|---|---|---|---|
| Large model, short context | 7B | 4,096 | ~3–4 pages | Requires 40–80 GB VRAM |
| Small model, long context | 1–2B | 128K | ~entire book | Trainable on consumer hardware |
The embedding layer overhead of 64K vocabulary at 1B scale is approximately 3.3% of total parameters — a negligible cost for a permanent ~14% efficiency advantage over Kumru/TabiBERT on every token processed.
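One way to arrive at a figure in this range (note: the hidden size below is an assumption for illustration, not a value stated in this document): counting only the 16,000 rows that a 64K embedding table adds over the 48K variant, at a hypothetical hidden size of 2,048, costs about 33M parameters — roughly 3.3% of a 1B-parameter model:

```python
extra_rows = 64_000 - 48_000   # additional vocabulary slots vs the 48K variant
d_model = 2_048                # ASSUMED hidden size for a ~1B-parameter model
extra_params = extra_rows * d_model
overhead_pct = 100 * extra_params / 1_000_000_000
print(f"{extra_params:,} extra embedding parameters = {overhead_pct:.1f}% of 1B")
```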
12. PRACTICAL IMPLICATIONS
Turkish text processed by English-centric tokenizers incurs a roughly 2× token penalty across context length, speed, cost, and training efficiency. A native Turkish tokenizer eliminates this tax entirely.
The tokenizer covers 11 specialized domains (general, academic, legal, medical, financial, education, news, code, literary, reasoning, instructions), ensuring efficient tokenization regardless of subject matter.
13. PROJECT STATUS
| Phase | Status | Key Result |
|---|---|---|
| Phase 1: Tokenizer | COMPLETE | 64K vocab, ~14% fewer tokens than Kumru/TabiBERT, ~2.7× vs GPT-4, 11 domains |
| Phase 2: Architecture | NEXT | 1–2B parameters, 128K context target |
| Phase 3: Pre-training | NEXT | Language learning from Turkish corpus |
| Phase 4: Fine-tuning | NEXT | Instruction following, chat capability |
14. REPRODUCIBILITY
Code, data sources, and trained tokenizers are available.
- Training script: train_tokenizer.py
- Benchmark script: benchmark_tokenizers.py (104-sentence comparison: 21 core + 83 hard/edgy)
- Training data: 22 GB across 27 files, 11 domains
- Selected tokenizer: tokenizers/turkish_bpe_64k/tokenizer.json
- Versions preserved: 16K, 32K, 48K, 64K × v1/v2/v3
- Baselines: Kumru-2B (50,176), TabiBERT (50,176), Hamza (50,257, GPT-2 tokenizer), GPT-4 (cl100k_base), GPT-4o (o200k_base)
A special note of gratitude is owed to Kumru AI: their Turkish LLM’s well-documented limitations in reasoning and Turkish morphology provided the initial motivation to build a proper Turkish language model from scratch. Hamza (emrecanacikgoz) and TabiBERT (boun-tabilab) tokenizers were also compared; see Section 7 and
benchmark_tokenizers.py.
© 2026 • Independent Research