
NATIVE TURKISH BPE TOKENIZER

Toward a 'Proper' Turkish Language-Native LLM: Phase 1 — Tokenization

February 2026 • Independent Research • COMPLETE

~14%
FEWER TOKENS THAN KUMRU/TABIBERT (50K)
~2.7×
FEWER TOKENS THAN GPT-4 (21 SENTENCES)
64K
VOCABULARY SIZE
22 GB
TRAINING CORPUS (27 FILES)
Abstract. Most widely used large language models rely on tokenizers designed for English. When Turkish text is processed by such systems, it is tokenized roughly 2× less efficiently — a hidden “language tax” affecting tens of millions of speakers. This report documents the construction of a purpose-built Turkish BPE tokenizer and the experiments that led to it. During development, it was found that the GPT-2 pre-tokenization regex (used by GPT-4, Llama, and Mistral) breaks Turkish apostrophe-suffix patterns — a previously undocumented interaction. Through three iterative training rounds (1.7GB → 10GB → 22GB across 11 domains) and systematic vocabulary-size experiments (16K–64K), a further finding emerged: apparent “diminishing returns” at 48K were due to vocabulary saturation, not data redundancy; moving to 64K on the same corpus yielded a 10.1% improvement. On the 21-sentence benchmark, the resulting 64K tokenizer uses ~14% fewer tokens than Kumru and TabiBERT (both ~50K vocab), and roughly 2.7× fewer than GPT-4. At 128K context length, this corresponds to the equivalent of ~149K Kumru tokens or ~352K GPT-4 tokens of Turkish text.

TABLE OF CONTENTS

1. The Problem: The Language Tax
2. Related Work
3. Novel Finding: GPT-2 Regex Breaks Turkish
4. Architecture Decisions
5. Training Corpus: 3 Iterations
6. Data & Vocabulary Scaling Experiments
7. Head-to-Head: 64K v3 vs Others
8. Domain-Specific Analysis
9. Morphological Analysis
10. Diacritic Robustness
11. Context Window: Compounding Advantage
12. Practical Implications
13. Project Status
14. Reproducibility

1. THE PROBLEM: THE LANGUAGE TAX

When Turkish text is processed by any major LLM, it passes through a tokenizer designed for English. Turkish's agglutinative structure — where meaning is packed into suffixes — is alien to these tokenizers.

Same sentence, different cost

Tokenizer | Vocab Size | Tokens | Ratio
Turkish 64K v3 (this work) | 64,000 | 9 | 1.0x
Kumru-2B | 50,176 | 9 | 1.0x
GPT-4o (o200k) | 200,019 | 12 | 1.3x
GPT-4 (cl100k) | 100,277 | 17 | 1.9x

Test sentence: “Türkiye Cumhuriyeti'nin başkenti Ankara'dır.”

This means every Turkish API call costs roughly twice as many tokens. Context windows hold proportionally less Turkish text, and training runs process fewer sentences per batch. The tax compounds with length.

2. RELATED WORK

Several Turkish language models and tokenizers exist. Hamza (Acikgoz, “Bridging the Bosphorus”) provides Turkish LLMs from 124M to 1.3B parameters, including models adapted from GPT-2 and Mistral; the Hamza tokenizer uses vocabulary size 50,257 — identical to GPT-2 — and is not optimized for Turkish morphology. TabiBERT (Boğaziçi Üniversitesi TabiLab) is a ModernBERT-based encoder trained on 1T tokens for Turkish NLP; vocabulary size 50,176. Kumru-2B uses a 50,176-vocabulary BPE tokenizer. LlamaTurk (METU NLP) adapts LLaMA with a 28K BPE tokenizer trained on Turkish OSCAR.

A consistent pattern is that existing Turkish decoder LLMs converge on ~50K vocabulary: 50,257 (GPT-2 size) when the base model is GPT-2, and 50,176 for others. This appears to follow from adaptation of English-base tokenizers rather than from systematic vocabulary-size experiments for Turkish. No prior work is known to the author that reports (1) the GPT-2 pre-tokenization regex breaking Turkish apostrophe suffixes, (2) vocabulary saturation versus data saturation in tokenizer training, or (3) systematic comparison of 16K–64K vocabulary on the same corpus.

3. NOVEL FINDING: GPT-2 REGEX BREAKS TURKISH

During development, it was discovered that the GPT-2 pre-tokenization regex — the same pattern used by GPT-4, GPT-4o, Llama 3, and Mistral — contains English contraction patterns ('s|'t|'re|'ve|'m|'ll|'d) that actively damage Turkish tokenization.

THE BUG: The pattern 'd (English contraction "I'd") matches the start of Turkish -dA suffixes — one of the most common suffix families in Turkish (locative, ablative). The same problem occurs with 's (conditional suffix), 't, and 'm.

How GPT-4 tokenizes Turkish apostrophe suffixes

Turkish Text | GPT-4 Tokenization | Problem
Ankara'dır | ["Ankara", "'d", "ır"] | 'd steals the d from "dır"
İstanbul'da | ["İstanbul", "'d", "a"] | 'd steals the d from "da"
Ali'den | ["Ali", "'d", "en"] | 'd steals the d from "den"

The fix: cleaned Turkish regex

Turkish Text | Corrected Tokenization | Result
Ankara'dır | ["Ankara", "'", "dır"] | Suffix stays intact
İstanbul'da | ["İstanbul", "'", "da"] | Suffix stays intact
Ali'den | ["Ali", "'", "den"] | Suffix stays intact
Implementation: The solution involves a custom pre-tokenization regex with English contraction patterns stripped, chained with a byte-level encoder configured to bypass its internal regex. This two-stage approach ensures Turkish suffixes remain linguistically intact while maintaining full byte-level coverage. The critical insight is that the byte-level encoder's default behavior re-applies the problematic GPT-2 regex — a subtle interaction that must be explicitly disabled. To the author's knowledge, this specific interaction between the GPT-2 regex and Turkish morphology has not been previously documented.
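The break and the fix can be reproduced with a simplified stand-in for the GPT-2 pattern. This sketch uses only stdlib re, with \w in place of the original \p{L}/\p{N} classes, so it is an approximation for illustration; the actual implementation chains a custom Split pre-tokenizer with a byte-level encoder whose internal regex is disabled, as described above.

```python
import re

# Simplified stand-in for the GPT-2 pre-tokenization pattern:
# English contraction alternates first, then word runs, then punctuation runs.
GPT2_STYLE = re.compile(r"'(?:s|t|re|ve|m|ll|d)| ?\w+| ?[^\s\w]+")

# Turkish-cleaned variant: contraction alternates removed, so the apostrophe
# splits off as ordinary punctuation and the suffix survives intact.
TURKISH_STYLE = re.compile(r" ?\w+|[^\s\w]+")

for word in ["Ankara'dır", "İstanbul'da", "Ali'den"]:
    print(GPT2_STYLE.findall(word), "->", TURKISH_STYLE.findall(word))

# The GPT-2-style pattern yields ["Ankara", "'d", "ır"]: the English
# contraction 'd steals the d from the suffix. The cleaned pattern yields
# ["Ankara", "'", "dır"]: the suffix stays whole.
```

The same alternation-order behavior explains why the bug is silent: the contraction alternates are tried before the generic word pattern, so every apostrophe followed by d, s, t, or m is consumed as an English contraction before Turkish morphology gets a chance.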

4. ARCHITECTURE DECISIONS

Component | Choice | Rationale
Algorithm | Byte-level BPE | Industry standard (GPT-4, Llama 3, Mistral)
Normalization | NFC Unicode | Unifies composed/decomposed forms of ç, ş, ğ, ö, ü, İ
Pre-tokenization | Custom Turkish regex + ByteLevel | GPT-2 style with English contractions removed
Byte-level config | Internal regex disabled | Prevents re-application of problematic patterns
Special tokens | Llama-3 style (7 tokens) | Future instruction-tuning compatibility
Min frequency | 2 | Filters typos/noise without losing rare morphemes
Library | HuggingFace tokenizers (Rust) | Production-grade, fast training
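The NFC choice can be illustrated with stdlib unicodedata (a minimal sketch): a Turkish letter typed as a base letter plus a combining mark and the same letter as a single composed code point normalize to one form, so both spellings map to the same vocabulary entries.

```python
import unicodedata

# Decomposed inputs: base letter followed by a combining mark.
decomposed_to_composed = {
    "c\u0327": "ç",  # c + combining cedilla  -> U+00E7
    "s\u0327": "ş",  # s + combining cedilla  -> U+015F
    "g\u0306": "ğ",  # g + combining breve    -> U+011F
    "o\u0308": "ö",  # o + combining diaeresis -> U+00F6
    "u\u0308": "ü",  # u + combining diaeresis -> U+00FC
}

for raw, composed in decomposed_to_composed.items():
    assert len(raw) == 2 and len(composed) == 1
    # NFC collapses the two-code-point sequence to the single composed letter.
    assert unicodedata.normalize("NFC", raw) == composed

print("NFC unifies composed and decomposed Turkish diacritics")
```

Without this step, byte-level BPE would treat "ç" and "c + combining cedilla" as entirely different byte sequences and learn separate merges for each.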

Special tokens

Token | ID | Purpose
<|begin_of_text|> | 0 | Start of document/sequence
<|end_of_text|> | 1 | End of document/sequence
<|pad|> | 2 | Padding for batch processing
<|unk|> | 3 | Unknown (safety fallback, rarely triggered)
<|start_header_id|> | 4 | Instruction tuning: role header start
<|end_header_id|> | 5 | Instruction tuning: role header end
<|eot_id|> | 6 | Instruction tuning: end of turn

5. TRAINING CORPUS: 3 ITERATIONS

The tokenizer was trained through three iterative rounds, each adding new data domains. This process revealed critical insights about the relationship between corpus diversity and tokenizer quality.

v1: Foundation (1.7 GB, 14 files)

Domain | Source | Size
General Knowledge | Wikipedia TR (520K articles) | 866 MB
Code | Python corpus | 569 MB
Reasoning | Math problems, RAG, chain-of-thought | 221 MB
Literary | TED talks, classic literature, poems, songs, folk, idioms | 46 MB
Vocabulary | TDK dictionary (full + simplified) | 15 MB

v2: Quality Boost (10 GB, 16 files) — curated literary & academic data added

Domain (new) | Source | Size
Cultural/Literary Web | BellaTurca ÖzenliDerlem (1.4M curated docs) | 4.4 GB
Academic/Thesis | BellaTurca AkademikDerlem (668K papers) | 3.5 GB

v3: Domain Coverage (22 GB, 27 files) — new specialized domains added

Domain (new) | Source | Size
News/Journalism | 1.8M news articles + summarization corpus | 4.5 GB
Legal/Law | 700K court decisions + Constitutional Court rulings | 3.7 GB
Instructions | 2.5M instruction-answer pairs | 3.7 GB
Financial | KAP announcements, capital markets (256K docs) | 425 MB
Education | Education QA + MMLU exam questions (8 subjects) | 91 MB
Medical | Medical reasoning + hospital articles | 108 MB
1.7 GB
v1: 14 FILES, 5 DOMAINS
10 GB
v2: 16 FILES, 7 DOMAINS
22 GB
v3: 27 FILES, 11 DOMAINS

6. DATA & VOCABULARY SCALING EXPERIMENTS

Two systematic experiments were conducted: (1) scaling training data from 1.7GB to 22GB at fixed 48K vocabulary, and (2) scaling vocabulary from 48K to 64K on the full 22GB corpus. The combination revealed a critical insight about the interaction between data volume and vocabulary capacity.

Experiment A: Data scaling at 48K vocabulary

Sentence | 48k_v1 | 48k_v2 | 48k_v3 | Kumru
Merhaba dünya, nasılsın? | 6 | 6 | 6 | 8
Evlerdekilere söyleyin, yarın geliyoruz. | 11 | 9 | 9 | 12
Çekoslovakyalılaştıramadıklarımızdan mısınız? | 12 | 9 | 10 | 13
Dün akşam arkadaşlarımla buluştuk... | 15 | 10 | 10 | 15
Spinoza'nın töz ontolojisi... | 33 | 29 | 32 | 30
Sanığın mahkumiyet kararına... (legal) | 12 | 11 | 8 | 12
Anayasa Mahkemesi başvuruyu... (legal) | 10 | 9 | 7 | 11
Hastanın ameliyat sonrası... (medical) | 10 | 8 | 7 | 8
TOTAL (21 sentences) | 261 | 235 | 233 | 267

Totals above use a truncated sentence set. Section 7 reports the same tokenizers on full sentences (192 / 199 / 224).

v1→v2 (1.7GB → 10GB): +10.0% improvement. v2→v3 (10GB → 22GB): +0.9% improvement — apparent diminishing returns.

Experiment B: Vocabulary scaling — the breakthrough

The near-zero improvement from v2→v3 at 48K initially suggested data saturation. However, training a 64K tokenizer on the same v3 corpus revealed a fundamentally different result:

Tokenizer | Data | Total Tokens | vs Kumru
48k_v1 | 1.7 GB | 261 | +2.2%
48k_v2 | 10 GB | 235 | +12.0%
48k_v3 | 22 GB | 233 | +12.7%
64k_v1 | 1.7 GB | 247 | +7.5%
64k_v3 | 22 GB | 222 | +16.9%
Kumru (50k) | ~500 GB | 267 | baseline (truncated set)
+0.9%
48K: v2→v3 (SATURATED)
+10.1%
64K: v1→v3 (ABSORBING)
+4.7%
64K vs 48K (SAME DATA)
KEY FINDING: Vocabulary Saturation, Not Data Saturation.

The "diminishing returns" observed at 48K were not caused by redundant data; they were caused by a full vocabulary. At 48,000 vocabulary slots, the tokenizer had no room left to encode new domain-specific patterns from the legal, medical, and financial data added in v3.

When the same 22GB corpus was used to train a 64K tokenizer, the extra 16,000 vocabulary slots absorbed the domain vocabulary that 48K had to discard, producing a 10.1% improvement (64k_v1→64k_v3) on the same data that only yielded 0.9% at 48K.

Implication: Vocabulary size and training data must be scaled together. Adding data without vocabulary capacity, or vocabulary without data diversity, both produce diminishing returns. The optimal tokenizer requires both sufficient vocabulary slots and sufficiently diverse training data to fill them.
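The percentages quoted in this section can be re-derived directly from the truncated-set token totals in the tables above:

```python
# Token totals on the truncated 21-sentence set (from the tables above).
totals = {"48k_v1": 261, "48k_v2": 235, "48k_v3": 233,
          "64k_v1": 247, "64k_v3": 222, "kumru": 267}

def improvement(old, new):
    """Percent reduction in token count going from `old` to `new`."""
    return round((old - new) / old * 100, 1)

print(improvement(totals["48k_v1"], totals["48k_v2"]))  # 48K v1 -> v2: 10.0
print(improvement(totals["48k_v2"], totals["48k_v3"]))  # 48K v2 -> v3: 0.9 (saturated)
print(improvement(totals["64k_v1"], totals["64k_v3"]))  # 64K v1 -> v3: 10.1 (absorbing)
print(improvement(totals["48k_v3"], totals["64k_v3"]))  # 64K vs 48K, same data: 4.7
print(improvement(totals["kumru"], totals["64k_v3"]))   # 64k_v3 vs Kumru: 16.9
```

The contrast between the second and third lines is the key finding in one place: the same 1.7 GB to 22 GB data increase yields 0.9% at 48K but 10.1% at 64K.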

7. HEAD-TO-HEAD: 64K v3 vs TURKISH & ENGLISH TOKENIZERS

Comparison across tokenizers on 21 test sentences covering daily speech, formal language, agglutination, code, and six specialized domains. Turkish tokenizers (this work, Kumru, TabiBERT, Hamza) were evaluated on the same full-sentence set; GPT-4/GPT-4o use different tokenizers and are included for reference.

Test Sentence | 64k v3 | 48k v3 | Kumru | TabiBERT | Hamza | GPT-4o | GPT-4
Merhaba dünya, nasılsın? | 6 | 6 | 7 | 7 | 14 | 9 | 11
Türkiye Cumhuriyeti'nin başkenti Ankara'dır. | 9 | 9 | 8 | 8 | 21 | 12 | 17
Evlerdekilere söyleyin, yarın geliyoruz. | 8 | 9 | 11 | 11 | 21 | 12 | 18
Çekoslovakyalılaştıramadıklarımızdan mısınız? | 9 | 10 | 12 | 12 | 29 | 19 | 21
Görüşebileceğimizi umuyorum. | 5 | 6 | 6 | 6 | 15 | 11 | 14
Dün akşam arkadaşlarımla buluştuk. | 5 | 5 | 9 | 9 | 20 | 20 | 25
Edebiyatımızın en önemli eserlerinden... | 15 | 16 | 16 | 16 | 42 | 27 | 40
Osmanlı İmparatorluğu'nun son... | 12 | 12 | 11 | 11 | 47 | 28 | 43
Spinoza'nın töz ontolojisi... | 17 | 17 | 16 | 16 | 33 | 37 | 53
def __init__(self, value): | 8 | 8 | 11 | 11 | 9 | 8 | 8
for i in range(len(dataset)): | 9 | 9 | 13 | 13 | 12 | 7 | 7
Makine öğrenmesi algoritmalarının... | 10 | 10 | 11 | 11 | 36 | 20 | 33
Büyükşehir belediyesi toplu taşıma... | 8 | 9 | 8 | 8 | 28 | 17 | 28
İstanbul'dan Ankara'ya tren... | 11 | 11 | 11 | 11 | 19 | 14 | 16
2024 yılında Türkiye'nin nüfusu... | 11 | 11 | 15 | 15 | 32 | 15 | 26
Sanığın mahkumiyet kararına... (legal) | 7 | 7 | 11 | 11 | 26 | 17 | 24
Anayasa Mahkemesi başvuruyu... (legal) | 6 | 7 | 10 | 10 | 23 | 16 | 20
Hastanın ameliyat sonrası... (medical) | 7 | 7 | 7 | 7 | 30 | 15 | 26
Şirketin halka arz sürecinde... (finance) | 11 | 11 | 11 | 11 | 36 | 20 | 30
Fotosentez sırasında... (science) | 11 | 11 | 12 | 12 | 36 | 29 | 38
Cumhurbaşkanlığı Sözcüsü basın... | 7 | 8 | 8 | 8 | 39 | 18 | 29
TOTAL (21 sentences) | 192 | 199 | 224 | 224 | 568 | 371 | 527

Totals from benchmark_tokenizers.py on 21 full sentences. Hamza uses GPT-2 tokenizer (50,257 vocab); Kumru and TabiBERT use ~50K BPE.

Observation: Kumru and TabiBERT produce identical token counts on every sentence in this benchmark (same vocabulary size 50,176; same total 224). Exact agreement across all 21 sentences is uncommon for independently trained BPE tokenizers. The finding is reported here without further interpretation.

Extended benchmark: 104 sentences (21 core + 83 hard/edgy)

The same tokenizers were run on an extended set: the 21 core sentences above plus 83 “hard” sentences (long agglutination, legal/medical/financial phrasing, colloquial/slang, numbers and dates, code snippets, punctuation and abbreviations, loanwords, case/diacritic edge cases). All counts from benchmark_tokenizers.py.

Tokenizer | Total tokens (104 sentences) | vs best
64k v3 | 1,041 | baseline (best)
48k v3 | 1,073 | +3.1%
32k v2 | 1,163 | +11.7%
16k v1 | 1,359 | +30.5%
Kumru | 1,198 | +15.1%
TabiBERT | 1,198 | +15.1%
Hamza | 2,451 | +135.4%

64K remains best on the extended set; Kumru and TabiBERT again match each other (1,198). The hard set includes e.g. Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine, legal (HMK 353, tahkim), medical (pankreatikoduodenektomi, kardiyovasküler), financial (BIST 100, SPK), slang (N'olcak, bişey), code (return {'key': value}), and loanwords (Startup'lar, API endpoint'i).

192
THIS WORK (64K v3)
224
KUMRU / TABIBERT (50K)
568
HAMZA (GPT-2 TOKENIZER)
527
GPT-4 (100K)
Result: On the same 21-sentence set, the 64K tokenizer uses 192 tokens versus 224 for Kumru and TabiBERT (~14% fewer) and 568 for Hamza (~66% fewer). Hamza’s tokenizer has vocabulary size 50,257 (identical to GPT-2) and is not optimized for Turkish. The 64K tokenizer matches or outperforms GPT-4/GPT-4o on code and shows substantial gains on legal, medical, and news domains. Kumru and TabiBERT perform similarly to each other; both use ~50K BPE.

8. DOMAIN-SPECIFIC ANALYSIS

Domain-targeted training data produces measurable improvements in specialized vocabulary handling. Below are token-level comparisons for each new domain.

Legal Turkish

Test sentence: "Anayasa Mahkemesi başvuruyu oybirliğiyle reddetti."

64k v3 (6 tokens): Anayasa | Mahkemesi | başvuruyu | oybirliğiyle | reddetti | .
Kumru (10 tokens): Anayasa | Mahkemesi | başvur | uyu | oy | bir | liğiyle | reddet | ti | .
TabiBERT (10 tokens): same as Kumru
Hamza (23 tokens): GPT-2 tokenizer
GPT-4 (20 tokens): fragmented into sub-word pieces

başvuruyu (the application) and oybirliğiyle (unanimously) are each single tokens in 64K. Kumru and TabiBERT fragment the first into 2 pieces and the second into 3 pieces. reddetti (rejected) is also a single token — Kumru/TabiBERT need 2 (reddet | ti). Result: 6 vs 10 (Kumru/TabiBERT), 6 vs 23 (Hamza).

Medical Turkish

Test sentence: "Hastanın ameliyat sonrası komplikasyon riski değerlendirilmelidir."

64k v3 (7 tokens): Hastanın | ameliyat | sonrası | komplikasyon | riski | değerlendirilmelidir | .
Kumru (7 tokens): Hastanın | ameliyat | sonrası | komplikasyon | riski | değerlendirilmelidir | .
TabiBERT (7 tokens): same as Kumru
Hamza (30 tokens): GPT-2 tokenizer
GPT-4 (26 tokens): fragmented into sub-word pieces

Hastanın (of the patient) is a single token. değerlendirilmelidir (must be evaluated) — a 6-morpheme suffix chain — is also a single token. Kumru/TabiBERT tie at 7; Hamza 30; GPT-4 26.

Financial Turkish

Test sentence: "Şirketin halka arz sürecinde sermaye piyasası kurulu onayı gerekmektedir."

64k v3 (11 tokens): Şirket | in | halka | arz | sürecinde | sermaye | piyasası | kurulu | onayı | gerekmektedir | .
Kumru (11 tokens): Şirket | in | halka | arz | sürecinde | sermaye | piyasası | kurulu | onayı | gerekmektedir | .
TabiBERT (11 tokens): same as Kumru
Hamza (36 tokens): GPT-2 tokenizer
GPT-4 (30 tokens): fragmented into sub-word pieces

News/Journalism Turkish

Test sentence: "Cumhurbaşkanlığı Sözcüsü basın toplantısında açıklamalarda bulundu."

64k v3 (7 tokens): Cumhurbaşkanlığı | Sözcüsü | basın | toplantısında | açıklamalarda | bulundu | .
Kumru (8 tokens): Cumhurbaşkanlığı | Sözc | üsü | basın | toplantısında | açıklamalarda | bulundu | .
TabiBERT (8 tokens): same as Kumru
Hamza (39 tokens): GPT-2 tokenizer
GPT-4 (29 tokens): fragmented into sub-word pieces

Cumhurbaşkanlığı (Presidency) and Sözcüsü (Spokesperson) are each single tokens in 64K. Kumru and TabiBERT split Sözcüsü into 2 pieces; Hamza 39 tokens. The extra vocabulary capacity allows 64K to capture these high-frequency institutional terms as atomic units.

9. MORPHOLOGICAL ANALYSIS

The tokenizer learned Turkish morphology from pure statistics — no linguistic rules were programmed. BPE naturally discovered morpheme-like boundaries through frequency analysis of 22GB of text.
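The mechanism behind this can be sketched with a minimal greedy BPE trainer (toy corpus and frequencies invented purely for illustration; the actual tokenizer was trained with the HuggingFace Rust trainer on 22 GB): frequent inflected forms collapse into single tokens from pair statistics alone, while rare words stay fragmented.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens, freq in vocab.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = {}
        for tokens, freq in vocab.items():
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return vocab, merges

# Toy corpus: invented frequencies for inflected forms of "ev" (house),
# plus one rare word that never earns its own merges.
corpus = {"evde": 50, "evler": 40, "evden": 30, "eve": 20, "gelde": 5}
vocab, merges = train_bpe(corpus, 8)
print(sorted(vocab))  # the frequent ev-forms end up as single tokens
```

After eight merges the high-frequency forms evde, evler, evden, and eve are each a single token, with no morphological rules anywhere in the code; the rare gelde remains in pieces. Scaled up to 22 GB, the same statistics yield the morpheme-like boundaries shown below.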

Verb morphology (learned, not coded)

geliyorum → gel | iyorum (stem + present continuous, 1st person singular)
geldim → gel | dim (stem + past tense, 1st person singular)
gelecek → gelecek (single token; very common word)
gelmiş → gelmiş (single token; common evidential)
geliyoruz → geliyoruz (single token; common 1st person plural)

Noun case suffixes

ev (house) → ev (1 token)
evde (in the house) → evde (1 token)
evden (from the house) → evden (1 token)
eve (to the house) → eve (1 token)
evin (of the house) → evin (1 token)
evler (houses) → evler (1 token)

Six different grammatical forms of "ev" — all encoded as single tokens.

Suffix chain handling

değerlendirilmelidir (must be evaluated) → değerlendirilmelidir (1 token)
larımızdan (from our ...s) → larımızdan (1 token)
gidebilirsiniz (you can go) → gidebilirsiniz (1 token)
oybirliğiyle (unanimously) → oybirliğiyle (1 token)

10. DIACRITIC ROBUSTNESS

Turkish users sometimes type without diacritics (c instead of ç, s instead of ş, i instead of ı). The tokenizer handles both forms, but correct Turkish is significantly more token-efficient — by design.

Correct Turkish | Tokens | Without Diacritics | Tokens | Extra Cost
şehir | 1 | sehir | 3 | +2
büyükşehir | 2 | buyuksehir | 6 | +4
Türkiye | 1 | Turkiye | 2 | +1
öğrenci | 1 | ogrenci | 3 | +2
günaydın | 2 | gunaydin | 3 | +1
Design choice: Diacritic-free Turkish was deliberately excluded from the training corpus. This ensures that input without diacritics is still processable (nothing breaks — byte-level BPE can represent anything) but costs more tokens. The model trained on this tokenizer will accept sloppy input while always producing correct Turkish in its output — because correct Turkish is all it was trained on.

11. CONTEXT WINDOW: THE COMPOUNDING ADVANTAGE

Tokenizer efficiency is not a fixed saving — it is a multiplier on context length. The longer the context window, the more the advantage compounds. This has direct architectural implications for the target model.

Effective context capacity

At each context length, the 64K tokenizer holds significantly more Turkish text than competitors could fit in the same number of token slots:

Context Length | This Work (64K) | Kumru Equivalent | GPT-4 Equivalent | Extra Text Capacity
2,048 tokens | 2,048 | ~2,387 | ~5,627 | +339 vs Kumru
4,096 tokens | 4,096 | ~4,773 | ~11,253 | +677 vs Kumru
32,768 tokens | 32,768 | ~38,187 | ~90,027 | +5,419 vs Kumru
128,000 tokens | 128,000 | ~149,333 | ~351,667 | +21,333 vs Kumru

“Kumru Equivalent” = how many Kumru tokens would be needed to hold the same amount of Turkish text. Calculated from the efficiency gaps measured in Section 7.
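The Kumru-equivalent figures follow from the 21-sentence ratio in Section 7 (224 Kumru tokens vs 192 for this work); a quick check of the 128K row:

```python
# Efficiency ratio measured on the 21-sentence benchmark (Section 7).
KUMRU_TOKENS, THIS_WORK_TOKENS = 224, 192
ratio = KUMRU_TOKENS / THIS_WORK_TOKENS  # ~1.167 Kumru tokens per token here

context = 128_000
kumru_equivalent = round(context * ratio)  # Kumru tokens for the same text
extra_capacity = kumru_equivalent - context

print(kumru_equivalent, extra_capacity)  # 149333 21333, matching the table
```

The same multiplication reproduces each row of the Kumru column, which is why the advantage is described as compounding: it is a fixed ratio applied to an ever larger context.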

Architectural implication: small model, massive context

The efficiency advantage fundamentally changes the optimal model architecture for Turkish. Two strategies were considered:

Strategy | Parameters | Context | Turkish Text Capacity | Trainability
Large model, short context | 7B | 4,096 | ~3–4 pages | Requires 40–80 GB VRAM
Small model, long context | 1–2B | 128K | ~entire book | Trainable on consumer hardware
Strategic decision: A 1–2B parameter model with 128K context length is selected as the target architecture. With the 64K tokenizer, this configuration holds Turkish text equivalent to what a Kumru-tokenized model would need ~149K tokens for, or what GPT-4 would need ~352K tokens for. Most existing Turkish language models offer context lengths of 4K–8K tokens. A native-tokenized model with 128K context would be capable of processing entire legal case files, academic theses, or literary works in a single pass.

The embedding layer overhead of 64K vocabulary at 1B scale is approximately 3.3% of total parameters — a negligible cost for a permanent ~14% efficiency advantage over Kumru/TabiBERT on every token processed.
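One way to reproduce the ~3.3% figure (an assumption, since this report does not state a hidden size; 2048 is a common choice at 1–2B scale) is to read it as the cost of the 16K extra embedding rows that a 64K vocabulary adds over a 48K one in a 1B-parameter model:

```python
# Assumed hidden size for a ~1B-parameter model (not stated in the report).
hidden_size = 2048

extra_rows = 64_000 - 48_000             # additional vocabulary slots vs 48K
extra_params = extra_rows * hidden_size  # additional embedding parameters
overhead_pct = extra_params / 1e9 * 100  # share of a 1B-parameter budget

print(f"{extra_params:,} extra embedding parameters = {overhead_pct:.1f}% of 1B")
```

Under this assumption the larger vocabulary costs about 33M parameters, which is the "negligible cost" traded against the permanent per-token efficiency gain.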

12. PRACTICAL IMPLICATIONS

~2.7×
FEWER TOKENS (vs GPT-4, 21 SENT.)
~14%
FEWER TOKENS vs KUMRU/TABIBERT
66%
FEWER TOKENS vs HAMZA (GPT-2)
11
DOMAINS COVERED

Turkish text processed by English-centric tokenizers incurs a roughly 2× token penalty across context length, speed, cost, and training efficiency. A native Turkish tokenizer eliminates this tax entirely.

The tokenizer covers 11 specialized domains (general, academic, legal, medical, financial, education, news, code, literary, reasoning, instructions), ensuring efficient tokenization regardless of subject matter.

13. PROJECT STATUS

Phase | Status | Key Result
Phase 1: Tokenizer | COMPLETE | 64K vocab, ~14% fewer tokens than Kumru/TabiBERT, ~2.7× vs GPT-4, 11 domains
Phase 2: Architecture | NEXT | 1–2B parameters, 128K context target
Phase 3: Pre-training | NEXT | Language learning from Turkish corpus
Phase 4: Fine-tuning | NEXT | Instruction following, chat capability

14. REPRODUCIBILITY

Code, data sources, and trained tokenizers are available.

Conclusion. A purpose-built Turkish tokenizer is a prerequisite for efficient Turkish language modeling. The efficiency penalty imposed by English-centric tokenizers is measurable and eliminable. Three findings are emphasized: (1) the GPT-2 pre-tokenization regex breaks Turkish apostrophe-suffix patterns — an interaction not previously documented; (2) vocabulary saturation, not data redundancy, explains apparent diminishing returns at 48K — with implications for agglutinative languages; and (3) tokenizer efficiency compounds with context length, favoring smaller, long-context architectures. This report documents Phase 1 of an independent effort to build a Turkish language–native LLM from scratch.

A special note of gratitude is owed to Kumru AI: their Turkish LLM’s well-documented limitations in reasoning and Turkish morphology provided the initial motivation to build a proper Turkish language model from scratch. Hamza (emrecanacikgoz) and TabiBERT (boun-tabilab) tokenizers were also compared; see Section 7 and benchmark_tokenizers.py.

© 2026 • Independent Research