
NATIVE TURKISH BPE TOKENIZER

Toward a 'Proper' Turkish Language-Native LLM: Phase 1 — Tokenization

February 2026 • Independent Research • COMPLETE

~14%
FEWER TOKENS THAN KUMRU/TABIBERT (50K)
~2.7×
FEWER TOKENS THAN GPT-4 (21 SENTENCES)
64K
VOCABULARY SIZE
22 GB
TRAINING CORPUS (27 FILES)
Abstract. Most widely used large language models rely on tokenizers designed for English. When Turkish text is processed by such systems, it is tokenized roughly 2× less efficiently — a hidden “language tax” affecting tens of millions of speakers. This report documents the construction of a purpose-built Turkish BPE tokenizer and the experiments that led to it. During development, it was found that the GPT-2 pre-tokenization regex (used by GPT-4, Llama, and Mistral) breaks Turkish apostrophe-suffix patterns — a previously undocumented interaction. Through three iterative training rounds (1.7GB → 10GB → 22GB across 11 domains) and systematic vocabulary-size experiments (16K–64K), a further finding emerged: apparent “diminishing returns” at 48K were due to vocabulary saturation, not data redundancy; moving to 64K on the same corpus yielded a 10.1% improvement. On the 21-sentence benchmark, the resulting 64K tokenizer uses ~14% fewer tokens than Kumru and TabiBERT (both ~50K vocab), and roughly 2.7× fewer than GPT-4. At 128K context length, this corresponds to the equivalent of ~149K Kumru tokens or ~352K GPT-4 tokens of Turkish text.

TABLE OF CONTENTS

1. The Problem: The Language Tax
2. Related Work
3. Novel Finding: GPT-2 Regex Breaks Turkish
4. Architecture Decisions
5. Training Corpus: 3 Iterations
6. Data & Vocabulary Scaling Experiments
7. Head-to-Head: 64K v3 vs Others
8. Domain-Specific Analysis
9. Morphological Analysis
10. Diacritic Robustness
11. Context Window: Compounding Advantage
12. Practical Implications
13. Project Status
14. Reproducibility

1. THE PROBLEM: THE LANGUAGE TAX

When Turkish text is processed by any major LLM, it passes through a tokenizer designed for English. Turkish's agglutinative structure — where meaning is packed into suffixes — is alien to these tokenizers.

Same sentence, different cost

Tokenizer | Vocab Size | Tokens | Ratio
Turkish 64K v3 (this work) | 64,000 | 9 | 1.0x
Kumru-2B | 50,176 | 9 | 1.0x
GPT-4o (o200k) | 200,019 | 12 | 1.3x
GPT-4 (cl100k) | 100,277 | 17 | 1.9x

Test sentence: “Türkiye Cumhuriyeti'nin başkenti Ankara'dır.”

This means every Turkish API call costs roughly twice as many tokens. Context windows hold proportionally less Turkish text, and training runs process fewer sentences per batch. The tax compounds with length.

2. RELATED WORK

Several Turkish language models and tokenizers exist. Hamza (Acikgoz, “Bridging the Bosphorus”) provides Turkish LLMs from 124M to 1.3B parameters, including models adapted from GPT-2 and Mistral; the Hamza tokenizer uses vocabulary size 50,257 — identical to GPT-2 — and is not optimized for Turkish morphology. TabiBERT (Boğaziçi Üniversitesi TabiLab) is a ModernBERT-based encoder trained on 1T tokens for Turkish NLP; vocabulary size 50,176. Kumru-2B uses a 50,176-vocabulary BPE tokenizer. LlamaTurk (METU NLP) adapts LLaMA with a 28K BPE tokenizer trained on Turkish OSCAR.

A consistent pattern is that existing Turkish decoder LLMs converge on ~50K vocabulary: 50,257 (GPT-2 size) when the base model is GPT-2, and 50,176 for others. This appears to follow from adaptation of English-base tokenizers rather than from systematic vocabulary-size experiments for Turkish. No prior work is known to the author that reports (1) the GPT-2 pre-tokenization regex breaking Turkish apostrophe suffixes, (2) vocabulary saturation versus data saturation in tokenizer training, or (3) systematic comparison of 16K–64K vocabulary on the same corpus.

3. NOVEL FINDING: GPT-2 REGEX BREAKS TURKISH

During development, it was discovered that the GPT-2 pre-tokenization regex — the same pattern used by GPT-4, GPT-4o, Llama 3, and Mistral — contains English contraction patterns ('s|'t|'re|'ve|'m|'ll|'d) that actively damage Turkish tokenization.

THE BUG: The pattern 'd (English contraction "I'd") matches the start of Turkish -dA suffixes — one of the most common suffix families in Turkish (locative, ablative). The same problem occurs with 's (conditional suffix), 't, and 'm.

How GPT-4 tokenizes Turkish apostrophe suffixes

Turkish Text | GPT-4 Tokenization | Problem
Ankara'dır | ["Ankara", "'d", "ır"] | 'd steals the d from "dır"
İstanbul'da | ["İstanbul", "'d", "a"] | 'd steals the d from "da"
Ali'den | ["Ali", "'d", "en"] | 'd steals the d from "den"

The fix: cleaned Turkish regex

Turkish Text | Corrected Tokenization | Result
Ankara'dır | ["Ankara", "'", "dır"] | Suffix stays intact
İstanbul'da | ["İstanbul", "'", "da"] | Suffix stays intact
Ali'den | ["Ali", "'", "den"] | Suffix stays intact
Implementation: The solution involves a custom pre-tokenization regex with English contraction patterns stripped, chained with a byte-level encoder configured to bypass its internal regex. This two-stage approach ensures Turkish suffixes remain linguistically intact while maintaining full byte-level coverage. The critical insight is that the byte-level encoder's default behavior re-applies the problematic GPT-2 regex — a subtle interaction that must be explicitly disabled. To the author's knowledge, this specific interaction between the GPT-2 regex and Turkish morphology has not been previously documented.
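The break and the fix can be reproduced with a simplified stand-in for the GPT-2 pattern. This sketch uses only stdlib re, with \w in place of the original \p{L}/\p{N} classes, so it is an approximation for illustration; the actual implementation chains a custom Split pre-tokenizer with a byte-level encoder whose internal regex is disabled, as described above.

```python
import re

# Simplified stand-in for the GPT-2 pre-tokenization pattern:
# English contraction alternates first, then word runs, then punctuation runs.
GPT2_STYLE = re.compile(r"'(?:s|t|re|ve|m|ll|d)| ?\w+| ?[^\s\w]+")

# Turkish-cleaned variant: contraction alternates removed, so the apostrophe
# splits off as ordinary punctuation and the suffix survives intact.
TURKISH_STYLE = re.compile(r" ?\w+|[^\s\w]+")

for word in ["Ankara'dır", "İstanbul'da", "Ali'den"]:
    print(GPT2_STYLE.findall(word), "->", TURKISH_STYLE.findall(word))

# The GPT-2-style pattern yields ["Ankara", "'d", "ır"]: the English
# contraction 'd steals the d from the suffix. The cleaned pattern yields
# ["Ankara", "'", "dır"]: the suffix stays whole.
```

The same alternation-order behavior explains why the bug is silent: the contraction alternates are tried before the generic word pattern, so every apostrophe followed by d, s, t, or m is consumed as an English contraction before Turkish morphology gets a chance.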

4. ARCHITECTURE DECISIONS

Component | Choice | Rationale
Algorithm | Byte-level BPE | Industry standard (GPT-4, Llama 3, Mistral)
Normalization | NFC Unicode | Unifies composed/decomposed forms of ç, ş, ğ, ö, ü, İ
Pre-tokenization | Custom Turkish regex + ByteLevel | GPT-2 style with English contractions removed
Byte-level config | Internal regex disabled | Prevents re-application of problematic patterns
Special tokens | Llama-3 style (7 tokens) | Future instruction-tuning compatibility
Min frequency | 2 | Filters typos/noise without losing rare morphemes
Library | HuggingFace tokenizers (Rust) | Production-grade, fast training
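The NFC choice can be illustrated with stdlib unicodedata (a minimal sketch): a Turkish letter typed as a base letter plus a combining mark and the same letter as a single composed code point normalize to one form, so both spellings map to the same vocabulary entries.

```python
import unicodedata

# Decomposed inputs: base letter followed by a combining mark.
decomposed_to_composed = {
    "c\u0327": "ç",  # c + combining cedilla  -> U+00E7
    "s\u0327": "ş",  # s + combining cedilla  -> U+015F
    "g\u0306": "ğ",  # g + combining breve    -> U+011F
    "o\u0308": "ö",  # o + combining diaeresis -> U+00F6
    "u\u0308": "ü",  # u + combining diaeresis -> U+00FC
}

for raw, composed in decomposed_to_composed.items():
    assert len(raw) == 2 and len(composed) == 1
    # NFC collapses the two-code-point sequence to the single composed letter.
    assert unicodedata.normalize("NFC", raw) == composed

print("NFC unifies composed and decomposed Turkish diacritics")
```

Without this step, byte-level BPE would treat "ç" and "c + combining cedilla" as entirely different byte sequences and learn separate merges for each.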

Special tokens

Token | ID | Purpose
<|begin_of_text|> | 0 | Start of document/sequence
<|end_of_text|> | 1 | End of document/sequence
<|pad|> | 2 | Padding for batch processing
<|unk|> | 3 | Unknown (safety fallback, rarely triggered)
<|start_header_id|> | 4 | Instruction tuning: role header start
<|end_header_id|> | 5 | Instruction tuning: role header end
<|eot_id|> | 6 | Instruction tuning: end of turn

5. TRAINING CORPUS: 3 ITERATIONS

The tokenizer was trained through three iterative rounds, each adding new data domains. This process revealed critical insights about the relationship between corpus diversity and tokenizer quality.

v1: Foundation (1.7 GB, 14 files)

Domain | Source | Size
General Knowledge | Wikipedia TR (520K articles) | 866 MB
Code | Python corpus | 569 MB
Reasoning | Math problems, RAG, chain-of-thought | 221 MB
Literary | TED talks, classic literature, poems, songs, folk, idioms | 46 MB
Vocabulary | TDK dictionary (full + simplified) | 15 MB

v2: Quality Boost (10 GB, 16 files) — curated literary & academic data added

Domain (new) | Source | Size
Cultural/Literary Web | BellaTurca ÖzenliDerlem (1.4M curated docs) | 4.4 GB
Academic/Thesis | BellaTurca AkademikDerlem (668K papers) | 3.5 GB

v3: Domain Coverage (22 GB, 27 files) — new specialized domains added

Domain (new) | Source | Size
News/Journalism | 1.8M news articles + summarization corpus | 4.5 GB
Legal/Law | 700K court decisions + Constitutional Court rulings | 3.7 GB
Instructions | 2.5M instruction-answer pairs | 3.7 GB
Financial | KAP announcements, capital markets (256K docs) | 425 MB
Education | Education QA + MMLU exam questions (8 subjects) | 91 MB
Medical | Medical reasoning + hospital articles | 108 MB
1.7 GB
v1: 14 FILES, 5 DOMAINS
10 GB
v2: 16 FILES, 7 DOMAINS
22 GB
v3: 27 FILES, 11 DOMAINS

6. DATA & VOCABULARY SCALING EXPERIMENTS

Two systematic experiments were conducted: (1) scaling training data from 1.7GB to 22GB at fixed 48K vocabulary, and (2) scaling vocabulary from 48K to 64K on the full 22GB corpus. The combination revealed a critical insight about the interaction between data volume and vocabulary capacity.

Experiment A: Data scaling at 48K vocabulary

Sentence | 48k_v1 | 48k_v2 | 48k_v3 | Kumru
Merhaba dünya, nasılsın? | 6 | 6 | 6 | 8
Evlerdekilere söyleyin, yarın geliyoruz. | 11 | 9 | 9 | 12
Çekoslovakyalılaştıramadıklarımızdan mısınız? | 12 | 9 | 10 | 13
Dün akşam arkadaşlarımla buluştuk... | 15 | 10 | 10 | 15
Spinoza'nın töz ontolojisi... | 33 | 29 | 32 | 30
Sanığın mahkumiyet kararına... (legal) | 12 | 11 | 8 | 12
Anayasa Mahkemesi başvuruyu... (legal) | 10 | 9 | 7 | 11
Hastanın ameliyat sonrası... (medical) | 10 | 8 | 7 | 8
TOTAL (21 sentences) | 261 | 235 | 233 | 267

Totals above use a truncated sentence set. Section 7 reports the same tokenizers on full sentences (192 / 199 / 224).

v1→v2 (1.7GB → 10GB): +10.0% improvement. v2→v3 (10GB → 22GB): +0.9% improvement — apparent diminishing returns.

Experiment B: Vocabulary scaling — the breakthrough

The near-zero improvement from v2→v3 at 48K initially suggested data saturation. However, training a 64K tokenizer on the same v3 corpus revealed a fundamentally different result:

Tokenizer | Data | Total Tokens | vs Kumru
48k_v1 | 1.7 GB | 261 | +2.2%
48k_v2 | 10 GB | 235 | +12.0%
48k_v3 | 22 GB | 233 | +12.7%
64k_v1 | 1.7 GB | 247 | +7.5%
64k_v3 | 22 GB | 222 | +16.9%
Kumru (50k) | ~500 GB | 267 | baseline (truncated set)
+0.9%
48K: v2→v3 (SATURATED)
+10.1%
64K: v1→v3 (ABSORBING)
+4.7%
64K vs 48K (SAME DATA)
KEY FINDING: Vocabulary Saturation, Not Data Saturation.

The "diminishing returns" observed at 48K were not caused by redundant data; they were caused by a full vocabulary. At 48,000 vocabulary slots, the tokenizer had no room left to encode new domain-specific patterns from the legal, medical, and financial data added in v3.

When the same 22GB corpus was used to train a 64K tokenizer, the extra 16,000 vocabulary slots absorbed the domain vocabulary that 48K had to discard, producing a 10.1% improvement (64k_v1→64k_v3) on the same data that only yielded 0.9% at 48K.

Implication: Vocabulary size and training data must be scaled together. Adding data without vocabulary capacity, or vocabulary without data diversity, both produce diminishing returns. The optimal tokenizer requires both sufficient vocabulary slots and sufficiently diverse training data to fill them.
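The percentages quoted in this section can be re-derived directly from the truncated-set token totals in the tables above:

```python
# Token totals on the truncated 21-sentence set (from the tables above).
totals = {"48k_v1": 261, "48k_v2": 235, "48k_v3": 233,
          "64k_v1": 247, "64k_v3": 222, "kumru": 267}

def improvement(old, new):
    """Percent reduction in token count going from `old` to `new`."""
    return round((old - new) / old * 100, 1)

print(improvement(totals["48k_v1"], totals["48k_v2"]))  # 48K v1 -> v2: 10.0
print(improvement(totals["48k_v2"], totals["48k_v3"]))  # 48K v2 -> v3: 0.9 (saturated)
print(improvement(totals["64k_v1"], totals["64k_v3"]))  # 64K v1 -> v3: 10.1 (absorbing)
print(improvement(totals["48k_v3"], totals["64k_v3"]))  # 64K vs 48K, same data: 4.7
print(improvement(totals["kumru"], totals["64k_v3"]))   # 64k_v3 vs Kumru: 16.9
```

The contrast between the second and third lines is the key finding in one place: the same 1.7 GB to 22 GB data increase yields 0.9% at 48K but 10.1% at 64K.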

7. HEAD-TO-HEAD: 64K v3 vs TURKISH & ENGLISH TOKENIZERS

Comparison across tokenizers on 21 test sentences covering daily speech, formal language, agglutination, code, and six specialized domains. Turkish tokenizers (this work, Kumru, TabiBERT, Hamza) were evaluated on the same full-sentence set; GPT-4/GPT-4o use different tokenizers and are included for reference.

Test Sentence | 64k v3 | 48k v3 | Kumru | TabiBERT | Hamza | GPT-4o | GPT-4
Merhaba dünya, nasılsın? | 6 | 6 | 7 | 7 | 14 | 9 | 11
Türkiye Cumhuriyeti'nin başkenti Ankara'dır. | 9 | 9 | 8 | 8 | 21 | 12 | 17
Evlerdekilere söyleyin, yarın geliyoruz. | 8 | 9 | 11 | 11 | 21 | 12 | 18
Çekoslovakyalılaştıramadıklarımızdan mısınız? | 9 | 10 | 12 | 12 | 29 | 19 | 21
Görüşebileceğimizi umuyorum. | 5 | 6 | 6 | 6 | 15 | 11 | 14
Dün akşam arkadaşlarımla buluştuk. | 5 | 5 | 9 | 9 | 20 | 20 | 25
Edebiyatımızın en önemli eserlerinden... | 15 | 16 | 16 | 16 | 42 | 27 | 40
Osmanlı İmparatorluğu'nun son... | 12 | 12 | 11 | 11 | 47 | 28 | 43
Spinoza'nın töz ontolojisi... | 17 | 17 | 16 | 16 | 33 | 37 | 53
def __init__(self, value): | 8 | 8 | 11 | 11 | 9 | 8 | 8
for i in range(len(dataset)): | 9 | 9 | 13 | 13 | 12 | 7 | 7
Makine öğrenmesi algoritmalarının... | 10 | 10 | 11 | 11 | 36 | 20 | 33
Büyükşehir belediyesi toplu taşıma... | 8 | 9 | 8 | 8 | 28 | 17 | 28
İstanbul'dan Ankara'ya tren... | 11 | 11 | 11 | 11 | 19 | 14 | 16
2024 yılında Türkiye'nin nüfusu... | 11 | 11 | 15 | 15 | 32 | 15 | 26
Sanığın mahkumiyet kararına... (legal) | 7 | 7 | 11 | 11 | 26 | 17 | 24
Anayasa Mahkemesi başvuruyu... (legal) | 6 | 7 | 10 | 10 | 23 | 16 | 20
Hastanın ameliyat sonrası... (medical) | 7 | 7 | 7 | 7 | 30 | 15 | 26
Şirketin halka arz sürecinde... (finance) | 11 | 11 | 11 | 11 | 36 | 20 | 30
Fotosentez sırasında... (science) | 11 | 11 | 12 | 12 | 36 | 29 | 38
Cumhurbaşkanlığı Sözcüsü basın... | 7 | 8 | 8 | 8 | 39 | 18 | 29
TOTAL (21 sentences) | 192 | 199 | 224 | 224 | 568 | 371 | 527

Totals from benchmark_tokenizers.py on 21 full sentences. Hamza uses GPT-2 tokenizer (50,257 vocab); Kumru and TabiBERT use ~50K BPE.

Observation: Kumru and TabiBERT produce identical token counts on every sentence in this benchmark (same vocabulary size 50,176; same total 224). Exact agreement across all 21 sentences is uncommon for independently trained BPE tokenizers. The finding is reported here without further interpretation.

Extended benchmark: 104 sentences (21 core + 83 hard/edgy)

The same tokenizers were run on an extended set: the 21 core sentences above plus 83 “hard” sentences (long agglutination, legal/medical/financial phrasing, colloquial/slang, numbers and dates, code snippets, punctuation and abbreviations, loanwords, case/diacritic edge cases). All counts from benchmark_tokenizers.py.

Tokenizer | Total tokens (104 sentences) | vs best
64k v3 | 1,041 | baseline (best)
48k v3 | 1,073 | +3.1%
32k v2 | 1,163 | +11.7%
16k v1 | 1,359 | +30.5%
Kumru | 1,198 | +15.1%
TabiBERT | 1,198 | +15.1%
Hamza | 2,451 | +135.4%

64K remains best on the extended set; Kumru and TabiBERT again match each other (1,198). The hard set includes e.g. Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine, legal (HMK 353, tahkim), medical (pankreatikoduodenektomi, kardiyovasküler), financial (BIST 100, SPK), slang (N'olcak, bişey), code (return {'key': value}), and loanwords (Startup'lar, API endpoint'i).

192
THIS WORK (64K v3)
224
KUMRU / TABIBERT (50K)
568
HAMZA (GPT-2 TOKENIZER)
527
GPT-4 (100K)
Result: On the same 21-sentence set, the 64K tokenizer uses 192 tokens versus 224 for Kumru and TabiBERT (~14% fewer) and 568 for Hamza (~66% fewer). Hamza’s tokenizer has vocabulary size 50,257 (identical to GPT-2) and is not optimized for Turkish. The 64K tokenizer matches or outperforms GPT-4/GPT-4o on code and shows substantial gains on legal, medical, and news domains. Kumru and TabiBERT perform similarly to each other; both use ~50K BPE.

8. DOMAIN-SPECIFIC ANALYSIS

Domain-targeted training data produces measurable improvements in specialized vocabulary handling. Below are token-level comparisons for each new domain.

Legal Turkish

Test sentence: "Anayasa Mahkemesi başvuruyu oybirliğiyle reddetti."

64k v3 (6 tokens): Anayasa | Mahkemesi | başvuruyu | oybirliğiyle | reddetti | .
Kumru (10 tokens): Anayasa | Mahkemesi | başvur | uyu | oy | bir | liğiyle | reddet | ti | .
TabiBERT (10 tokens): same as Kumru
Hamza (23 tokens): GPT-2 tokenizer
GPT-4 (20 tokens): fragmented into sub-word pieces

başvuruyu (the application) and oybirliğiyle (unanimously) are each single tokens in 64K. Kumru and TabiBERT fragment the first into 2 pieces and the second into 3 pieces. reddetti (rejected) is also a single token — Kumru/TabiBERT need 2 (reddet | ti). Result: 6 vs 10 (Kumru/TabiBERT), 6 vs 23 (Hamza).

Medical Turkish

Test sentence: "Hastanın ameliyat sonrası komplikasyon riski değerlendirilmelidir."

64k v3 (7 tokens): Hastanın | ameliyat | sonrası | komplikasyon | riski | değerlendirilmelidir | .
Kumru (7 tokens): Hastanın | ameliyat | sonrası | komplikasyon | riski | değerlendirilmelidir | .
TabiBERT (7 tokens): same as Kumru
Hamza (30 tokens): GPT-2 tokenizer
GPT-4 (26 tokens): fragmented into sub-word pieces

Hastanın (of the patient) is a single token. değerlendirilmelidir (must be evaluated) — a 6-morpheme suffix chain — is also a single token. Kumru/TabiBERT tie at 7; Hamza 30; GPT-4 26.

Financial Turkish

Test sentence: "Şirketin halka arz sürecinde sermaye piyasası kurulu onayı gerekmektedir."

64k v3 (11 tokens): Şirket | in | halka | arz | sürecinde | sermaye | piyasası | kurulu | onayı | gerekmektedir | .
Kumru (11 tokens): Şirket | in | halka | arz | sürecinde | sermaye | piyasası | kurulu | onayı | gerekmektedir | .
TabiBERT (11 tokens): same as Kumru
Hamza (36 tokens): GPT-2 tokenizer
GPT-4 (30 tokens): fragmented into sub-word pieces

News/Journalism Turkish

Test sentence: "Cumhurbaşkanlığı Sözcüsü basın toplantısında açıklamalarda bulundu."

64k v3 (7 tokens): Cumhurbaşkanlığı | Sözcüsü | basın | toplantısında | açıklamalarda | bulundu | .
Kumru (8 tokens): Cumhurbaşkanlığı | Sözc | üsü | basın | toplantısında | açıklamalarda | bulundu | .
TabiBERT (8 tokens): same as Kumru
Hamza (39 tokens): GPT-2 tokenizer
GPT-4 (29 tokens): fragmented into sub-word pieces

Cumhurbaşkanlığı (Presidency) and Sözcüsü (Spokesperson) are each single tokens in 64K. Kumru and TabiBERT split Sözcüsü into 2 pieces; Hamza 39 tokens. The extra vocabulary capacity allows 64K to capture these high-frequency institutional terms as atomic units.

9. MORPHOLOGICAL ANALYSIS

The tokenizer learned Turkish morphology from pure statistics — no linguistic rules were programmed. BPE naturally discovered morpheme-like boundaries through frequency analysis of 22GB of text.
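The mechanism behind this can be sketched with a minimal greedy BPE trainer (toy corpus and frequencies invented purely for illustration; the actual tokenizer was trained with the HuggingFace Rust trainer on 22 GB): frequent inflected forms collapse into single tokens from pair statistics alone, while rare words stay fragmented.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens, freq in vocab.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = {}
        for tokens, freq in vocab.items():
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return vocab, merges

# Toy corpus: invented frequencies for inflected forms of "ev" (house),
# plus one rare word that never earns its own merges.
corpus = {"evde": 50, "evler": 40, "evden": 30, "eve": 20, "gelde": 5}
vocab, merges = train_bpe(corpus, 8)
print(sorted(vocab))  # the frequent ev-forms end up as single tokens
```

After eight merges the high-frequency forms evde, evler, evden, and eve are each a single token, with no morphological rules anywhere in the code; the rare gelde remains in pieces. Scaled up to 22 GB, the same statistics yield the morpheme-like boundaries shown below.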

Verb morphology (learned, not coded)

geliyorum → gel | iyorum (stem + present continuous, 1st person singular)
geldim → gel | dim (stem + past tense, 1st person singular)
gelecek → gelecek (single token; very common word)
gelmiş → gelmiş (single token; common evidential)
geliyoruz → geliyoruz (single token; common 1st person plural)

Noun case suffixes

ev (house) → ev (1 token)
evde (in the house) → evde (1 token)
evden (from the house) → evden (1 token)
eve (to the house) → eve (1 token)
evin (of the house) → evin (1 token)
evler (houses) → evler (1 token)

Six different grammatical forms of "ev" — all encoded as single tokens.

Suffix chain handling

değerlendirilmelidir (must be evaluated) → değerlendirilmelidir (1 token)
larımızdan (from our ...s) → larımızdan (1 token)
gidebilirsiniz (you can go) → gidebilirsiniz (1 token)
oybirliğiyle (unanimously) → oybirliğiyle (1 token)

10. DIACRITIC ROBUSTNESS

Turkish users sometimes type without diacritics (c instead of ç, s instead of ş, i instead of ı). The tokenizer handles both forms, but correct Turkish is significantly more token-efficient — by design.

Correct Turkish | Tokens | Without Diacritics | Tokens | Extra Cost
şehir | 1 | sehir | 3 | +2
büyükşehir | 2 | buyuksehir | 6 | +4
Türkiye | 1 | Turkiye | 2 | +1
öğrenci | 1 | ogrenci | 3 | +2
günaydın | 2 | gunaydin | 3 | +1
Design choice: Diacritic-free Turkish was deliberately excluded from the training corpus. This ensures that input without diacritics is still processable (nothing breaks — byte-level BPE can represent anything) but costs more tokens. The model trained on this tokenizer will accept sloppy input while always producing correct Turkish in its output — because correct Turkish is all it was trained on.

11. CONTEXT WINDOW: THE COMPOUNDING ADVANTAGE

Tokenizer efficiency is not a fixed saving — it is a multiplier on context length. The longer the context window, the more the advantage compounds. This has direct architectural implications for the target model.

Effective context capacity

At each context length, the 64K tokenizer holds significantly more Turkish text than competitors could fit in the same number of token slots:

Context Length | This Work (64K) | Kumru Equivalent | GPT-4 Equivalent | Extra Text Capacity
2,048 tokens | 2,048 | ~2,387 | ~5,627 | +339 vs Kumru
4,096 tokens | 4,096 | ~4,773 | ~11,253 | +677 vs Kumru
32,768 tokens | 32,768 | ~38,187 | ~90,027 | +5,419 vs Kumru
128,000 tokens | 128,000 | ~149,333 | ~351,667 | +21,333 vs Kumru

“Kumru Equivalent” = how many Kumru tokens would be needed to hold the same amount of Turkish text. Calculated from the efficiency gaps measured in Section 7.
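The Kumru-equivalent figures follow from the 21-sentence ratio in Section 7 (224 Kumru tokens vs 192 for this work); a quick check of the 128K row:

```python
# Efficiency ratio measured on the 21-sentence benchmark (Section 7).
KUMRU_TOKENS, THIS_WORK_TOKENS = 224, 192
ratio = KUMRU_TOKENS / THIS_WORK_TOKENS  # ~1.167 Kumru tokens per token here

context = 128_000
kumru_equivalent = round(context * ratio)  # Kumru tokens for the same text
extra_capacity = kumru_equivalent - context

print(kumru_equivalent, extra_capacity)  # 149333 21333, matching the table
```

The same multiplication reproduces each row of the Kumru column, which is why the advantage is described as compounding: it is a fixed ratio applied to an ever larger context.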

Architectural implication: small model, massive context

The efficiency advantage fundamentally changes the optimal model architecture for Turkish. Two strategies were considered:

Strategy | Parameters | Context | Turkish Text Capacity | Trainability
Large model, short context | 7B | 4,096 | ~3–4 pages | Requires 40–80 GB VRAM
Small model, long context | 1–2B | 128K | ~entire book | Trainable on consumer hardware
Strategic decision: A 1–2B parameter model with 128K context length is selected as the target architecture. With the 64K tokenizer, this configuration holds Turkish text equivalent to what a Kumru-tokenized model would need ~149K tokens for, or what GPT-4 would need ~352K tokens for. Most existing Turkish language models offer context lengths of 4K–8K tokens. A native-tokenized model with 128K context would be capable of processing entire legal case files, academic theses, or literary works in a single pass.

The embedding layer overhead of 64K vocabulary at 1B scale is approximately 3.3% of total parameters — a negligible cost for a permanent ~14% efficiency advantage over Kumru/TabiBERT on every token processed.
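One way to reproduce the ~3.3% figure (an assumption, since this report does not state a hidden size; 2048 is a common choice at 1–2B scale) is to read it as the cost of the 16K extra embedding rows that a 64K vocabulary adds over a 48K one in a 1B-parameter model:

```python
# Assumed hidden size for a ~1B-parameter model (not stated in the report).
hidden_size = 2048

extra_rows = 64_000 - 48_000             # additional vocabulary slots vs 48K
extra_params = extra_rows * hidden_size  # additional embedding parameters
overhead_pct = extra_params / 1e9 * 100  # share of a 1B-parameter budget

print(f"{extra_params:,} extra embedding parameters = {overhead_pct:.1f}% of 1B")
```

Under this assumption the larger vocabulary costs about 33M parameters, which is the "negligible cost" traded against the permanent per-token efficiency gain.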

12. PRACTICAL IMPLICATIONS

~2.7×
FEWER TOKENS (vs GPT-4, 21 SENT.)
~14%
FEWER TOKENS vs KUMRU/TABIBERT
66%
FEWER TOKENS vs HAMZA (GPT-2)
11
DOMAINS COVERED

Turkish text processed by English-centric tokenizers incurs a roughly 2× token penalty across context length, speed, cost, and training efficiency. A native Turkish tokenizer eliminates this tax entirely.

The tokenizer covers 11 specialized domains (general, academic, legal, medical, financial, education, news, code, literary, reasoning, instructions), ensuring efficient tokenization regardless of subject matter.

13. PROJECT STATUS

Phase | Status | Key Result
Phase 1: Tokenizer | COMPLETE | 64K vocab, ~14% fewer tokens than Kumru/TabiBERT, ~2.7× vs GPT-4, 11 domains
Phase 2: Architecture | NEXT | 1–2B parameters, 128K context target
Phase 3: Pre-training | NEXT | Language learning from Turkish corpus
Phase 4: Fine-tuning | NEXT | Instruction following, chat capability

14. REPRODUCIBILITY

Code, data sources, and trained tokenizers are available.

Conclusion. A purpose-built Turkish tokenizer is a prerequisite for efficient Turkish language modeling. The efficiency penalty imposed by English-centric tokenizers is measurable and eliminable. Three findings are emphasized: (1) the GPT-2 pre-tokenization regex breaks Turkish apostrophe-suffix patterns — an interaction not previously documented; (2) vocabulary saturation, not data redundancy, explains apparent diminishing returns at 48K — with implications for agglutinative languages; and (3) tokenizer efficiency compounds with context length, favoring smaller, long-context architectures. This report documents Phase 1 of an independent effort to build a Turkish language–native LLM from scratch.

A special note of gratitude is owed to Kumru AI: their Turkish LLM’s well-documented limitations in reasoning and Turkish morphology provided the initial motivation to build a proper Turkish language model from scratch. Hamza (emrecanacikgoz) and TabiBERT (boun-tabilab) tokenizers were also compared; see Section 7 and benchmark_tokenizers.py.

© 2026 • Independent Research