
BUILDING A TURKISH LLM FROM SCRATCH

Project Context, Journey & What We Actually Discovered Along the Way

February 2026 • Independent Research • Living Document

5 phases planned • 1 phase complete • 100M–4B parameter range • 128K target context • 64K tokenizer vocab
What is this? This is not a paper. This is a living record of building a language-native Turkish LLM from the ground up — every decision, every mistake, every “aha” moment. It started as a tokenizer project and evolved into something deeper: a journey through the full stack of modern AI, from byte-pair encoding to Nietzsche. The goal is not obsessively creating the “best” LLM — it’s the deep understanding gained by building every layer from scratch. The path teaches more than the destination.

TABLE OF CONTENTS

1. Why This Exists
2. The North Star: Reasoning
3. The Journey So Far
4. Phase 1: The Tokenizer
5. The Door the Tokenizer Opened
6. How It Actually Works
7. What Architecture Taught Us
8. Design Philosophy: Less Is More
9. Important Decisions (Locked)
10. Data Strategy
11. What’s Next
12. Repo Snapshot

1. WHY THIS EXISTS

Every major LLM is built on an English-centric foundation. When Turkish text passes through GPT-4’s tokenizer, it costs roughly 2.7× more tokens than it should. Turkish’s agglutinative morphology — where meaning is packed into chains of suffixes — is alien to tokenizers trained on English.

Existing Turkish LLMs (Kumru, Hamza, LlamaTurk, TURNA, and work from Boğaziçi and ODTÜ) represent serious efforts with meaningful results. Some train custom tokenizers, some build from scratch, some extend multilingual bases. After examining them closely, the honest takeaway is: good work exists, but each makes different tradeoffs — and none of them gave us the full-stack understanding we were after. We wanted to build every layer ourselves, not because existing work is bad (Kumru is the exception: in our testing it is fundamentally broken), but because the process of building is where the learning happens.

Motivation: Not to compete with GPT-4 or Claude. But to understand — deeply, mechanically — how these systems work by building one. Every phase cracks your brain open a little more. The tokenizer phase alone taught more about information representation than any course could. Architecture taught what “reasoning” really means (and doesn’t). What comes next will teach even more.

2. THE NORTH STAR: REASONING

The primary goal is not knowledge coverage, not chat fluency, not benchmark scores. It is reasoning: extreme logical capability. The bar is this:

Even if the model doesn’t know many facts, it must reason correctly about whatever it does know. Facts can be retrieved; reasoning structure cannot.

Key distinction: “Learning reasoning” vs “Acting like reasoning.”

From prior fine-tuning experience: putting intentional mistake-then-correction patterns in SFT data consistently made results worse than the base model. The model doesn’t learn to “catch mistakes” — it learns to produce mistakes, because SFT teaches “output should look like this.”

Genuine reasoning comes from RL (RLVR) — reinforcement learning with verifiable rewards. The model generates its own answers, gets rewarded only for correct final answers, and discovers effective reasoning strategies through trial and error. SFT teaches format. RLVR teaches thinking. That’s the difference between acting and learning.

3. THE JOURNEY SO FAR

Roadmap

PHASE 1
Tokenizer
PHASE 2
Architecture
PHASE 3
Pretraining
PHASE 4
SFT
PHASE 5
RLVR
Phase | Status | What It Teaches | Scope
1. Tokenizer | COMPLETE | Information representation, morphology, data scaling | 64K BPE, 22 GB corpus, 11 domains, 104-sentence benchmark
2. Architecture | NEXT | How computation becomes reasoning | 100M–4B params, 128K context, decoder-only, reasoning-first
3. Pretraining (the actual “training”) | PENDING | What “knowledge” really means | Next-token prediction on Turkish corpus (teacher forcing)
4. SFT (fine-tuning) | PENDING | Format, not reasoning | Crystal-clear instruction data only. No mistakes.
5. RLVR (advanced training via rewards) | PENDING | What “correct” really means | Math/code/logic problems with verifiable answers

4. PHASE 1: THE TOKENIZER — WHERE EVERYTHING BEGAN

What started as “just build a tokenizer” became a deep exploration of how language is represented as numbers, why English-centric design hurts every other language, and how data and vocabulary interact in surprising ways. (Full tokenizer report →)

~14% fewer tokens than Kumru/TabiBERT • ~2.7× fewer tokens than GPT-4 • 64K vocabulary • 22 GB corpus (27 files, 11 domains)

Three discoveries that changed our understanding

Discovery 1: GPT-2 regex breaks Turkish. The pre-tokenization regex used by GPT-4, Llama 3, and Mistral contains English contraction patterns ('s|'t|'re|'d) that steal the first character of Turkish suffixes. Ankara'dır becomes ["Ankara", "'d", "ır"] instead of ["Ankara", "'", "dır"]. To our knowledge, this interaction was previously undocumented.
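A minimal reproduction of Discovery 1, using Python's stdlib `re`. The pattern below is a simplified stand-in for the GPT-2 pre-tokenizer (the real one uses Unicode property classes via the third-party `regex` library), but it keeps the English contraction alternatives that cause the damage:

```python
import re

# Simplified sketch: only the contraction alternatives plus a generic
# word/punctuation fallback. The full GPT-2 pattern is longer, but the
# 's|'t|'re|'d part is what steals the first letter of Turkish suffixes.
gpt2_like = re.compile(r"'(?:s|t|re|ve|m|ll|d)|\w+|[^\w\s]")
turkish_fix = re.compile(r"\w+|[^\w\s]")  # no contraction special-cases

text = "Ankara'dır"
print(gpt2_like.findall(text))   # ['Ankara', "'d", 'ır']  -- suffix split
print(turkish_fix.findall(text)) # ['Ankara', "'", 'dır']  -- suffix intact
```

Dropping the English contraction special-cases is enough to keep the apostrophe and the suffix separate.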
Discovery 2: Vocabulary saturation, not data saturation. When adding data from 10 GB to 22 GB at 48K vocabulary, improvement was only +0.9% (apparent diminishing returns). But training 64K on the same 22 GB yielded +10.1%. The 48K tokenizer had run out of merge slots — not data. Vocabulary and data must scale together.
Discovery 3: Pure statistics discover morphology. No linguistic rules were programmed. BPE naturally found morpheme-like boundaries from frequency patterns alone. Six grammatical forms of “ev” (house) — ev, evde, evden, eve, evin, evler — are all single tokens. değerlendirilmelidir (a 6-morpheme suffix chain meaning “must be evaluated”) is one token.
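Discovery 3's single-token claim can be illustrated with a greedy longest-match encoder over an invented vocabulary fragment. The token ids are made up, and a real BPE tokenizer applies learned merges instead of longest-match, but the outcome for these forms is the same: one token each.

```python
# Hypothetical fragment of a trained vocabulary (ids invented for illustration).
vocab = {"ev": 101, "evde": 102, "evden": 103, "eve": 104,
         "evin": 105, "evler": 106, "değerlendirilmelidir": 9001}

def encode_greedy(text):
    """Greedy longest-match sketch: always take the longest vocab entry
    that matches at the current position."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise KeyError(f"no token covers {text[i]!r}")
    return ids

print(encode_greedy("evde"))                  # [102] -- one token
print(encode_greedy("değerlendirilmelidir"))  # [9001] -- one token
```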

The tokenizer phase taught us: representation is everything. Before a model can reason about Turkish, it must be able to efficiently read and write it. A bad tokenizer is like trying to think through a straw — you can still get some signal through, but you’re wasting most of your capacity on the bottleneck.

Artifacts: EN report • TR report • benchmark_tokenizers.py (104 sentences) • train_tokenizer.py • tokenizers/turkish_bpe_64k/

5. THE DOOR THE TOKENIZER OPENED

The tokenizer phase gave us a working 64K Turkish BPE. But the deeper gift was something nobody expected: a complete shift in how we see AI, language, and the industry itself. This is the most important thing we’ve learned in this entire project so far.

What is a tokenizer, really?

Strip away the jargon. A tokenizer does one thing: it converts structured input into a sequence of numbers. We happened to use Turkish text as input. But nothing in the algorithm requires that input to be human language.

When we trained our 64K BPE, the algorithm didn’t “know” it was processing Turkish. It saw byte sequences, found frequently co-occurring patterns, and merged them into tokens. The output was a mapping: input patterns → integer IDs. That’s it. The algorithm doesn’t care whether those patterns are Turkish suffixes, musical notes, or chemical bonds.
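The training loop itself fits in a few lines. This toy sketch repeatedly merges the most frequent adjacent pair, exactly as described, and consults nothing language-specific. Running it on a handful of inflected forms of "ev" (a toy corpus, not our real 22 GB one) surfaces "ev" as the very first merge, from frequency alone:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Tiny BPE sketch: repeatedly merge the most frequent adjacent pair.
    The algorithm never 'knows' the input is Turkish; it only counts."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        for s in seqs:  # apply the merge everywhere it occurs
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

corpus = ["ev", "evde", "evden", "eve", "evin", "evler", "evde", "eve"]
merges = bpe_train(corpus, 3)
print(merges)  # first merge is 'ev' -- a morpheme, found by counting
```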

Once you truly internalize this, a door opens that never closes.

The question that changed everything

What IS a “language”?

We kept saying “language model.” But what makes something a language? It’s any system with a vocabulary and grammar — a set of elements and rules for how they combine. Human language is one example. It is not the only one. It is not even the most important one for practical AI.

Music is a language. Notes are the vocabulary. Chord progressions, scales, rhythm patterns, key signatures — these are the grammar. A melody is a “sentence.” A symphony is a “document.” A “tokenizer” for music maps note events (pitch, duration, velocity, chord) to integer IDs. A transformer trained on those token sequences learns: after this chord progression, this resolution is likely. After this rhythmic pattern, this variation follows. The transformer doesn’t know it’s “making music.” It is predicting the next token — exactly as it does with Turkish words.

Proteins are a language. Amino acids are the vocabulary — just 20 base characters. Proteins are “sentences”: sequences that fold into 3D structures governed by physical rules. The “grammar” dictates which sequences form alpha helices, which form beta sheets, which combinations bind to specific receptors. A transformer trained on protein sequences learns this grammar — not because it understands biology, but because it finds statistical patterns in token sequences. This is literally how AlphaFold-class models work.

Chemical formulas are a language. SMILES notation encodes molecular structures as text strings. Atoms and bonds are the vocabulary. Valence rules, ring structures, functional groups — these are the grammar. A “tokenizer” maps chemical symbols to integers. The transformer learns: after this molecular fragment, this binding property is likely. Drug discovery models already work this way.

DNA is a language. Four nucleotides — A, T, C, G — that’s the entire vocabulary. Codon triplets encode amino acids. Regulatory regions control gene expression. Genomic models tokenize these sequences and learn to predict mutations, gene function, even disease risk. A vocabulary of 4, grammar encoded by billions of years of evolution.

A factory production line is a language. Material codes, machine settings, environmental conditions, test results — these form sequences with causal structure. The “vocabulary” might be 500–2000 tokens. The “grammar” is the physical causality: PVC_compound_A + temp_175 + speed_15 → tensile_PASS + shore_85. A 50-million-parameter model can learn to predict production outcomes before a single meter of cable is manufactured — saving material, energy, and time.

Critical clarification: these models do NOT “talk.” This is where most people get confused. A music model does not chat about music in English. Its input tokens are notes — literal pitch/duration/velocity values. Its output tokens are notes. It has never seen a single word of human language. A protein model does not describe proteins in sentences. Its tokens are amino acid codes: M A L W K …. A factory model does not answer “what temperature should I use?” in Turkish. Its tokens are temp_175 speed_15 and it outputs tensile_PASS.

This is not fine-tuning a chatbot on domain text. That would still be an LLM talking about the domain in human language. This is a fundamentally different thing: the model’s entire vocabulary, grammar, and thought process exist within the domain notation itself. No human language involved. That is why they can be so small and so accurate.
The moment it clicked: We didn’t just build a Turkish tokenizer. We learned what tokenization fundamentally is. And once you see it, you can’t unsee it: every domain with sequential structure is a “language” waiting for its own tokenizer and its own model. Not an LLM. Not a chatbot. A purpose-built sequence predictor.

The five-step chain that opens every door

  1. A tokenizer is just: structured patterns → numbers
  2. A transformer is just: learn to predict the next number given previous numbers
  3. “Language model” is just what we call it when those numbers happen to represent words
  4. ANY sequential structured data can be tokenized
  5. Therefore: the transformer is a universal sequence learner, not a “language” model
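Steps 1 and 3 of the chain are nothing but a lookup table in code. The same trivial class serves any domain; only the vocabulary changes (both vocabularies below are invented toys):

```python
class Tokenizer:
    """A tokenizer is a table: pattern <-> integer. Nothing here cares
    whether the patterns are Turkish subwords, notes, or amino acids."""
    def __init__(self, vocab):
        self.id_of = {tok: i for i, tok in enumerate(vocab)}
        self.tok_of = {i: tok for tok, i in self.id_of.items()}

    def encode(self, toks):
        return [self.id_of[t] for t in toks]

    def decode(self, ids):
        return [self.tok_of[i] for i in ids]

turkish = Tokenizer(["Ankara", "'", "dır", "ev", "evde"])
music = Tokenizer(["C_maj", "G_maj", "Am", "F_maj", "quarter"])

print(turkish.encode(["Ankara", "'", "dır"]))       # [0, 1, 2]
print(music.encode(["C_maj", "quarter", "G_maj"]))  # [0, 4, 1]
print(music.decode([2, 4]))                         # ['Am', 'quarter']
```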

Every domain is a language

Domain | “Vocabulary” | “Grammar” | “Sentences” | Model Size
Human language | Words, subwords (64K BPE) | Syntax, semantics, pragmatics | Paragraphs, articles, books | Billions (open-ended)
Music | Notes, chords, rests, dynamics | Harmony, rhythm, key, form | Melodies, progressions, pieces | Hundreds of millions
Proteins | 20 amino acids | Folding rules, binding affinities | Protein chains | Millions to low billions
Chemistry (SMILES) | Atoms, bonds, ring markers | Valence, stability, reactivity | Molecular structures | Hundreds of millions
Code | Keywords, operators, identifiers | Syntax rules, type systems | Functions, programs | Hundreds of millions–billions
DNA / Genomics | 4 nucleotides (A, T, C, G) | Codon rules, regulatory patterns | Gene sequences | Millions–hundreds of millions
Cable factory | Material codes, machine settings | Input → output causality | Production runs | 10–50M
Any factory / lab / clinic | Domain-specific codes | Domain-specific causal rules | Process records | 10–100M
“Talking” is the HARDEST application. Look at the table. Human language needs billions of parameters because it is ambiguous, open-ended, culturally dependent, and requires vast world knowledge. Every other domain is simpler: smaller vocabularies, clearer rules, measurable correctness. The industry obsesses over the hardest case and ignores the enormous value sitting in every structured dataset on Earth.

The cable factory — a concrete example

This isn’t hypothetical. Every cable factory generates sequences like this every day: material codes, machine settings, environmental conditions, and test results, e.g. PVC_compound_A + temp_175 + speed_15 → tensile_PASS + shore_85.

This model would be more accurate than GPT-4 for this specific task, orders of magnitude cheaper, runs on a laptop, keeps your proprietary data private, and was built using the exact same skills we’re learning by building a Turkish LLM: tokenizer design, architecture selection, training pipeline optimization.

What this means: the doors that opened

The moment we understood this, the project’s scope transformed from “build one Turkish LLM” to “learn to build any sequence model for any domain.”

The LLM is the hard path that teaches everything. We chose to build the hardest kind of sequence model — one that processes human language. Along the way, we learn tokenizer design, architecture choices, training dynamics, data strategy, evaluation methodology. Every single one of these skills transfers directly to building any domain-specific model. The Turkish LLM is not the destination. It’s the training ground. The real prize is the understanding, and that understanding has no ceiling.

The trap: building gods instead of tools

“Maybe the human mind seeks a god again — the one Nietzsche declared dead.”
— from our architecture discussion, on the industry’s obsession with building one omniscient AI

The AI industry is pouring billions into building an omniscient conversational entity — a digital god that answers everything through natural language. Every problem becomes “talk to the AI.”

But now we see clearly: most valuable real-world problems don’t need conversation. They need prediction, pattern recognition, optimization. The “talking” layer is expensive overhead when your actual need is “will this cable pass the tensile test?”

Building a full LLM when you need a domain predictor is like building a 747 when you need a bicycle. The bicycle is simpler, cheaper, and gets you where you’re going faster — if you’re going to the shop.

The orchestra vision

The future is not one massive model. It is orchestration: multiple small, specialized models working together, each optimal for its domain.

PLANNER
Routes requests
REASONER
Logic & decomposition
SPECIALIST
Domain knowledge
TOOLS
Calculator, code, search

We are building the reasoner. The specialists can be the factory models, the medical models, the financial models — each tiny, each accurate, each built with the same skills we are learning right now.

6. HOW IT ACTUALLY WORKS: STEP BY STEP, DOMAIN BY DOMAIN

Section 5 claimed every domain is a language and every sequence can be tokenized. That might still feel abstract. So let’s make it concrete. Below are mock walkthroughs showing exactly what happens inside the machine — from raw input to final output — for five different domains. The process is identical every time. Only the tokens change.

① Language model (Turkish LLM)

User asks: Ankara'nın nüfusu kaçtır? (“What is the population of Ankara?”)

Step 1 — Tokenize (text → numbers). The tokenizer looks up each piece in its 64K vocabulary:
“Ankara” → 3847  | “'nın” → 129  | “nüfusu” → 8412  | “kaçtır” → 5903  | “?” → 30
The model receives: [3847, 129, 8412, 5903, 30]. It has no idea these are Turkish words. It sees five integers.

Step 2 — Model processes (numbers → numbers). The transformer takes those 5 integers, converts each to a 2048-dimensional vector, passes them through 22 layers of attention and feed-forward networks. At the end, it outputs a probability distribution over all 64,000 tokens: “which token is most likely next?” It picks token 11297.

Step 3 — Detokenize (numbers → text). The tokenizer looks up 11297 in its vocabulary: 11297 → “Yaklaşık”. This is appended to the output.

Step 4 — Repeat. Now the model sees [3847, 129, 8412, 5903, 30, 11297] and predicts the next token. Then the next. Then the next. Token by token, the answer builds up:
11297 → “Yaklaşık” | 642 → “5” | 1830 → “milyon” | 7741 → “kişidir” | 4 → “.”

Final output: Yaklaşık 5 milyon kişidir. (“Roughly 5 million people.”)

Bonus — what if we ask: Ankara’nın başkenti nedir? (“What is the capital of Ankara?” — a trick question, since Ankara is itself Turkey’s capital)
The tokenizer finds every word in its 64K vocabulary. The model processes the token sequence and generates an answer token by token: “Ankara bir başkenttir, bir ilin başkenti değildir.” (“Ankara is a capital, not the capital of a province.”) It works. This model was built for Turkish text. Turkish words are its native tokens. Conversation is literally what it was trained to do.
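Steps 1 through 3 of the walkthrough can be sketched in miniature. Toy sizes and random weights replace the trained model, and the attention/feed-forward stack is elided, so this shows shapes and data flow rather than real predictions:

```python
import random
random.seed(0)

# Toy sizes so this runs instantly; the article's model uses
# vocab=64_000 and dim=2048 with 22 transformer layers.
vocab, dim = 50, 8

# Random stand-ins for the learned embedding table and output projection.
emb = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
out_proj = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]

def next_token(token_ids):
    # Step 1 (done upstream): text is already integers.
    # Embedding lookup: each integer becomes a dim-sized vector.
    vectors = [emb[t] for t in token_ids]
    # A real model runs attention + feed-forward layers here;
    # this sketch just takes the last position's vector.
    h = vectors[-1]
    # One score (logit) per vocabulary entry...
    logits = [sum(h[i] * out_proj[v][i] for i in range(dim))
              for v in range(vocab)]
    # ...and the highest-scoring token id is the prediction (step 2).
    return max(range(vocab), key=lambda v: logits[v])

predicted = next_token([3, 1, 4, 1, 5])
print(predicted)  # an id in 0..49; step 3 would look it up in the vocab
```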

② Music model

Context: A model trained on thousands of MIDI sequences. Vocabulary: ~2000 tokens (note pitches, durations, velocities, chords, rests).

Step 1 — Tokenize (notes → numbers). A chord progression is encoded:
“C_maj” → 42  | “quarter” → 7  | “G_maj” → 58  | “quarter” → 7  | “Am” → 51  | “quarter” → 7  | “F_maj” → 47  | “quarter” → 7
Model receives: [42, 7, 58, 7, 51, 7, 47, 7]. No words. No language. Just integers representing a I–V–vi–IV progression.

Step 2 — Model processes. Transformer predicts: after this progression, token 42 is most likely next.

Step 3 — Detokenize (numbers → notes). 42 → “C_maj”. The progression resolves back to the tonic.

Step 4 — Repeat. Next token: 12 → “half” (half note duration). Then: 71 → “E4” (melody note). Token by token, a melody is composed.

No words were involved at any step. The model “speaks” music. Its vocabulary is notes. Its output is a playable MIDI sequence.

Bonus — what if we ask: Ankara’nın başkenti nedir?
Step 1 crashes immediately. The tokenizer tries to look up “Ankara” in its vocabulary. Its vocabulary contains C_maj, quarter, E4, rest — notes, durations, chords. No Turkish words. No words of any language. “Ankara” does not exist. “Başkent” does not exist. “Nedir” does not exist. The input cannot even be converted to numbers. There is nothing to feed the model. It is like trying to insert a Turkish sentence into a piano roll. Not a wrong answer — no answer is possible. The model has never seen a word. It does not know what a word is. It does not know what a question is. It does not know what “conversation” means.

③ Protein model

Context: A model trained on millions of known protein sequences. Vocabulary: 25 tokens (20 amino acids + start/end/padding/unknown/mask).

Step 1 — Tokenize (amino acids → numbers). A protein fragment:
“M” → 1  | “A” → 5  | “L” → 10  | “W” → 17  | “K” → 9  | “L” → 10  | “P” → 12
Model receives: [1, 5, 10, 17, 9, 10, 12]. No English. No Turkish. Just amino acid IDs.

Step 2 — Model processes. Given this sequence, the transformer predicts the next amino acid. It outputs a distribution over 25 tokens. Highest probability: token 4.

Step 3 — Detokenize (numbers → amino acids). 4 → “V” (Valine). The protein chain grows.

Step 4 — Repeat. The model continues until it predicts the “END” token. The output is a complete protein sequence that can be analyzed for folding, binding, or function.

Vocabulary: 25 tokens. No human language. Just biochemistry as a sequence.

Bonus — what if we ask: Ankara’nın başkenti nedir?
Step 1 crashes. The tokenizer’s entire vocabulary is: M, A, L, W, K, P, V, G, I, F, Y, C, H, R, N, D, E, Q, S, T, START, END, PAD, UNK, MASK. Twenty-five tokens. All amino acids. “Ankara”? The tokenizer might match individual letters — A, n, k, a, r, a — but “n” is not an amino acid. “k” is not an amino acid. Most characters map to UNK (unknown). The model receives a string of unknowns and random amino acid matches: [UNK, 5, UNK, UNK, UNK, 5, UNK, UNK, UNK...]. If forced to run, it will output a random protein fragment — not an answer, not a sentence, just meaningless amino acid noise. It has no concept of language, questions, or communication.

④ Cable factory model

Context: A model trained on 50,000 production records. Vocabulary: ~800 tokens (material codes, machine settings, test results).

Step 1 — Tokenize (production data → numbers). An engineer enters a new production setup:
“[MATERIAL]” → 1  | “PVC_A7” → 34  | “[TEMP]” → 2  | “175” → 412  | “[SPEED]” → 3  | “15” → 287  | “[PRESSURE]” → 4  | “8” → 193  | “[PREDICT]” → 5
Model receives: [1, 34, 2, 412, 3, 287, 4, 193, 5]. Not a sentence. A structured production specification.

Step 2 — Model processes. Transformer outputs token 601.

Step 3 — Detokenize (numbers → results). 601 → “tensile_PASS”.

Step 4 — Repeat. Next tokens: 622 → “elongation_420” | 709 → “flame_V0” | 685 → “shore_85”.

Final output: tensile_PASS elongation_420 flame_V0 shore_85
The engineer now knows — before manufacturing — that this setup will pass all tests. 800 tokens. 10M parameters. Runs on a laptop. No human language at any step.

Bonus — what if we ask: Ankara’nın başkenti nedir?
Step 1 crashes. The tokenizer knows: [MATERIAL], PVC_A7, [TEMP], 175, [SPEED], tensile_PASS — 800 tokens, all production codes and test results. Not a single human word. “Ankara” is not a material. “Başkent” is not a machine setting. “Nedir” is not a test result. The input cannot be tokenized. Even if you forced random token mappings, the model would output something like shore_72 elongation_310 flame_V1 — a meaningless production prediction. It has never encountered a human sentence in its entire existence. It doesn’t know humans exist. It knows cables.

⑤ DNA / Genomics model

Context: A model trained on genome sequences. Vocabulary: 7 tokens (A, T, C, G + start/end/unknown).

Step 1 — Tokenize (nucleotides → numbers). A gene fragment:
“A” → 1  | “T” → 2  | “G” → 3  | “C” → 4  | “G” → 3  | “A” → 1  | “T” → 2
Model receives: [1, 2, 3, 4, 3, 1, 2]. Seven numbers. The model doesn’t know what DNA is.

Step 2 — Model processes. Given this context, the transformer predicts: token 4 (C) is most likely next.

Step 3 — Detokenize. 4 → “C”.

Step 4 — Repeat. The model generates the rest of the sequence, which can then be analyzed for gene function, mutation risk, or regulatory patterns.

Vocabulary: 7 tokens. The smallest possible “language.” Same transformer. Same process.

Bonus — what if we ask: Ankara’nın başkenti nedir?
Step 1 crashes. The vocabulary is: A, T, C, G, START, END, UNK. Seven tokens. “Ankara” becomes [A, UNK, UNK, A, UNK, A] — it can only see the letter A because Adenine happens to share that symbol. The rest is unknown. The model would output something like T G C A A T G C — a DNA sequence fragment. Not a word. Not a sentence. A string of nucleotides. It has never seen a human language. It has seven tokens. It cannot even represent the alphabet, let alone form a thought.
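The Bonus failures in every walkthrough come down to one line of code: the vocabulary lookup. A sketch with the seven-token DNA vocabulary (ids invented, matching case-insensitively as the example above assumes) shows the collapse:

```python
DNA_VOCAB = {"A": 1, "T": 2, "C": 3, "G": 4,
             "START": 5, "END": 6, "UNK": 0}

def dna_encode(text):
    # Anything outside the seven-token vocabulary collapses to UNK (0).
    return [DNA_VOCAB.get(ch.upper(), DNA_VOCAB["UNK"]) for ch in text]

print(dna_encode("ATGCGAT"))  # [1, 2, 4, 3, 4, 1, 2] -- a gene fragment
print(dna_encode("Ankara"))   # [1, 0, 0, 1, 0, 1]    -- only 'A' survives
```

The Turkish question never reaches the model as anything but noise; the crash happens at the interface, before any computation.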
See the pattern? Every single example above follows the exact same four steps:

1. Domain input → tokenizer → integer sequence
2. Integer sequence → transformer → predicted next integer
3. Predicted integer → tokenizer (reverse) → domain output
4. Repeat until done
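The four steps, end to end, with hypothetical stand-ins: an invented eight-token factory vocabulary and a hard-coded `predict` table in place of a trained transformer:

```python
# Hypothetical toy vocabulary for the cable-factory example.
VOCAB = ["[MATERIAL]", "PVC_A7", "[TEMP]", "175", "[PREDICT]",
         "tensile_PASS", "shore_85", "[END]"]
ID = {tok: i for i, tok in enumerate(VOCAB)}

def predict(ids):
    """Stand-in for a trained transformer: returns the next token id.
    Here a hard-coded table; a real model computes this from weights."""
    canned = {ID["[PREDICT]"]: ID["tensile_PASS"],
              ID["tensile_PASS"]: ID["shore_85"],
              ID["shore_85"]: ID["[END]"]}
    return canned[ids[-1]]

# Step 1: domain input -> integer sequence
ids = [ID[t] for t in ["[MATERIAL]", "PVC_A7", "[TEMP]", "175", "[PREDICT]"]]
output = []
while True:
    nxt = predict(ids)   # step 2: predicted next integer
    ids.append(nxt)
    tok = VOCAB[nxt]     # step 3: integer -> domain output
    if tok == "[END]":
        break            # step 4: repeat until done
    output.append(tok)

print(" ".join(output))  # tensile_PASS shore_85
```

Swap the vocabulary and the model weights and the identical loop composes music or extends proteins.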

And the Bonus examples reveal something even more important:

A domain-specific model does not “speak.” It does not know what human language is. It does not know what a question is. It does not know what a conversation is. It has never seen a word. When you type Ankara’nın başkenti nedir? into a music model, the input cannot even enter the machine — the tokenizer has no mapping for human words. When you force it into a protein model, you get random amino acids back. When you force it into a factory model, you get cable test results. When you force it into a DNA model, you get nucleotides.

This is the critical distinction: an LLM is just one type of transformer model — one where the tokenizer happens to map human words to numbers, and the training data happens to be human conversations and text. That’s what gives it the ability to “talk.” Remove the word-based tokenizer, train on MIDI files instead of Wikipedia, and you get a model that composes music but couldn’t say “hello” if its life depended on it. The transformer engine is identical. The tokenizer decides what world the model lives in.

People know that LLMs convert words to numbers internally. What they often miss is that domain-specific models don’t convert words to numbers — they were never designed to receive words at all. Their tokenizer speaks a completely different language: notes, amino acids, machine codes, nucleotides. They don’t “know about” their domain through language — they think in their domain’s native tokens, the way an LLM thinks in words.

Now picture the damage the LLM hype causes in practice. A cable factory needs to predict test results for a new material-and-machine configuration. The “AI = LLM” mindset says: build (or buy) a language model. So they start. Phase 1: train a tokenizer on text — weeks. Train the base model on billions of words so it learns to talk — months, hundreds of thousands of dollars in compute. Phase 2: fine-tune it on domain documents — more weeks, more failed runs, more cost. Phase 3: reinforcement learning to improve accuracy — more days, more weeks. And after all of that, what is the actual input to this colossal system? A chat message:
“Hello, the materials are XLPE, CAT113, RAL9100 dye. Machine settings: extruder speed 12, temperature 185, pressure 8. What will the test results be?”

Read that input again. Really read it. You spent months teaching a machine to understand human language, just to type a sentence that is already structured data pretending to be a conversation. The model now has to parse your natural language back into the structured values you already had, hope it doesn’t hallucinate, and produce a natural-language answer that you then have to parse again to extract the actual numbers. You added an entire human-language layer — costing months and fortunes — as a detour around the direct path.

The direct path? A domain tokenizer with 800 tokens. Input: [1, 34, 2, 412, 3, 287, 4, 193, 5]. Output: tensile_PASS elongation_420 flame_V0 shore_85. No conversation. No parsing. No hallucination. 10M parameters. Trained in hours on actual production records. Runs on a laptop. The entire LLM pipeline — months of pretraining, fine-tuning, reinforcement learning, prompt engineering — existed only to add a chat interface on top of what should have been a direct sequence-to-sequence prediction. That is the cost of not understanding tokenization.

This is why understanding tokenizers was the most important first step of our journey. It wasn’t just about Turkish morphology. It was about understanding that the tokenizer is the entire interface between any domain and the machine that learns from it. Change the tokenizer, change the world the model inhabits. The engine stays the same.
A reminder worth repeating.

An LLM is a human-language-domain-specific transformer. Nothing more, nothing less. It is not “artificial intelligence.” It is one application of a sequence-learning architecture to one particular domain: human text. AI is not equal to LLM.

Once tokenization is truly understood, this stops being a semantic argument and becomes an engineering revelation. It is not about “talking Turkish to an English protein model.” A protein model does not talk at all — not in Turkish, not in English, not in any human language. It communicates in amino acid sequences. A factory model communicates in production codes. A music model communicates in notes. These are entirely different modes of communication, as alien to human language as sonar is to speech.

And this is exactly why the current industry obsession with ever-larger LLMs is a dead end for real-world problems. A 500-billion-parameter model that “talks” impressively is spectacular as a demo. But ask it to predict whether a cable will pass a tensile test given specific extrusion parameters, and it will hallucinate a plausible-sounding paragraph that is entirely wrong — because it has never seen a production record. It learned language patterns, not physics. Industry surveys have reported that as many as roughly 95% of enterprise LLM pilots fail to deliver measurable value. The reason is not that the technology is bad. The reason is that the tool is wrong for the job. Companies are trying to solve domain-specific sequence problems with a human-conversation machine — and then wondering why it doesn’t work.

The tragedy is that this failure is often blamed on “AI not being ready,” when in fact AI is ready — just not in the form most people have been sold. A 10-million-parameter domain model with 800 tokens, trained on actual production data, will outperform a trillion-parameter LLM on that domain every single time — at a fraction of the cost, running on a laptop, with no hallucinations, because every token in its vocabulary maps to something real.

The hype conflated “AI” with “chatbot,” and that conflation costs industries billions. Understanding tokenization is the way out. Once you see that the transformer is a universal engine and the tokenizer is a swappable lens, the entire landscape changes. The question is no longer “how do I make the LLM understand my factory?” The question becomes: “what tokenizer does my factory need?”

7. WHAT THE ARCHITECTURE TAUGHT US ABOUT REASONING

If Sections 5 and 6 showed us that the transformer is a universal sequence learner — same four steps, any domain — this section asks: how does a sequence learner develop something that looks like reasoning? Understanding architecture required understanding what “reasoning” actually means inside a neural network — and what it doesn’t. Remember: everything below applies not just to LLMs, but to any sequence model — the same mechanisms that let a language model “reason” about Turkish let a protein model “reason” about folding. (A detailed architecture research page will follow, like the tokenizer report.)

The training pipeline (sequential, not a choice)

PRETRAINING
“Training” — learn language & patterns
SFT
“Fine-tuning” — learn format
RLVR
Reinforcement learning — learn reasoning

These are not alternatives. They’re sequential phases, each teaching fundamentally different things:

Phase | Input | Algorithm | What It Teaches
Pretraining (what people usually call “training”) | Raw text (no QA pairs) | Predict next token at every position | Language, facts, reasoning patterns
SFT (what people usually call “fine-tuning”) | Clean instruction-response pairs | Same (next-token prediction) | How to follow instructions. NOT reasoning.
RLVR (reinforcement learning with verifiable rewards) | Problems with verifiable answers | Generate → verify → reward/penalize | Self-correction, decomposition, genuine reasoning

What generalizes vs what doesn’t

Capability | How Learned | Generalizes?
Facts (“Ankara is capital”) | Memorized from data | No — only knows what it saw
Small arithmetic (2+3=5) | Pattern memorization | Partially (up to ~4–5 digits)
Large arithmetic (234871...+12309...) | Would need precise computation | No — LLMs fail reliably
Logical structure (A→B, B→C ⇒ A→C) | Learns abstract transformation in vector space | Yes — transfers to new content
Problem decomposition | Learns structural pattern | Yes — transfers across domains
Tool use (“this needs a calculator”) | Learns WHEN to delegate | Yes — genuine generalization
Key insight: generalization = learning STRUCTURE, not answers.

The model doesn’t memorize “2+3=5.” It learns the structure of addition from thousands of examples. For small numbers, this works. For large numbers, it fails — because precise multi-digit carry operations exceed what next-token prediction can reliably do. The real generalization is knowing WHAT to do (“this needs a calculator”), not doing the computation itself.
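A sketch of that delegation pattern, with an invented `[CALC]` tool-call marker and a stub in place of the model. The point is that emitting the right tool call generalizes, while digit-by-digit arithmetic does not:

```python
import re

def model_stub(prompt):
    """Stand-in for an LLM that has learned WHEN to delegate: for large
    arithmetic it emits a tool call instead of guessing digits."""
    m = re.fullmatch(r"(\d+)\s*\+\s*(\d+)", prompt)
    if m and max(len(m.group(1)), len(m.group(2))) > 4:
        return f"[CALC] {m.group(1)}+{m.group(2)}"  # delegate to a tool
    return "pattern-matched answer"                 # small case: answer directly

def run_with_tools(prompt):
    out = model_stub(prompt)
    if out.startswith("[CALC] "):
        a, b = out[len("[CALC] "):].split("+")
        return str(int(a) + int(b))  # exact, external computation
    return out

print(run_with_tools("234871 + 12309"))  # 247180
```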

How self-correction works (mechanistically)

An LLM doesn’t “realize” errors the way we do. At each token position, the attention mechanism can attend to all previous tokens. As more context accumulates, inconsistencies become statistically detectable — the probability distribution shifts toward correction tokens. “Backtracking” is not real backtracking: the model generates new tokens that redirect (“wait, that’s wrong…”). The wrong tokens remain in context.

This self-correction ability comes from RL training, not from seeing error-correction patterns in data. RL rewards reasoning chains that self-correct AND reach correct answers. The model discovers that “check your work” is a rewarding strategy.

Q: Is it reasoning, or is it mimicry?

The honest answer: we don’t know. The model learns reasoning patterns from data. When it encounters a new problem, it applies those patterns. Is that “real reasoning” or “sophisticated pattern matching”? The debate is unresolved. Evidence is mixed: models solve novel problems (suggesting generalization beyond mimicry) but also fail on trivially modified versions of problems they ace (suggesting pattern matching).

Our practical answer: the distinction may not matter. What matters is measurable: does the model arrive at correct answers on novel problems? RLVR pushes the model from shallow mimicry toward robust application by rewarding being correct, not merely looking correct.

Q: What is “correct” for a language model?

Domain | What “Correct” Means | Verifiable?
Math | The answer is right (2+2=4) | Yes
Code | It compiles and passes tests | Yes
Logic | Conclusion follows from premises | Mostly yes
General language | Coherent, relevant, preferred by humans | No — subjective
The deeper insight: For general “talking,” there is no absolute correct. But the process of reasoning can be correct even when the answer is subjective. “What do you think about X?” has no right answer — but decomposing the question, considering multiple perspectives, identifying tradeoffs, and reaching a coherent position: that process can be done well or poorly. Logical validity is universal. It works whether you’re doing math, philosophy, law, or cooking. The form of reasoning transfers.

Q: LLM reasoning = search algorithms?

An insight from the architecture discussion: LLM self-correction resembles tree search (explore paths, evaluate, redirect). But with critical differences:

  1. The tree doesn’t exist beforehand — it’s generated token by token
  2. No real backtracking — only forward corrections (“wait, that’s wrong…”)
  3. For general language, there is no “correct” node — only for verifiable domains (math, code, logic)

Research formalizes this as Tree of Thoughts, Process Reward Models, and MCTS for LLMs. The analogy holds structurally but breaks mechanistically. Still, it implies: small models can “search” well if given sufficient thinking budget (extended thinking = bigger search budget).
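The “search budget” point can be made concrete with a toy best-of-N sketch. Everything here is illustrative: `POOL` stands in for the final answers of independently sampled reasoning chains, and the verifier is the exact check that only a verifiable domain provides.

```python
from itertools import cycle, islice

# Toy stand-in for the final answers of sampled reasoning chains;
# in a verifiable domain (17 * 23 = ?) exactly one candidate checks out.
POOL = [389, 401, 391, 393, 407]

def best_of_n(n: int) -> list[int]:
    """Draw n candidate chains and keep those the verifier accepts.
    A bigger n is a bigger search budget over generated paths."""
    candidates = islice(cycle(POOL), n)
    return [c for c in candidates if c == 17 * 23]  # exact verification

print(best_of_n(1))   # []    -> budget too small, no verified path
print(best_of_n(5))   # [391] -> enough budget to hit a correct path
```

With no real backtracking available, extra attempts are just more forward generations — which is why extended thinking budgets help small models most on verifiable tasks.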

8. DESIGN PHILOSOPHY: LESS IS MORE

100M–4B — PARAMETER RANGE (FLEXIBLE)
SCALE IS A DESIGN CHOICE
QUALITY OVER QUANTITY — ALWAYS

We are not committed to a single scale. 100M, 360M, 1B, 2B, 3B, and 4B are all on the table — and “less” does not mean “limited.” The working belief: with an extremely well-optimized architecture and pretraining recipe, smaller models can match or at least approach much bigger ones.

As we discovered in Section 5, this extends far beyond LLMs. Every domain with sequential structure can have its own tiny, accurate model. The world is moving toward specialized models combined into orchestras — and we are positioned to build them.

9. IMPORTANT DECISIONS (LOCKED)

Decision | Choice | Rationale
Tokenizer | 64K BPE v3 (our own) | ~14% better than Kumru/TabiBERT, ~2.7× better than GPT-4
Architecture | Decoder-only | Standard for generative reasoning LLMs; encoder can be separate component
Parameter range | 100M–4B | “Less is more” — optimal architecture can punch above its weight
Context length | 128K tokens | Process entire legal cases, theses, books in one pass
Position encoding | RoPE under question | Prior fine-tuning showed terrible long-context results with RoPE. Prefer ALiBi/learned or a validated fix.
Training pipeline | Pretraining (training) → SFT (fine-tuning) → RLVR (reinforcement learning) | Sequential phases, each teaches different things. Not a choice.
SFT data quality | Crystal-clear only | Confirmed: mistakes in SFT data = model learns to produce mistakes
Literature search | Required before deep decisions | Use arXiv, HF, ACL, not just Google. Avoid overconfident outdated advice.
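Since the position-encoding decision is still open, here is a minimal sketch of the ALiBi candidate: instead of rotating embeddings, each head adds a fixed linear distance penalty to the pre-softmax attention scores. The power-of-two slope recipe below follows the ALiBi paper; the code is an illustration of the mechanism, not our architecture.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """ALiBi head slopes for a power-of-two head count:
    slope_h = 2 ** (-8 * h / n_heads), for h = 1..n_heads."""
    return [2 ** (-8 * h / n_heads) for h in range(1, n_heads + 1)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Per-head additive bias on causal pre-softmax attention scores:
    bias[i][j] = slope * (j - i) for j <= i, i.e. farther keys are
    penalized linearly -- no rotation, no learned position table."""
    return [[slope * (j - i) for j in range(i + 1)] for i in range(seq_len)]

print(alibi_slopes(8))     # [0.5, 0.25, ..., 0.00390625]
print(alibi_bias(0.5, 3))  # [[0.0], [-0.5, 0.0], [-1.0, -0.5, 0.0]]
```

Because the penalty is relative distance only, ALiBi needs no retraining trick to run at sequence lengths longer than those seen in training — the property that motivated putting it on the candidate list.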

10. DATA STRATEGY

This section focuses on data for our Turkish LLM. But through the lens of Section 5: everything described below is a template. Swap “Turkish text” for “protein sequences” or “production logs,” and the same pipeline structure applies — just with a different tokenizer and different domain data.

Pretraining data — “training” (quantity, diverse)

Raw Turkish text — no QA pairs, no formatting. The model reads continuous text and predicts the next token at every position.
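A minimal sketch of what “predicts the next token at every position” means mechanically. The token ids are made up for illustration; the point is the shift-by-one pairing of inputs and targets, with a cross-entropy loss at every position:

```python
# Toy corpus already tokenized to ids (hypothetical ids).
tokens = [5, 17, 42, 42, 9]

# Next-token prediction: at every position, the input is the prefix
# and the target is simply the token that comes next.
inputs, targets = tokens[:-1], tokens[1:]

for pos, (x, y) in enumerate(zip(inputs, targets)):
    print(f"pos {pos}: given ...{x}, predict {y}")
# Training minimizes cross-entropy between the model's predicted
# distribution and each target -- no QA pairs, just raw text.
```

Every line of raw text thus yields as many training signals as it has tokens, which is why pretraining needs quantity where SFT needs quality.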

Source | Purpose
Turkish Wikipedia, news, books, forums | Language structure, grammar, fluency
Legal, medical, scientific, financial text | Domain vocabulary, formal reasoning
Code (Python, etc.) | Logical structure, precise reasoning
Math texts, scientific papers | Reasoning patterns, formal arguments

Language split: 80–90% Turkish, 10–20% English — the English share helps with cross-lingual transfer.

SFT data — “fine-tuning” (quality, clean)

Turkish instruction-response pairs. Clean, no mistakes. Teaches format, not reasoning.

RLVR data — reinforcement learning (verifiable problems)

Math (GSM8K-style, competition math), code problems, logic puzzles. Can be translated to Turkish. Math and logic are language-light — 17 × 23 = ? works in any language. This is where reasoning is actually trained.
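A sketch of what makes this data “verifiable”: GSM8K solutions end, by the dataset’s convention, with a `#### <answer>` line, so a reward function can check correctness exactly. The binary 0/1 reward below is illustrative, not our final reward design:

```python
import re

def extract_answer(completion: str):
    """Pull the final numeric answer from a GSM8K-style solution,
    which by convention ends with '#### <number>'."""
    m = re.search(r"####\s*(-?[\d,]+)", completion)
    return m.group(1).replace(",", "") if m else None

def reward(completion: str, gold: str) -> float:
    """RLVR-style binary reward: 1.0 iff the verifiable answer matches."""
    return 1.0 if extract_answer(completion) == gold else 0.0

chain = "17 * 23: 17*20=340, 17*3=51, 340+51=391\n#### 391"
print(reward(chain, "391"))      # 1.0 -- verified correct
print(reward("#### 390", "391")) # 0.0 -- wrong answer, no reward
```

Note the verifier never inspects the reasoning chain itself — it rewards chains that end correctly, which is exactly how self-checking strategies get reinforced indirectly.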

Why verbal examples are still needed: Pure abstract logic (A→B, B→C ⇒ A→C) is not enough on its own. The model operates on tokens (words/subwords). It needs real-world sentences to learn (1) how to recognize reasoning situations in natural language, (2) how to parse language into components it can reason about, and (3) how to express reasoning in natural language. The abstract logic is a small set of patterns; the verbal data teaches the model to connect those patterns to the real world.

11. WHAT’S NEXT

DONE Phase 1: Tokenizer
64K BPE, 22 GB corpus, 11 domains, GPT-2 regex bug discovered, vocabulary saturation phenomenon documented.
NEXT Phase 2: Architecture
Select base architecture (100M–4B). Resolve position encoding (RoPE vs ALiBi vs learned). Reasoning-first design. Literature search on 2025–2026 SOTA small models.
Phase 3: Pretraining (the actual “training”)
Build Turkish corpus pipeline. Next-token prediction on billions of tokens. Learn language, world knowledge, reasoning patterns. This is what creates the base model from scratch.
Phase 4: SFT — Supervised Fine-Tuning (what people call “fine-tuning”)
Crystal-clear instruction data. Teach the model to follow instructions and converse. Format only, not reasoning. This is the step that turns a raw base model into a chatbot.
Phase 5: RLVR — Reinforcement Learning with Verifiable Rewards
Reward the model for correct answers on math/code/logic. The model discovers genuine reasoning strategies through trial and error. This is where the north star is reached.
LATER Orchestra & Domain Models
Multiple small specialized models working together. Domain-specific models (factory, materials, etc.) using the same skills learned here. The practical payoff.
What we know so far: The tokenizer phase taught about information representation — and then blew the entire project open by revealing that every sequential domain is a language (Section 5) — and Section 6 proved it with concrete step-by-step walkthroughs, domain by domain. The architecture discussion taught what reasoning really is (and isn’t) inside a neural network. Pretraining will teach what “knowledge” means. SFT will teach what “format” means. RLVR will teach what “correct” means. Each phase opens the mind a little more. And every lesson applies not just to our Turkish LLM, but to any sequence model we might build for any domain.

Even if the final model isn’t the best in the world, the person who deeply understands every layer of the stack is more dangerous than the person who trains the biggest model. The biggest model is just money. Understanding is leverage.

12. REPO SNAPSHOT

Path | What It Contains
tokenizers/turkish_bpe_64k/ | Selected tokenizer (64K BPE v3)
tokenizers/turkish_bpe_{16k,32k,48k}_*/ | All experimental versions preserved
tokenizers/kumru_2b_reference/ | Kumru baseline for comparison
data/processed/ | 22 GB training corpus (27 files, 11 domains)
train_tokenizer.py | Tokenizer training script
benchmark_tokenizers.py | 104-sentence benchmark (21 core + 83 hard/edgy)
docs/tokenizer-research.html | Full tokenizer research report (EN)
docs/tokenizer-research_tr.html | Full tokenizer research report (TR)
docs/project-context.html | This file — the journey document
reference_architecture/ | Config examples, literature review, README
PROJECT_CONTEXT.md | Machine-readable project context (for AI sessions)
Final thought. This project started with a simple question: “Can we build a better Turkish tokenizer?” That question led somewhere nobody expected. We learned how language becomes numbers — and then realized everything becomes numbers the same way. Music, proteins, factory data, DNA. The tokenizer wasn’t just a Turkish text tool. It was the universal interface between any domain and a learning machine. That single realization broke the project wide open.

It also broke the biggest illusion in the industry: that AI equals LLM. It doesn’t. An LLM is a human-language-domain-specific transformer — one application of a universal engine to one particular domain. Once you see that, you see why trillion-parameter chatbots fail at factory floors, why 95% of enterprise LLM projects collapse, and why the answer was never “make the LLM bigger.” The answer is: build the right tokenizer for the right domain, and let a tiny model do what a giant one never could. We went from byte-pair encoding to Nietzsche to industrial economics in a single conversation.

The tokenizer opened the first door. Architecture opened the second. There are more doors ahead — pretraining, SFT, RLVR, orchestration, domain models. Each one will teach something that no paper or course can: the understanding that comes from building it yourself, hitting walls, and figuring out why. And every lesson will reinforce the same truth: the transformer is the engine, the tokenizer is the lens, and the world is full of domains waiting for their own small, precise, purpose-built models.

This is a living document. It will grow with every phase completed, every decision made, every insight earned.

© 2026 • Independent Research