BUILDING A TURKISH LLM FROM SCRATCH
Project Context, Journey & What We Actually Discovered Along the Way
1. WHY THIS EXISTS
Every major LLM is built on an English-centric foundation. When Turkish text passes through GPT-4’s tokenizer, it costs roughly 2.7× more tokens than it should. Turkish’s agglutinative morphology — where meaning is packed into chains of suffixes — is alien to tokenizers trained on English.
Existing Turkish LLMs (Kumru, Hamza, LlamaTurk, TURNA, and work from Boğaziçi and ODTÜ) represent serious efforts with meaningful results. Some train custom tokenizers, some build from scratch, some extend multilingual bases. After examining them closely, the honest takeaway is: good work exists, but each makes different tradeoffs — and none of them gave us the full-stack understanding we were after. We wanted to build every layer ourselves, not because existing work is bad (except Kumru, which we consider fundamentally broken), but because the process of building is where the learning happens.
2. THE NORTH STAR: REASONING
The primary goal is not knowledge coverage, not chat fluency, not benchmark scores. It is reasoning — extreme logic capabilities. The model must:
- Understand — parse the input, identify what’s being asked
- Decompose — break the problem into pieces
- Reason step by step — apply logical structure (if A→B and B→C, then A→C)
- Self-check — detect inconsistencies and correct course
- Act like a scientist — not mimic one, but reason like one from the inside
Even if the model doesn’t know many facts, it must reason correctly about whatever it does know. Facts can be retrieved; reasoning structure cannot.
From prior fine-tuning experience: putting intentional mistake-then-correction patterns in SFT data made results always worse than the base model. The model doesn’t learn to “catch mistakes” — it learns to produce mistakes, because SFT teaches “output should look like this.”
Genuine reasoning comes from RL (RLVR) — reinforcement learning with verifiable rewards. The model generates its own answers, gets rewarded only for correct final answers, and discovers effective reasoning strategies through trial and error. SFT teaches format. RLVR teaches thinking. That’s the difference between acting and learning.
3. THE JOURNEY SO FAR
Roadmap: Tokenizer → Architecture → Pretraining → SFT → RLVR
| Phase | Status | What It Teaches | Scope |
|---|---|---|---|
| 1. Tokenizer | COMPLETE | Information representation, morphology, data scaling | 64K BPE, 22 GB corpus, 11 domains, 104-sentence benchmark |
| 2. Architecture | NEXT | How computation becomes reasoning | 100M–4B params, 128K context, decoder-only, reasoning-first |
| 3. Pretraining (the actual “training”) | PENDING | What “knowledge” really means | Next-token prediction on Turkish corpus (teacher forcing) |
| 4. SFT (fine-tuning) | PENDING | Format, not reasoning | Crystal-clear instruction data only. No mistakes. |
| 5. RLVR (advanced training via rewards) | PENDING | What “correct” really means | Math/code/logic problems with verifiable answers |
4. PHASE 1: THE TOKENIZER — WHERE EVERYTHING BEGAN
What started as “just build a tokenizer” became a deep exploration of how language is represented as numbers, why English-centric design hurts every other language, and how data and vocabulary interact in surprising ways. (Full tokenizer report →)
Three discoveries that changed our understanding
- English contraction rules baked into GPT-style pre-tokenizers ('s|'t|'re|'d) steal the first character of Turkish suffixes. Ankara'dır becomes ["Ankara", "'d", "ır"] instead of ["Ankara", "'", "dır"]. To our knowledge, this interaction was previously undocumented.
- Common inflected forms — ev, evde, evden, eve, evin, evler — are all single tokens.
- değerlendirilmelidir (a 6-morpheme suffix chain meaning “must be evaluated”) is one token.
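The first discovery can be reproduced in a few lines of Python. The pattern below is a simplified stand-in for the GPT-2-style pre-tokenizer regex (the real one uses `\p{L}`/`\p{N}` classes via the `regex` module); the contraction alternatives fire before the apostrophe can stay attached to the Turkish suffix:

```python
import re

# Simplified GPT-2-style pre-tokenizer. The English contraction rules
# ('s, 't, 're, 've, 'm, 'll, 'd) are tried before the generic letter run.
PRETOK = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?[^\s\w]+|\s+")

print(PRETOK.findall("Ankara'dır"))
# ['Ankara', "'d", 'ır'] -- the 'd rule steals the first character of -dır

# Same pattern with the contraction alternatives removed:
PRETOK_FIXED = re.compile(r" ?[^\W\d_]+| ?\d+| ?[^\s\w]+|\s+")
print(PRETOK_FIXED.findall("Ankara'dır"))
# ['Ankara', "'", 'dır'] -- the apostrophe stands alone, the suffix survives
```

Dropping the English contraction list is enough to keep Turkish suffixes intact; that is the whole bug in miniature.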
The tokenizer phase taught us: representation is everything. Before a model can reason about Turkish, it must be able to efficiently read and write it. A bad tokenizer is like trying to think through a straw — you can still get some signal through, but you’re wasting most of your capacity on the bottleneck.
Artifacts:
EN report •
TR report •
benchmark_tokenizers.py (104 sentences) •
train_tokenizer.py •
tokenizers/turkish_bpe_64k/
5. THE DOOR THE TOKENIZER OPENED
The tokenizer phase gave us a working 64K Turkish BPE. But the deeper gift was something nobody expected: a complete shift in how we see AI, language, and the industry itself. This is the most important thing we’ve learned in this entire project so far.
What is a tokenizer, really?
Strip away the jargon. A tokenizer does one thing: it converts structured input into a sequence of numbers. We happened to use Turkish text as input. But nothing in the algorithm requires that input to be human language.
When we trained our 64K BPE, the algorithm didn’t “know” it was processing Turkish. It saw byte sequences, found frequently co-occurring patterns, and merged them into tokens. The output was a mapping: input patterns → integer IDs. That’s it. The algorithm doesn’t care whether those patterns are Turkish suffixes, musical notes, or chemical bonds.
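To make “the algorithm didn’t know it was Turkish” concrete, here is a minimal byte-pair-merge step (an illustrative sketch, not our actual training script): it counts adjacent pairs over integer sequences and merges the most frequent one into a new ID, with no notion of what the integers represent.

```python
from collections import Counter

def bpe_merge_step(sequences, next_id):
    """One BPE step: find the most frequent adjacent pair and merge it."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    if not pairs:
        return sequences, None
    best = pairs.most_common(1)[0][0]
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(next_id)  # replace the pair with a fresh token ID
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best

# The algorithm sees only integers -- here, the UTF-8 bytes of "evde evde eve":
data = [list("evde evde eve".encode("utf-8"))]
data, pair = bpe_merge_step(data, next_id=256)
print(pair)  # (101, 118) -- the bytes of "ev", the most frequent adjacent pair
```

Run the step again and again and the merged IDs themselves start pairing up; that is all BPE training is, whether the bytes came from Wikipedia, MIDI files, or a production log.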
Once you truly internalize this, a door opens that never closes.
The question that changed everything
We kept saying “language model.” But what makes something a language? It’s any system with a vocabulary and grammar — a set of elements and rules for how they combine. Human language is one example. It is not the only one. It is not even the most important one for practical AI.
Music is a language. Notes are the vocabulary. Chord progressions, scales, rhythm patterns, key signatures — these are the grammar. A melody is a “sentence.” A symphony is a “document.” A “tokenizer” for music maps note events (pitch, duration, velocity, chord) to integer IDs. A transformer trained on those token sequences learns: after this chord progression, this resolution is likely. After this rhythmic pattern, this variation follows. The transformer doesn’t know it’s “making music.” It is predicting the next token — exactly as it does with Turkish words.
Proteins are a language. Amino acids are the vocabulary — just 20 base characters. Proteins are “sentences”: sequences that fold into 3D structures governed by physical rules. The “grammar” dictates which sequences form alpha helices, which form beta sheets, which combinations bind to specific receptors. A transformer trained on protein sequences learns this grammar — not because it understands biology, but because it finds statistical patterns in token sequences. This is literally how AlphaFold-class models work.
Chemical formulas are a language. SMILES notation encodes molecular structures as text strings. Atoms and bonds are the vocabulary. Valence rules, ring structures, functional groups — these are the grammar. A “tokenizer” maps chemical symbols to integers. The transformer learns: after this molecular fragment, this binding property is likely. Drug discovery models already work this way.
DNA is a language. Four nucleotides — A, T, C, G — that’s the entire vocabulary. Codon triplets encode amino acids. Regulatory regions control gene expression. Genomic models tokenize these sequences and learn to predict mutations, gene function, even disease risk. A vocabulary of 4, grammar encoded by billions of years of evolution.
A factory production line is a language. Material codes, machine settings, environmental
conditions, test results — these form sequences with causal structure. The “vocabulary” might be
500–2000 tokens. The “grammar” is the physical causality: PVC_compound_A + temp_175 +
speed_15 → tensile_PASS + shore_85. A 50-million-parameter model can learn to predict production
outcomes before a single meter of cable is manufactured — saving material, energy, and time.
A protein model does not explain folding in English; its world is M E T H I O N I N E, one amino-acid token at a time. A factory model does not answer “what temperature should I use?” in Turkish. Its tokens are temp_175 and speed_15, and it outputs tensile_PASS. This is not fine-tuning a chatbot on domain text. That would still be an LLM talking about the domain in human language. This is a fundamentally different thing: the model’s entire vocabulary, grammar, and thought process exist within the domain notation itself. No human language involved. That is why these models can be so small and so accurate.
The five-step chain that opens every door
- A tokenizer is just: structured patterns → numbers
- A transformer is just: learn to predict the next number given previous numbers
- “Language model” is just what we call it when those numbers happen to represent words
- ANY sequential structured data can be tokenized
- Therefore: the transformer is a universal sequence learner, not a “language” model
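The five-step chain can be seen in one toy class (hypothetical vocabularies and IDs, purely illustrative): the same lookup tokenizer serves Turkish words and musical events without changing a line.

```python
class SeqTokenizer:
    """Maps any sequence of symbols to integer IDs and back."""
    def __init__(self, vocab):
        self.id_of = {sym: i for i, sym in enumerate(vocab)}
        self.sym_of = {i: sym for sym, i in self.id_of.items()}
    def encode(self, symbols):
        return [self.id_of[s] for s in symbols]
    def decode(self, ids):
        return [self.sym_of[i] for i in ids]

turkish = SeqTokenizer(["ev", "evde", "evden", "kedi"])
music   = SeqTokenizer(["C_maj", "G_maj", "Am", "F_maj", "quarter"])

print(turkish.encode(["evde", "kedi"]))    # [1, 3]
print(music.encode(["C_maj", "quarter"]))  # [0, 4]
# Same class, same methods, different "world".
```

The transformer downstream receives integer lists either way; nothing about the architecture knows which world produced them.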
Every domain is a language
| Domain | “Vocabulary” | “Grammar” | “Sentences” | Model Size |
|---|---|---|---|---|
| Human language | Words, subwords (64K BPE) | Syntax, semantics, pragmatics | Paragraphs, articles, books | Billions (open-ended) |
| Music | Notes, chords, rests, dynamics | Harmony, rhythm, key, form | Melodies, progressions, pieces | Hundreds of millions |
| Proteins | 20 amino acids | Folding rules, binding affinities | Protein chains | Millions to low billions |
| Chemistry (SMILES) | Atoms, bonds, ring markers | Valence, stability, reactivity | Molecular structures | Hundreds of millions |
| Code | Keywords, operators, identifiers | Syntax rules, type systems | Functions, programs | Hundreds of millions–billions |
| DNA / Genomics | 4 nucleotides (A, T, C, G) | Codon rules, regulatory patterns | Gene sequences | Millions–hundreds of millions |
| Cable factory | Material codes, machine settings | Input → output causality | Production runs | 10–50M |
| Any factory / lab / clinic | Domain-specific codes | Domain-specific causal rules | Process records | 10–100M |
The cable factory — a concrete example
This isn’t hypothetical. Every cable factory generates data like this every day:
- Tokenizer vocabulary: ~500–2000 tokens (material codes, machine settings, test result codes)
- Input: [MATERIAL] PVC_compound_A [SETTINGS] temp_175 speed_15 pressure_8
- Output: [RESULTS] tensile_pass elongation_420 flame_V0 shore_hardness_85
- Model size: 10–50M parameters. Trains in hours on a single GPU.
- Value: Predict test results before wasting material on a production run
This model would be more accurate than GPT-4 for this specific task, orders of magnitude cheaper, runs on a laptop, keeps your proprietary data private, and was built using the exact same skills we’re learning by building a Turkish LLM: tokenizer design, architecture selection, training pipeline optimization.
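A sketch of that pipeline, using the vocabulary from the bullet list above, with the trained model replaced by a mock lookup (everything here is illustrative; a real system would train a small transformer on actual production logs):

```python
# Mock factory pipeline: real token IDs would come from a trained tokenizer,
# and the "model" would be a small trained transformer, not a stub.
VOCAB = ["[MATERIAL]", "PVC_compound_A", "[SETTINGS]", "temp_175", "speed_15",
         "pressure_8", "[RESULTS]", "tensile_pass", "elongation_420",
         "flame_V0", "shore_hardness_85"]
ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(tokens):
    return [ID[t] for t in tokens]

def mock_model(ids):
    """Stand-in for a trained 10-50M parameter predictor."""
    return encode(["[RESULTS]", "tensile_pass", "elongation_420",
                   "flame_V0", "shore_hardness_85"])

run = encode(["[MATERIAL]", "PVC_compound_A", "[SETTINGS]",
              "temp_175", "speed_15", "pressure_8"])
pred = [VOCAB[i] for i in mock_model(run)]
print(" ".join(pred))
```

The vocabulary is a few hundred entries, the sequences are short, and every token maps to something physically real on the factory floor.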
What this means: the doors that opened
The moment we understood this, the project’s scope transformed from “build one Turkish LLM” to “learn to build any sequence model for any domain.” The possibilities:
- Every factory, every lab, every hospital, every trading desk has sequential data
- Each could have its own tiny model (10M–100M parameters)
- These models would be more accurate than general LLMs for their specific domain
- Cheaper to train (hours, not months), cheaper to run (laptop, not data center)
- Private — your data never leaves your building
- And we now know how to build them — because the LLM project teaches the entire craft
The trap: building gods instead of tools
The AI industry is pouring billions into building an omniscient conversational entity — a digital god that answers everything through natural language. Every problem becomes “talk to the AI.”
But now we see clearly: most valuable real-world problems don’t need conversation. They need prediction, pattern recognition, optimization. The “talking” layer is expensive overhead when your actual need is “will this cable pass the tensile test?”
Building a full LLM when you need a domain predictor is like building a 747 when you need a bicycle. The bicycle is simpler, cheaper, and gets you where you’re going faster — if you’re going to the shop.
The orchestra vision
The future is not one massive model. It is orchestration: multiple small, specialized models working together, each optimal for its domain.
- Router — routes requests
- Reasoner — logic & decomposition
- Specialists — domain knowledge
- Tools — calculator, code, search
We are building the reasoner. The specialists can be the factory models, the medical models, the financial models — each tiny, each accurate, each built with the same skills we are learning right now.
6. HOW IT ACTUALLY WORKS: STEP BY STEP, DOMAIN BY DOMAIN
Section 5 claimed every domain is a language and every sequence can be tokenized. That might still feel abstract. So let’s make it concrete. Below are mock walkthroughs showing exactly what happens inside the machine — from raw input to final output — for five different domains. The process is identical every time. Only the tokens change.
① Language model (Turkish LLM)
Input: Ankara'nın nüfusu kaçtır?

Step 1 — Tokenize (text → numbers). The tokenizer looks up each piece in its 64K vocabulary:
“Ankara” → 3847 | “'nın” → 129 | “nüfusu” → 8412 | “kaçtır” → 5903 | “?” → 30
The model receives: [3847, 129, 8412, 5903, 30]. It has no idea these are Turkish words. It sees five integers.

Step 2 — Model processes (numbers → numbers). The transformer takes those 5 integers, converts each to a 2048-dimensional vector, passes them through 22 layers of attention and feed-forward networks. At the end, it outputs a probability distribution over all 64,000 tokens: “which token is most likely next?” It picks token 11297.

Step 3 — Detokenize (numbers → text). The tokenizer looks up 11297 in its vocabulary: 11297 → “Yaklaşık”. This is appended to the output.

Step 4 — Repeat. Now the model sees [3847, 129, 8412, 5903, 30, 11297] and predicts the next token. Then the next. Then the next. Token by token, the answer builds up:
11297 → “Yaklaşık” | 642 → “5” | 1830 → “milyon” | 7741 → “kişidir” | 4 → “.”
Final output: Yaklaşık 5 milyon kişidir.

Bonus — what if we ask: Ankara’nın başkenti nedir?
The tokenizer finds every word in its 64K vocabulary. The model processes the token sequence and generates an answer token by token: “Ankara bir başkenttir, bir ilin başkenti değildir.” It works. This model was built for Turkish text. Turkish words are its native tokens. Conversation is literally what it was trained to do.
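The four steps above collapse into one loop. In this sketch the transformer’s forward pass is mocked with a fixed continuation table (token IDs are the ones from the walkthrough); the loop structure is the real point:

```python
# Token IDs from the walkthrough; the "transformer" is mocked as a table.
VOCAB = {3847: "Ankara", 129: "'nın", 8412: "nüfusu", 5903: "kaçtır", 30: "?",
         11297: "Yaklaşık", 642: "5", 1830: "milyon", 7741: "kişidir", 4: "."}
CONTINUATION = [11297, 642, 1830, 7741, 4]  # what the model would predict, in order

def next_token(context, step):
    """Mock forward pass: a real model returns an argmax over 64K logits."""
    return CONTINUATION[step]

context = [3847, 129, 8412, 5903, 30]   # tokenized question
output = []
for step in range(len(CONTINUATION)):   # repeat until end-of-answer
    tok = next_token(context, step)
    context.append(tok)                 # the model always sees its own output
    output.append(VOCAB[tok])           # detokenize

print(" ".join(output))  # Yaklaşık 5 milyon kişidir .
```

Swap the vocabulary and the forward pass, and exactly the same loop generates music, proteins, or production predictions.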
② Music model
Step 1 — Tokenize (notes → numbers). A chord progression is encoded:
“C_maj” → 42 | “quarter” → 7 | “G_maj” → 58 | “quarter” → 7 | “Am” → 51 | “quarter” → 7 | “F_maj” → 47 | “quarter” → 7
The model receives: [42, 7, 58, 7, 51, 7, 47, 7]. No words. No language. Just integers representing a I–V–vi–IV progression.

Step 2 — Model processes. The transformer predicts: after this progression, token 42 is most likely next.

Step 3 — Detokenize (numbers → notes). 42 → “C_maj”. The progression resolves back to the tonic.

Step 4 — Repeat. Next token: 12 → “half” (half-note duration). Then: 71 → “E4” (melody note). Token by token, a melody is composed.

No words were involved at any step. The model “speaks” music. Its vocabulary is notes. Its output is a playable MIDI sequence.
Bonus — what if we ask: Ankara’nın başkenti nedir?
Step 1 crashes immediately. The tokenizer tries to look up “Ankara” in its vocabulary. Its vocabulary contains C_maj, quarter, E4, rest — notes, durations, chords. No Turkish words. No words of any language. “Ankara” does not exist. “Başkent” does not exist. “Nedir” does not exist. The input cannot even be converted to numbers. There is nothing to feed the model. It is like trying to insert a Turkish sentence into a piano roll. Not a wrong answer — no answer is possible. The model has never seen a word. It does not know what a word is. It does not know what a question is. It does not know what “conversation” means.
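The Step 1 crash is literal. With a plain dictionary lookup as the tokenizer (hypothetical note vocabulary), a Turkish word raises an error before any model is ever invoked:

```python
# Hypothetical music vocabulary: note events only, no human words.
MUSIC_VOCAB = {"C_maj": 42, "quarter": 7, "G_maj": 58, "Am": 51,
               "F_maj": 47, "E4": 71, "rest": 3, "half": 12}

def encode(symbols):
    return [MUSIC_VOCAB[s] for s in symbols]  # KeyError on anything non-musical

print(encode(["C_maj", "quarter", "G_maj"]))  # [42, 7, 58]
try:
    encode(["Ankara'nın", "başkenti", "nedir", "?"])
except KeyError as e:
    print("cannot tokenize:", e)  # the word never becomes a number
```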
③ Protein model
Step 1 — Tokenize (amino acids → numbers). A protein fragment:
“M” → 1 | “A” → 5 | “L” → 10 | “W” → 17 | “K” → 9 | “L” → 10 | “P” → 12
The model receives: [1, 5, 10, 17, 9, 10, 12]. No English. No Turkish. Just amino acid IDs.

Step 2 — Model processes. Given this sequence, the transformer predicts the next amino acid. It outputs a distribution over 25 tokens. Highest probability: token 4.

Step 3 — Detokenize (numbers → amino acids). 4 → “V” (Valine). The protein chain grows.

Step 4 — Repeat. The model continues until it predicts the “END” token. The output is a complete protein sequence that can be analyzed for folding, binding, or function.
Vocabulary: 25 tokens. No human language. Just biochemistry as a sequence.
Bonus — what if we ask: Ankara’nın başkenti nedir?
Step 1 crashes. The tokenizer’s entire vocabulary is:
M, A, L, W, K, P, V, G, I, F, Y, C, H, R, N, D, E, Q, S, T, START, END, PAD, UNK, MASK.
Twenty-five tokens. All amino acids. “Ankara”? The tokenizer might match individual letters
— A, n, k, a, r, a — but “n” is not an amino acid. “k” is not an amino acid.
Most characters map to UNK (unknown). The model receives a string of unknowns and random amino acid
matches: [UNK, 5, UNK, UNK, UNK, 5, UNK, UNK, UNK...]. If forced to run, it will output a random
protein fragment — not an answer, not a sentence, just meaningless amino acid noise.
It has no concept of language, questions, or communication.
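The UNK flood is easy to reproduce with a character-level amino-acid lookup (a sketch; real protein tokenizers differ in details such as special tokens):

```python
AMINO = set("MALWKPVGIFYCHRNDEQST")  # the 20 amino-acid letters
UNK = "UNK"

def encode_protein(text):
    # Character-level lookup: known amino-acid letters pass, all else -> UNK.
    return [c if c in AMINO else UNK for c in text]

print(encode_protein("MALWKLP"))  # valid fragment: every letter is known
print(encode_protein("Ankara"))   # only the capital A survives; the rest is UNK
```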
④ Cable factory model
Step 1 — Tokenize (production data → numbers). An engineer enters a new production setup:
“[MATERIAL]” → 1 | “PVC_A7” → 34 | “[TEMP]” → 2 | “175” → 412 | “[SPEED]” → 3 | “15” → 287 | “[PRESSURE]” → 4 | “8” → 193 | “[PREDICT]” → 5
The model receives: [1, 34, 2, 412, 3, 287, 4, 193, 5]. Not a sentence. A structured production specification.

Step 2 — Model processes. The transformer outputs token 601.

Step 3 — Detokenize (numbers → results). 601 → “tensile_PASS”.

Step 4 — Repeat. Next tokens: 622 → “elongation_420” | 709 → “flame_V0” | 685 → “shore_85”.
Final output: tensile_PASS elongation_420 flame_V0 shore_85
The engineer now knows — before manufacturing — that this setup will pass all tests. 800 tokens. 10M parameters. Runs on a laptop. No human language at any step.
Bonus — what if we ask: Ankara’nın başkenti nedir?
Step 1 crashes. The tokenizer knows:
[MATERIAL], PVC_A7,
[TEMP], 175, [SPEED], tensile_PASS — 800 tokens,
all production codes and test results. Not a single human word. “Ankara” is not a material.
“Başkent” is not a machine setting. “Nedir” is not a test result.
The input cannot be tokenized. Even if you forced random token mappings, the model would output something like
shore_72 elongation_310 flame_V1 — a meaningless production prediction.
It has never encountered a human sentence in its entire existence.
It doesn’t know humans exist. It knows cables.
⑤ DNA / Genomics model
Step 1 — Tokenize (nucleotides → numbers). A gene fragment:
“A” → 1 | “T” → 2 | “G” → 3 | “C” → 4 | “G” → 3 | “A” → 1 | “T” → 2
The model receives: [1, 2, 3, 4, 3, 1, 2]. Seven numbers. The model doesn’t know what DNA is.

Step 2 — Model processes. Given this context, the transformer predicts: token 4 (C) is most likely next.

Step 3 — Detokenize. 4 → “C”.

Step 4 — Repeat. The model generates the rest of the sequence, which can then be analyzed for gene function, mutation risk, or regulatory patterns.
Vocabulary: 7 tokens. The smallest possible “language.” Same transformer. Same process.
Bonus — what if we ask: Ankara’nın başkenti nedir?
Step 1 crashes. The vocabulary is:
A, T, C, G, START, END, UNK. Seven tokens.
“Ankara” becomes [A, UNK, UNK, A, UNK, A] — it can only see the letter A because
Adenine happens to share that symbol. The rest is unknown. The model would output something like
T G C A A T G C — a DNA sequence fragment. Not a word. Not a sentence.
A string of nucleotides. It has never seen a human language. It has seven tokens.
It cannot even represent the alphabet, let alone form a thought.
The same four-step loop runs in every domain:
1. Domain input → tokenizer → integer sequence
2. Integer sequence → transformer → predicted next integer
3. Predicted integer → tokenizer (reverse) → domain output
4. Repeat until done
And the Bonus examples reveal something even more important:
A domain-specific model does not “speak.” It does not know what human language is. It does not know what a question is. It does not know what a conversation is. It has never seen a word. When you type
Ankara’nın başkenti nedir? into a music model, the input cannot
even enter the machine — the tokenizer has no mapping for human words. When you force it into a protein
model, you get random amino acids back. When you force it into a factory model, you get cable test results.
When you force it into a DNA model, you get nucleotides.

This is the critical distinction: an LLM is just one type of transformer model — one where the tokenizer happens to map human words to numbers, and the training data happens to be human conversations and text. That’s what gives it the ability to “talk.” Remove the word-based tokenizer, train on MIDI files instead of Wikipedia, and you get a model that composes music but couldn’t say “hello” if its life depended on it. The transformer engine is identical. The tokenizer decides what world the model lives in.
People know that LLMs convert words to numbers internally. What they often miss is that domain-specific models don’t convert words to numbers — they were never designed to receive words at all. Their tokenizer speaks a completely different language: notes, amino acids, machine codes, nucleotides. They don’t “know about” their domain through language — they think in their domain’s native tokens, the way an LLM thinks in words.
Now picture the damage the LLM hype causes in practice. A cable factory needs to predict test results for a new material-and-machine configuration. The “AI = LLM” mindset says: build (or buy) a language model. So they start. Phase 1: train a tokenizer on text — weeks. Train the base model on billions of words so it learns to talk — months, hundreds of thousands of dollars in compute. Phase 2: fine-tune it on domain documents — more weeks, more failed runs, more cost. Phase 3: reinforcement learning to improve accuracy — more days, more weeks. And after all of that, what is the actual input to this colossal system? A chat message:
“Hello, the materials are XLPE, CAT113, RAL9100 dye. Machine settings: extruder speed 12,
temperature 185, pressure 8. What will the test results be?”

Read that input again. Really read it. You spent months teaching a machine to understand human language, just to type a sentence that is already structured data pretending to be a conversation. The model now has to parse your natural language back into the structured values you already had, hope it doesn’t hallucinate, and produce a natural-language answer that you then have to parse again to extract the actual numbers. You added an entire human-language layer — costing months and fortunes — as a detour around the direct path.
The direct path? A domain tokenizer with 800 tokens. Input:
[1, 34, 2, 412, 3, 287, 4, 193, 5].
Output: tensile_PASS elongation_420 flame_V0 shore_85. No conversation. No parsing.
No hallucination. 10M parameters. Trained in hours on actual production records. Runs on a laptop.
The entire LLM pipeline — months of pretraining, fine-tuning, reinforcement learning, prompt
engineering — existed only to add a chat interface on top of what should have been a direct
sequence-to-sequence prediction. That is the cost of not understanding tokenization.

This is why understanding tokenizers was the most important first step of our journey. It wasn’t just about Turkish morphology. It was about understanding that the tokenizer is the entire interface between any domain and the machine that learns from it. Change the tokenizer, change the world the model inhabits. The engine stays the same.
An LLM is a human-language-domain-specific transformer. Nothing more, nothing less. It is not “artificial intelligence.” It is one application of a sequence-learning architecture to one particular domain: human text. AI is not equal to LLM.
Once tokenization is truly understood, this stops being a semantic argument and becomes an engineering revelation. It is not about “talking Turkish to an English protein model.” A protein model does not talk at all — not in Turkish, not in English, not in any human language. It communicates in amino acid sequences. A factory model communicates in production codes. A music model communicates in notes. These are entirely different modes of communication, as alien to human language as sonar is to speech.
And this is exactly why the current industry obsession with ever-larger LLMs is a dead end for real-world problems. A 500-billion-parameter model that “talks” impressively is spectacular as a demo. But ask it to predict whether a cable will pass a tensile test given specific extrusion parameters, and it will hallucinate a plausible-sounding paragraph that is entirely wrong — because it has never seen a production record. It learned language patterns, not physics. Studies consistently show that roughly 95% of enterprise LLM implementations fail to deliver real value. The reason is not that the technology is bad. The reason is that the tool is wrong for the job. Companies are trying to solve domain-specific sequence problems with a human-conversation machine — and then wondering why it doesn’t work.
The tragedy is that this failure is often blamed on “AI not being ready,” when in fact AI is ready — just not in the form most people have been sold. A 10-million-parameter domain model with 800 tokens, trained on actual production data, will outperform a trillion-parameter LLM on that domain every single time — at a fraction of the cost, running on a laptop, with no hallucinations, because every token in its vocabulary maps to something real.
The hype conflated “AI” with “chatbot,” and that conflation costs industries billions. Understanding tokenization is the way out. Once you see that the transformer is a universal engine and the tokenizer is a swappable lens, the entire landscape changes. The question is no longer “how do I make the LLM understand my factory?” The question becomes: “what tokenizer does my factory need?”
7. WHAT THE ARCHITECTURE TAUGHT US ABOUT REASONING
If Sections 5 and 6 showed us that the transformer is a universal sequence learner — same four steps, any domain — this section asks: how does a sequence learner develop something that looks like reasoning? Understanding architecture required understanding what “reasoning” actually means inside a neural network — and what it doesn’t. Remember: everything below applies not just to LLMs, but to any sequence model — the same mechanisms that let a language model “reason” about Turkish let a protein model “reason” about folding. (A detailed architecture research page will follow, like the tokenizer report.)
The training pipeline (sequential, not a choice)
- Pretraining (“training”) — learn language & patterns
- SFT (“fine-tuning”) — learn format
- RLVR (reinforcement learning) — learn reasoning
These are not alternatives. They’re sequential phases, each teaching fundamentally different things:
| Phase | Input | Algorithm | What It Teaches |
|---|---|---|---|
| Pretraining (what people usually call “training”) | Raw text (no QA pairs) | Predict next token at every position | Language, facts, reasoning patterns |
| SFT (what people usually call “fine-tuning”) | Clean instruction-response pairs | Same (next-token prediction) | How to follow instructions. NOT reasoning. |
| RLVR (reinforcement learning with verifiable rewards) | Problems with verifiable answers | Generate → verify → reward/penalize | Self-correction, decomposition, genuine reasoning |
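The RLVR row can be sketched as a loop: generate a candidate answer, verify it, assign reward. This toy replaces generation with guessing and omits the policy-gradient update a real RLVR trainer (e.g. PPO- or GRPO-style) would perform:

```python
import random

def verify(problem, answer):
    """Verifiable reward: exact check against the known result."""
    return 1.0 if answer == problem["solution"] else 0.0

def sample_answer(problem, rng):
    """Stand-in for model generation: guesses near the true value."""
    return problem["solution"] + rng.choice([-1, 0, 0, 1])

problem = {"question": "17 * 23 = ?", "solution": 391}
rng = random.Random(0)

rewards = [verify(problem, sample_answer(problem, rng)) for _ in range(8)]
print(rewards)
# Each rollout earns 1.0 only if its final answer was exactly 391.
# A policy-gradient step would now upweight the reasoning behind the 1.0s.
```

The key property: the reward signal costs nothing to compute and cannot be gamed by answers that merely look plausible.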
What generalizes vs what doesn’t
| Capability | How Learned | Generalizes? |
|---|---|---|
| Facts (“Ankara is capital”) | Memorized from data | No — only knows what it saw |
| Small arithmetic (2+3=5) | Pattern memorization | Partially (up to ~4–5 digits) |
| Large arithmetic (234871...+12309...) | Would need precise computation | No — LLMs fail reliably |
| Logical structure (A→B, B→C ⇒ A→C) | Learns abstract transformation in vector space | Yes — transfers to new content |
| Problem decomposition | Learns structural pattern | Yes — transfers across domains |
| Tool use (“this needs a calculator”) | Learns WHEN to delegate | Yes — genuine generalization |
The model doesn’t memorize “2+3=5.” It learns the structure of addition from thousands of examples. For small numbers, this works. For large numbers, it fails — because precise multi-digit carry operations exceed what next-token prediction can reliably do. The real generalization is knowing WHAT to do (“this needs a calculator”), not doing the computation itself.
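Knowing WHAT to do, rather than doing the arithmetic, can be sketched as tool delegation. The routing rule here is hand-coded for illustration; in a real model the decision to emit a tool-call token is itself learned:

```python
def model_decides(expression):
    """Toy stand-in for a learned decision: small sums in-model, big ones delegated."""
    a, b = expression.split("+")
    if len(a.strip()) <= 4 and len(b.strip()) <= 4:
        return "in_model"      # small arithmetic: memorized patterns suffice
    return "use_calculator"    # large arithmetic: next-token prediction is unreliable

def calculator(expression):
    a, b = expression.split("+")
    return int(a) + int(b)     # exact, unlike token-by-token digit generation

expr = "2348716652 + 1230954437"
if model_decides(expr) == "use_calculator":
    print(calculator(expr))    # 3579671089
```

The generalizing skill is the routing decision; the exact computation is delegated to a tool that cannot make carry errors.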
How self-correction works (mechanistically)
An LLM doesn’t “realize” errors the way we do. At each token position, the attention mechanism can attend to all previous tokens. As more context accumulates, inconsistencies become statistically detectable — the probability distribution shifts toward correction tokens. “Backtracking” is not real backtracking: the model generates new tokens that redirect (“wait, that’s wrong…”). The wrong tokens remain in context.
This self-correction ability comes from RL training, not from seeing error-correction patterns in data. RL rewards reasoning chains that self-correct AND reach correct answers. The model discovers that “check your work” is a rewarding strategy.
Q: Is it reasoning, or is it mimicry?
The honest answer: we don’t know. The model learns reasoning patterns from data. When it encounters a new problem, it applies those patterns. Is that “real reasoning” or “sophisticated pattern matching”? The debate is unresolved. Evidence is mixed: models solve novel problems (suggesting generalization beyond mimicry) but also fail on trivially modified versions of problems they ace (suggesting pattern matching).
Our practical answer: the distinction may not matter. What matters is: does the model arrive at correct answers on novel problems? That’s measurable. RLVR pushes the model from shallow mimicry toward robust application by rewarding correctness, not looking correct.
Q: What is “correct” for a language model?
| Domain | What “Correct” Means | Verifiable? |
|---|---|---|
| Math | The answer is right (2+2=4) | Yes |
| Code | It compiles and passes tests | Yes |
| Logic | Conclusion follows from premises | Mostly yes |
| General language | Coherent, relevant, preferred by humans | No — subjective |
Q: LLM reasoning = search algorithms?
An insight from the architecture discussion: LLM self-correction resembles tree search (explore paths, evaluate, redirect). But with critical differences:
- The tree doesn’t exist beforehand — it’s generated token by token
- No real backtracking — only forward corrections (“wait, that’s wrong…”)
- For general language, there is no “correct” node — only for verifiable domains (math, code, logic)
Research formalizes this as Tree of Thoughts, Process Reward Models, and MCTS for LLMs. The analogy holds structurally but breaks mechanistically. Still, it implies: small models can “search” well if given sufficient thinking budget (extended thinking = bigger search budget).
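The “bigger search budget” claim can be sketched as best-of-N sampling against a verifier. Candidates here are random guesses standing in for full reasoning chains sampled from a model:

```python
import random

def verifier(candidate):
    """For verifiable domains, correctness is a cheap check."""
    return candidate == 391          # 17 * 23

def sample_candidate(rng):
    """Stand-in for sampling one full reasoning chain from a small model."""
    return rng.randint(380, 400)

def solve(budget, seed=0):
    rng = random.Random(seed)
    for _ in range(budget):          # more budget = more paths explored
        c = sample_candidate(rng)
        if verifier(c):
            return c
    return None

print(solve(budget=2))     # small budget: may miss, depending on the draws
print(solve(budget=1000))  # large budget: almost surely finds 391
```

This is the structural sense in which extended thinking helps a small model: each extra sampled chain is another node explored, and the verifier decides which node counts as correct.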
8. DESIGN PHILOSOPHY: LESS IS MORE
We are not committed to a single scale. We are open to 100M, 360M, 1B, 2B, 3B, or 4B — and smaller does not mean limited. The belief: with an extremely optimal architecture and pretraining recipe, smaller models can match or approach bigger ones.
As we discovered in Section 5, this extends far beyond LLMs. Every domain with sequential structure can have its own tiny, accurate model. The world is moving toward specialized models combined into orchestras — and we are positioned to build them.
9. IMPORTANT DECISIONS (LOCKED)
| Decision | Choice | Rationale |
|---|---|---|
| Tokenizer | 64K BPE v3 (our own) | ~14% better than Kumru/TabiBERT, ~2.7× better than GPT-4 |
| Architecture | Decoder-only | Standard for generative reasoning LLMs; encoder can be separate component |
| Parameter range | 100M–4B | “Less is more” — optimal architecture can punch above its weight |
| Context length | 128K tokens | Process entire legal cases, theses, books in one pass |
| Position encoding | RoPE under question | Prior fine-tuning showed terrible long-context results with RoPE. Prefer ALiBi/learned or validated fix. |
| Training pipeline | Pretraining (training) → SFT (fine-tuning) → RLVR (reinforcement learning) | Sequential phases, not alternatives; each teaches something different |
| SFT data quality | Crystal-clear only | Confirmed: mistakes in SFT data = model learns to produce mistakes |
| Literature search | Required before deep decisions | Use arXiv, HF, ACL, not just Google. Avoid overconfident outdated advice. |
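The table flags RoPE as under question, with ALiBi as one candidate replacement. As a reference point only (not a locked design choice), ALiBi adds a fixed, head-specific linear distance penalty to attention logits, which is why it extrapolates past the training context length. A minimal sketch of the published slope recipe, assuming the head count is a power of two:

```python
def alibi_slopes(n_heads):
    # Geometric head slopes from the ALiBi paper, assuming n_heads is a
    # power of two: 2^(-8/n), 2^(-16/n), ..., down to 2^(-8)
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(slope, seq_len):
    # Lower-triangular bias added to attention logits before softmax:
    # -slope * (query_pos - key_pos). Nothing is learned or rotated,
    # so nothing is tied to the training context length.
    return [[-slope * (q - k) for k in range(q + 1)] for q in range(seq_len)]

slopes = alibi_slopes(8)           # [0.5, 0.25, ..., 1/256]
row = alibi_bias(slopes[0], 4)[3]  # biases seen by the 4th query position
# → [-1.5, -1.0, -0.5, 0.0]
```

Whether this actually fixes the long-context failures we saw with RoPE still needs the validation the table calls for.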
10. DATA STRATEGY
This section focuses on data for our Turkish LLM. But through the lens of Section 5: everything described below is a template. Swap “Turkish text” for “protein sequences” or “production logs,” and the same pipeline structure applies — just with a different tokenizer and different domain data.
Pretraining data — “training” (quantity, diverse)
Raw Turkish text — no QA pairs, no formatting. The model reads continuous text and predicts the next token at every position.
| Source | Purpose | Language mix / notes |
|---|---|---|
| Turkish Wikipedia, news, books, forums | Language structure, grammar, fluency | Overall mix: 80–90% Turkish, 10–20% English |
| Legal, medical, scientific, financial text | Domain vocabulary, formal reasoning | |
| Code (Python, etc.) | Logical structure, precise reasoning | English code aids cross-lingual transfer |
| Math texts, scientific papers | Reasoning patterns, formal arguments | |
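The "no QA pairs, no formatting" point can be made concrete: pretraining data is just a continuous token stream, and the target at every position is the same stream shifted by one. A toy sketch (integer stand-ins for token ids; the helper name is hypothetical):

```python
def causal_lm_pairs(token_ids, block_size):
    """Slice a continuous token stream into (input, target) blocks where
    the target is the input shifted left by one: next-token prediction
    needs no labels beyond the raw text itself."""
    pairs = []
    for i in range(0, len(token_ids) - block_size, block_size):
        x = token_ids[i : i + block_size]
        y = token_ids[i + 1 : i + 1 + block_size]
        pairs.append((x, y))
    return pairs

stream = list(range(10))  # stand-in for a tokenized corpus
pairs = causal_lm_pairs(stream, block_size=4)
# pairs[0] → ([0, 1, 2, 3], [1, 2, 3, 4])
```

This is why quantity and diversity dominate at this stage: every position in every document is a training signal for free.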
SFT data — “fine-tuning” (quality, clean)
Turkish instruction-response pairs. Clean, no mistakes. Teaches format, not reasoning.
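"Teaches format, not reasoning" shows up mechanically in how SFT loss is usually computed: a common recipe (not necessarily ours; the helper name is hypothetical) masks the loss so the model is penalized only on response tokens, never on reproducing the instruction:

```python
def sft_loss_mask(prompt_len, total_len):
    """Loss mask for one instruction-response pair:
    0 = ignore (prompt tokens), 1 = compute loss (response tokens)."""
    return [0] * prompt_len + [1] * (total_len - prompt_len)

# e.g. a "Soru: ... / Cevap: ..." pair tokenized to 7 tokens, 3 of them prompt
mask = sft_loss_mask(prompt_len=3, total_len=7)
# → [0, 0, 0, 1, 1, 1, 1]
```

Because the gradient only ever points toward "reproduce this response," any mistake in the response is learned verbatim, which is exactly the failure mode described in the SFT data quality decision above.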
RLVR data — reinforcement learning (verifiable problems)
Math (GSM8K-style, competition math), code problems, logic puzzles. Can be translated to Turkish.
Math and logic are language-light — 17 × 23 = ? works in any language.
This is where reasoning is actually trained.
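The "verifiable" part is the whole point: the reward is a binary check on the final answer, with no judge model and no style scoring. A minimal sketch, assuming the convention that the answer is the last integer in the rollout (a format assumption, not a fixed spec):

```python
import re

def verifiable_reward(completion, gold_answer):
    """RLVR-style binary reward: 1.0 only if the rollout's final number
    matches the gold answer, else 0.0. No partial credit for style."""
    numbers = re.findall(r"-?\d+", completion)
    return 1.0 if numbers and int(numbers[-1]) == gold_answer else 0.0

# language-light: the same check scores Turkish and English rollouts alike
r_good = verifiable_reward("Adim adim: 17 * 23 = 391", 391)  # 1.0
r_bad = verifiable_reward("Sonuc: 390", 391)                 # 0.0
```

Because only the final answer is rewarded, the model is free to discover its own intermediate reasoning, which is the SFT-vs-RLVR distinction drawn in the North Star section.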
11. WHAT’S NEXT
Even if the final model isn’t the best in the world, the person who deeply understands every layer of the stack is more dangerous than the person who trains the biggest model. The biggest model is just money. Understanding is leverage.
12. REPO SNAPSHOT
| Path | What It Contains |
|---|---|
| tokenizers/turkish_bpe_64k/ | Selected tokenizer (64K BPE v3) |
| tokenizers/turkish_bpe_{16k,32k,48k}_*/ | All experimental versions preserved |
| tokenizers/kumru_2b_reference/ | Kumru baseline for comparison |
| data/processed/ | 22 GB training corpus (27 files, 11 domains) |
| train_tokenizer.py | Tokenizer training script |
| benchmark_tokenizers.py | 104-sentence benchmark (21 core + 83 hard/edge-case) |
| docs/tokenizer-research.html | Full tokenizer research report (EN) |
| docs/tokenizer-research_tr.html | Full tokenizer research report (TR) |
| docs/project-context.html | This file — the journey document |
| reference_architecture/ | Config examples, literature review, README |
| PROJECT_CONTEXT.md | Machine-readable project context (for AI sessions) |
It also broke the biggest illusion in the industry: that AI equals LLM. It doesn’t. An LLM is a human-language-domain-specific transformer — one application of a universal engine to one particular domain. Once you see that, you see why trillion-parameter chatbots fail at factory floors, why 95% of enterprise LLM projects collapse, and why the answer was never “make the LLM bigger.” The answer is: build the right tokenizer for the right domain, and let a tiny model do what a giant one never could. We went from byte-pair encoding to Nietzsche to industrial economics in a single conversation.
The tokenizer opened the first door. Architecture opened the second. There are more doors ahead — pretraining, SFT, RLVR, orchestration, domain models. Each one will teach something that no paper or course can: the understanding that comes from building it yourself, hitting walls, and figuring out why. And every lesson will reinforce the same truth: the transformer is the engine, the tokenizer is the lens, and the world is full of domains waiting for their own small, precise, purpose-built models.
This is a living document. It will grow with every phase completed, every decision made, every insight earned.
© 2026 • Independent Research • Tokenizer Report • Tokenizer Raporu (TR)