ARCHITECTURE & PRETRAINING
Phase 2 & 3 — Two Generations of Turkish Transformers: 24.7M (v1) & 67.6M (v2) from Scratch
d_model=512, 2:1 GQA, 4.2× more transformer parameters, optimized as a context
converter rather than a knowledge base. All rounds ran on NVIDIA H100 ($2.38/hour) with bfloat16
mixed precision and torch.compile. Total v1 pretraining cost: approximately $92.83.
1. MOTIVATION: FROM TOKENIZER TO MODEL
Phase 1 produced a 64K Turkish BPE tokenizer that is ~14% more efficient than existing Turkish tokenizers and ~2.7× more efficient than GPT-4 on Turkish text. The natural question follows: can this tokenizer serve as the foundation for a purpose-built Turkish language model?
The objective is not to compete with production LLMs. The objective is threefold: (1) validate the tokenizer in an end-to-end training pipeline, (2) establish every component of the training infrastructure from scratch, and (3) produce models that demonstrably learn Turkish language patterns and can serve as domain-specific RAG assistants. This led to two generations: v1 (24.7M params) to prove the pipeline works, and v2 (67.6M params) purpose-built for RAG context comprehension.
2. V1 ARCHITECTURE OVERVIEW
The v1 model is a decoder-only Transformer with 24,697,088 parameters. Every component was selected against specific alternatives; no default was accepted without justification. All architectural decisions below (ALiBi, GQA, SwiGLU, RMSNorm, weight tying) carry forward to v2 — only the dimensions change.
| Component | Choice | Rejected Alternative | Rationale |
|---|---|---|---|
| Architecture | Decoder-only | Encoder-decoder, Encoder-only | Autoregressive generation is the goal; encoder stack adds unnecessary cross-attention |
| Position encoding | ALiBi | RoPE, Learned, Sinusoidal | Zero learned params; train-short-test-long generalization; RoPE unreliable at extrapolation |
| Attention | GQA (4:1) | MHA (8:8), MQA (8:1) | 75% KV parameter reduction vs MHA; retains multi-view capacity unlike MQA |
| FFN activation | SwiGLU | ReLU, GELU, GeGLU | Gating outperforms on a per-parameter basis; 3 projections at (8/3)×d maintain the budget |
| Normalization | RMSNorm | LayerNorm, BatchNorm | ~10-15% faster per layer; mean subtraction is redundant with pre-norm residual |
| Norm placement | Pre-norm | Post-norm | Unobstructed residual gradient path; stable training without careful LR tuning |
| Output projection | Weight tying | Separate lm_head | Saves 16.4M parameters (66% of model); embedding serves dual purpose |
| Linear layers | No bias | With bias | Redundant with RMSNorm re-centering; simplifies weight decay |
| Dropout | 0.0 | 0.1–0.3 | Pretraining goal is to absorb data, not regularize; underfitting is the risk |
Configuration
| Parameter | Value | Derivation |
|---|---|---|
| vocab_size | 64,000 | Phase 1 tokenizer; 64K × 256 = 16.4M embedding params |
| d_model | 256 | Minimum for head_dim=32 with 8 heads |
| n_layers | 12 | Depth over width at small scale (SmolLM2 finding) |
| n_heads | 8 | 8 attention patterns; head_dim = 256/8 = 32 |
| n_kv_heads | 2 | GQA 4:1 ratio; 4 query heads share each KV head |
| d_ff | 688 | ≈ (8/3) × 256; preserves FFN param budget with SwiGLU’s 3 matrices |
| max_seq_len | 512 | ALiBi generalizes beyond training length; 512 is memory-safe on consumer GPU |
| dropout | 0.0 | No regularization during pretraining |
| weight_tying | true | Embedding = LM head; saves 16.4M params |
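The configuration and the resulting parameter count can be cross-checked with a short script. This is a minimal sketch (field names are illustrative; the actual tiny_llm/config.py may differ), reproducing the 24,697,088 total from the budget table in Section 7:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values from the v1 configuration table; field names are illustrative.
    vocab_size: int = 64_000
    d_model: int = 256
    n_layers: int = 12
    n_heads: int = 8
    n_kv_heads: int = 2
    d_ff: int = 688
    max_seq_len: int = 512

def count_params(cfg: ModelConfig) -> int:
    """Reproduce the 24,697,088 total from the component sizes."""
    head_dim = cfg.d_model // cfg.n_heads                  # 32
    embedding = cfg.vocab_size * cfg.d_model               # 16,384,000 (tied LM head adds nothing)
    attn = (cfg.d_model * cfg.d_model                      # Q projection
            + 2 * cfg.d_model * cfg.n_kv_heads * head_dim  # K and V (GQA-reduced)
            + cfg.d_model * cfg.d_model)                   # O projection
    ffn = 3 * cfg.d_model * cfg.d_ff                       # gate, up, down (SwiGLU)
    norms = 2 * cfg.d_model                                # pre-attn + pre-FFN RMSNorm
    # per-layer total × n_layers, plus embedding and the final norm
    return embedding + cfg.n_layers * (attn + ffn + norms) + cfg.d_model

print(count_params(ModelConfig()))  # → 24697088
```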
3. POSITIONAL ENCODING: ALiBi (NOT ROPE)
Positional encoding informs the model where tokens are in the sequence. The dominant approach in 2024–2026 is Rotary Position Embeddings (RoPE), used by Llama, Mistral, and Qwen. This work rejects RoPE in favor of ALiBi (Attention with Linear Biases, Press et al. 2022).
Why not RoPE
RoPE applies rotation matrices to query and key vectors based on position. While mathematically elegant, it exhibits two practical problems: (1) poor extrapolation beyond training length without ad-hoc scaling hacks (NTK-aware, YaRN, etc.), and (2) additional learned parameters that interact with the attention computation in ways that are difficult to debug at small scale. Prior empirical observation on this project confirmed unstable long-context behavior with RoPE.
ALiBi mechanism
ALiBi adds a linear distance penalty directly to attention scores. No parameters are learned. Each attention head receives a different slope (geometric sequence), creating a spectrum from sharp local attention to broad distant attention.
| Property | ALiBi | RoPE | Learned | Sinusoidal |
|---|---|---|---|---|
| Learned parameters | 0 | Implicit (rotations) | seq_len × d_model | 0 |
| Length generalization | Train short, test long | Requires scaling hacks | Hard-capped | Degrades |
| Implementation complexity | Low (bias matrix) | Medium (rotations) | Low | Low |
| Industry adoption | MPT, BLOOM | Llama, Mistral, Qwen | GPT-2 | Original Transformer |
Slope computation
For n attention heads, slopes form a geometric sequence: slope_i = 2^(−8i/n) for i ∈ {1, ..., n}. With 8 heads, slopes range from 2^(−1) = 0.5 (sharp, local focus) to 2^(−8) ≈ 0.0039 (broad, distant attention).
| Head | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Slope | 0.5 | 0.25 | 0.125 | 0.0625 | 0.0313 | 0.0156 | 0.0078 | 0.0039 |
Behavior ranges from strong local focus (steep slopes, first heads) through medium range to broad, distant attention (shallow slopes, last heads).
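The slope sequence is trivially reproducible; a sketch matching the table above:

```python
def alibi_slopes(n_heads: int) -> list[float]:
    # slope_i = 2^(-8i/n) for heads i = 1..n: a geometric sequence from
    # 0.5 (sharp local attention) down to ~0.0039 (broad distant attention)
    return [2 ** (-8 * i / n_heads) for i in range(1, n_heads + 1)]

print(alibi_slopes(8))
# → [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```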
A critical bug surfaced here during implementation: future positions initially received the soft ALiBi distance penalty instead of hard -inf. The causal mask was effectively leaking — the model could attend to future tokens with a soft distance penalty rather than being blocked entirely. This would have produced a model that appears to train normally but fails catastrophically at inference (where future tokens are unavailable). The bug was caught by test #22 of the validation suite (Section 11) and corrected before training began.
4. GROUPED QUERY ATTENTION
Standard Multi-Head Attention (MHA) assigns independent Key and Value projections to each attention head. Grouped Query Attention (GQA, Ainslie et al. 2023) shares KV projections across groups of query heads, reducing memory and parameter cost without proportional quality loss.
| Variant | Q Heads | KV Heads | KV Params/Layer | Quality |
|---|---|---|---|---|
| MHA (8:8) | 8 | 8 | 131,072 | Maximum |
| GQA (8:2) | 8 | 2 | 32,768 | Near-MHA |
| MQA (8:1) | 8 | 1 | 16,384 | Degraded |
The 4:1 ratio saves ~98K parameters per layer × 12 layers = 1.18M parameters total versus MHA. At 24.7M total, this is a 4.8% budget reallocation. Four query heads share each KV head via tensor repetition (_repeat_kv): the KV tensor is expanded along the head dimension without copying data.
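A minimal sketch of the _repeat_kv expansion described above (the project's actual helper may differ in detail): unsqueeze + expand create a stride-0 view along a new repeat axis, so no data is copied until a downstream op requires contiguous memory.

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand KV heads to match the query head count.
    kv: (batch, n_kv_heads, seq, head_dim) -> (batch, n_kv_heads * n_rep, seq, head_dim)
    """
    if n_rep == 1:
        return kv
    b, h_kv, s, d = kv.shape
    # expand is a free view (stride 0 on the repeat axis); the final reshape
    # only materializes a copy if the consumer needs contiguous memory
    return kv[:, :, None, :, :].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)
```

With 8 query heads and 2 KV heads (n_rep=4), output heads 0–3 all read KV head 0 and heads 4–7 read KV head 1.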
Projection dimensions
| Projection | Shape | Parameters |
|---|---|---|
| Q (query) | 256 × 256 | 65,536 |
| K (key) | 256 × 64 | 16,384 |
| V (value) | 256 × 64 | 16,384 |
| O (output) | 256 × 256 | 65,536 |
| Total attention/layer | — | 163,840 |
5. SwiGLU FEED-FORWARD NETWORK
The standard Transformer FFN uses two projections with a nonlinear activation: FFN(x) = W_2 · σ(W_1 · x). SwiGLU (Shazeer 2020) replaces this with a gated variant using three projections: FFN(x) = W_down · (SiLU(W_gate · x) ⊙ W_up · x).
Parameter equivalence
Standard FFN with expansion factor 4: 2 × d × 4d = 8d² parameters. SwiGLU uses 3 matrices: 3 × d × d_ff. Setting 3 × d × d_ff = 8d² gives d_ff = (8/3)d ≈ 683, rounded to 688 for d = 256. Total FFN parameters per layer:
| FFN Type | Matrices | d_ff | Params/Layer |
|---|---|---|---|
| Standard (ReLU/GELU) | 2 | 1,024 | 524,288 |
| SwiGLU | 3 | 688 | 528,384 |
Nearly identical parameter budget (+0.8%), empirically superior activation function. The gating mechanism allows the network to selectively amplify or suppress information — a capability that standard FFNs lack.
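A minimal sketch of the SwiGLU block as described (module and projection names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN: down(SiLU(gate(x)) * up(x)), all projections bias-free."""
    def __init__(self, d_model: int = 256, d_ff: int = 688):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)  # W_gate
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)    # W_up
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(W_gate x) gates W_up x elementwise, then projects back to d_model
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```

At d_model=256, d_ff=688 the module holds 3 × 256 × 688 = 528,384 parameters, matching the table.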
6. NORMALIZATION & RESIDUAL DESIGN
RMSNorm over LayerNorm
LayerNorm (Ba et al. 2016) performs two operations: mean subtraction and variance normalization.
RMSNorm (Zhang & Sennrich 2019) drops the mean subtraction entirely, computing only the root mean square.
With 12 layers × 2 norms per layer = 24 norm operations per forward pass, the cumulative speedup
is measurable. Each RMSNorm has exactly d_model = 256 learnable parameters (the scale vector).
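A minimal RMSNorm sketch matching the description (no mean subtraction, no bias, exactly d_model learnable scales):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale each vector by the reciprocal of its root mean square."""
    def __init__(self, dim: int = 256, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # the only learnable params
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt of the mean square along the feature dim; eps guards zero input
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```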
Pre-norm residual structure
Each Transformer block follows the pattern:
x = x + Attention(RMSNorm(x))  # residual path is unobstructed
x = x + FFN(RMSNorm(x))        # gradient flows directly through addition
The alternative — post-norm — places the normalization after the residual addition:
x = RMSNorm(x + Attention(x)). This creates a gradient bottleneck through the norm layer.
Pre-norm eliminates this bottleneck, producing more stable training in deep networks without
requiring careful learning rate tuning.
7. V1 PARAMETER BUDGET ANALYSIS
At 24.7M parameters, every allocation decision is visible. The embedding layer dominates — a structural consequence of pairing a large vocabulary (64K) with a small hidden dimension (256).
| Component | Parameters | % of Total | Notes |
|---|---|---|---|
| Token embedding (tied) | 16,384,000 | 66.3% | 64,000 × 256; shared with output projection |
| Attention (all layers) | 1,966,080 | 8.0% | 163,840 per layer × 12 |
| FFN / SwiGLU (all layers) | 6,340,608 | 25.7% | 528,384 per layer × 12 |
| RMSNorm (all layers + final) | 6,400 | <0.1% | 256 per norm × 25 norms |
| LM head | 0 | 0% | Weight tying: reuses embedding |
| Total | 24,697,088 | 100% | |
Weight initialization
All 2D weight matrices are initialized with Normal(0, 0.02). Output projections
(o_proj, down_proj) are additionally scaled by 1/√(2 × n_layers) =
1/√24 ≈ 0.204 to prevent residual stream explosion through 12 layers of additive
contributions. This follows the GPT-2 initialization scheme.
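A sketch of this initialization scheme, assuming the output projections are reachable by the names o_proj and down_proj used in the tables above (the actual model.py may structure this differently):

```python
import math
import torch
import torch.nn as nn

def init_weights(model: nn.Module, n_layers: int = 12) -> None:
    """GPT-2-style init: N(0, 0.02) everywhere, output projections scaled down."""
    scale = 1.0 / math.sqrt(2 * n_layers)  # 1/sqrt(24) ≈ 0.204 for 12 layers
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            # residual-stream contributions are damped so 12 additive layers
            # do not inflate activation variance
            if name.endswith(("o_proj", "down_proj")):
                module.weight.data.mul_(scale)
```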
8. TRAINING CORPUS CURATION
The full Phase 1 corpus spans 22 GB across 11 domains. For pretraining a 24.7M model, a curated subset of ~440 MB (raw text) was selected. The selection criteria: (1) prioritize raw text over structured data, (2) include reasoning-oriented material, (3) exclude instruction-tuning data (reserved for SFT in Phase 4), (4) cap Wikipedia to prevent encyclopedic dominance.
| File | Raw Size | Tokens | % of Corpus | Selection Rationale |
|---|---|---|---|---|
| wikipedia_tr.txt | 310 MB (capped from 866 MB) | 67,407,762 | 67.7% | General knowledge, encyclopedic Turkish |
| orca_math_tr.txt | 117 MB | 28,648,305 | 28.8% | Mathematical reasoning (north star objective) |
| tdk_full.txt | 9.5 MB | 2,398,611 | 2.4% | Official dictionary; proper word usage |
| turkish_folk_songs.txt | 2.3 MB | 653,972 | 0.7% | Colloquial/emotional Turkish |
| turkish_idioms_proverbs.txt | 1.6 MB | 369,462 | 0.4% | Idiomatic language, cultural knowledge |
| turkish_poems.txt | 0.4 MB | 100,141 | 0.1% | Literary Turkish, complex grammar |
| turkish_mmlu_exams.txt | 0.2 MB | 40,187 | <0.1% | Academic/exam format, broad vocabulary |
| literary_short_stories.txt | 0.1 MB | 18,596 | <0.1% | Narrative structure, dialogue |
| Total | ~441 MB | 99,637,036 | 100% |
Excluded data in Round 1 (later used in Round 2)
| File | Size | R1 Exclusion Rationale | R2 Status |
|---|---|---|---|
| instruc_turca.txt | 3.7 GB | Instruction data; format learning before language learning is counterproductive | Included |
| rag_dataset_tr.txt | 31 MB | ERP-domain-specific; too narrow for general pretraining | Included |
| Wikipedia (remaining) | 556 MB | Capped at 300 MB to prevent encyclopedic bias | Uncapped |
| Academic, legal, medical… | ~16 GB | Not yet collected at time of Round 1 | Included |
9. DATA PIPELINE
The pipeline converts raw text to training batches in two offline steps, followed by real-time random sampling during training.
Step 1: Tokenization (offline, one-time)
Raw .txt files are split into documents by double newline, filtered (minimum 20 characters,
5 tokens), wrapped with BOS/EOS markers, and tokenized with the 64K BPE tokenizer. The resulting integer
sequence is stored as a contiguous uint16 binary file (190 MB). uint16 is the tight fit:
vocab_size = 64,000 ≤ 65,535 (the maximum uint16 value), so each token uses exactly 2 bytes with no waste.
Step 2: Memory-mapped random sampling
The binary file is memory-mapped via numpy.memmap. Each training sample is drawn by selecting
a random starting position and extracting a contiguous window of seq_len + 1 = 513 tokens.
The first 512 tokens serve as input; the last 512 (shifted by one) serve as targets. This approach has
no concept of epochs — with 99.6M tokens, nearly every token offset is a valid window start (~99.6M
possible positions), so samples draw fresh contexts throughout the run rather than cycling a fixed epoch order.
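A minimal sketch of the sampler described above, assuming the uint16 binary layout from Step 1 (function name and signature are illustrative):

```python
import numpy as np

def sample_batch(bin_path: str, batch_size: int, seq_len: int = 512, rng=None):
    """Draw random (input, target) windows from the memory-mapped token file."""
    rng = rng or np.random.default_rng()
    # memmap: the OS page cache serves random reads; nothing is loaded up front
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    # each sample needs seq_len + 1 tokens: input is w[:-1], target is w[1:]
    starts = rng.integers(0, len(data) - seq_len - 1, size=batch_size)
    windows = np.stack([data[s : s + seq_len + 1] for s in starts]).astype(np.int64)
    return windows[:, :-1], windows[:, 1:]
```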
| Pipeline Stage | Input | Output | Time |
|---|---|---|---|
| Tokenize (64K BPE) | 441 MB raw text (8 files) | 190 MB uint16 binary | ~2.5 min |
| Memory map | 190 MB binary | Random access via OS page cache | Instant |
| Sample | Random offset | (512 input, 512 target) tensor pair | <1 ms |
10. V1 TRAINING CONFIGURATION
| Hyperparameter | Value | Justification |
|---|---|---|
| Optimizer | AdamW | Decoupled weight decay; industry standard for Transformers |
| β1, β2 | 0.9, 0.95 | β2=0.95 (not 0.999): faster adaptation to changing gradient landscape |
| Learning rate | 3e-4 | Scaling law: smaller models tolerate higher LR |
| Min learning rate | 3e-5 | 10:1 ratio; prevents complete learning cessation in final steps |
| LR schedule | Cosine decay | More time at peak LR than linear; Chinchilla standard |
| Warmup | 500 steps | ~1% of training; stabilizes optimizer momentum estimates |
| Weight decay | 0.1 | Applied to 2D+ matrices only; norms exempt (their target is 1.0, not 0.0) |
| Gradient clipping | 1.0 (global norm) | Prevents catastrophic updates from gradient spikes; direction preserved |
| Max steps | 50,000 | 3.27B token-reads at batch=128; ~33 passes over corpus |
| Precision | bfloat16 (CUDA) / float32 (MPS) | bfloat16 has float32 range with float16 size; MPS lacks bfloat16 support |
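The warmup plus cosine schedule can be sketched in a few lines (values from the table above; the project's scheduler may differ in edge-case handling):

```python
import math

def lr_at(step: int, max_steps: int = 50_000, warmup: int = 500,
          peak: float = 3e-4, floor: float = 3e-5) -> float:
    """Linear warmup to the peak LR, then cosine decay to the 10:1 floor."""
    if step < warmup:
        return peak * (step + 1) / warmup          # linear ramp
    progress = (step - warmup) / (max_steps - warmup)
    # cosine from peak (progress=0) down to floor (progress=1)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```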
Effective batch size
| Setting | MPS (MacBook M4) | CUDA (H100 NVL) — Round 1 | CUDA (H100 NVL) — Round 2 |
|---|---|---|---|
| Physical batch | 4 | 128 | 128 |
| Gradient accumulation | 8 | 1 | 1 |
| Effective batch | 32 | 128 | 128 |
| Tokens per step | 16,384 | 65,536 | 65,536 |
| Total steps | — | 50,000 | 228,000 |
| Total token-reads | — | 3.27B | 14.94B |
11. VALIDATION SUITE: 60 PARANOID TESTS
Before committing compute to training, every component of the pipeline was validated by a 60-test suite covering 13 categories. The suite was designed to catch the class of bugs that produce models which appear to train normally but fail silently.
| Category | Tests | What It Validates |
|---|---|---|
| 1. Tokenizer | 8 | Loading, vocab size, special tokens, Turkish encoding, roundtrip, uint16 safety |
| 2. Model Architecture | 6 | Parameter count, weight tying on/off, layer count, hidden dimension |
| 3. RMSNorm | 4 | Shape preservation, unit scale, parameter count, zero-input stability |
| 4. ALiBi | 7 | Geometric slopes, exact values, shape, causal mask integrity, diagonal, distance penalty, generalization |
| 5. GQA | 4 | Output shape, 4:1 ratio, projection shapes, causal independence |
| 6. SwiGLU | 3 | Shape, 3 projections, per-layer parameter count |
| 7. Transformer Block | 3 | Shape, residual connections, pre-norm ordering |
| 8. Full Model | 7 | Logits shape, loss shape, initial loss sanity, gradient flow, NaN detection, learning verification |
| 9. Generation | 4 | Token production, max_tokens, determinism, valid ID range |
| 10. Data Pipeline | 5 | File existence, sizes, tokenizer path, directory writability |
| 11. LR Schedule | 3 | Warmup ramp, cosine decay endpoint, monotonic decrease |
| 12. Device | 3 | MPS/CUDA availability, model execution, ALiBi transfer |
| 13. Numerical Stability | 3 | Edge-case IDs, full-length sequences, gradient accumulation equivalence |
| Total | 60 | All passed prior to training |
The bug caught by test #22 deserves detail. build_alibi_bias computed relative distances as positions.unsqueeze(0) - positions.unsqueeze(1), which transposed the query/key axes. Future positions received the ALiBi distance penalty (e.g., -0.5) instead of hard -inf. This meant the model could attend to future tokens with a softened penalty rather than being fully masked. Training would appear normal (loss decreasing, gradients stable) but the model would learn to rely on information that is unavailable during autoregressive generation. The fix: separate the causal mask (hard -inf for future) from the distance penalty (soft negative values for past), using positions.unsqueeze(1) - positions.unsqueeze(0) with explicit clamp(min=0) on distances before applying masked_fill_.
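A corrected construction along the lines described, shown as a sketch (function name follows the text; the project implementation may differ):

```python
import torch

def build_alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head ALiBi bias with a hard causal mask: shape (n_heads, seq, seq)."""
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    # query index minus key index: positive for past keys, negative for future
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)
    # soft linear penalty on the past only (clamp removes negative distances)
    penalty = -slopes.view(n_heads, 1, 1) * dist.clamp(min=0)
    # hard -inf on strictly future positions: the causal mask proper
    future = dist < 0
    return penalty.masked_fill(future.unsqueeze(0), float("-inf"))
```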
12. INFRASTRUCTURE: MPS TO H100
Training was initiated on Apple M4 (MacBook Air, passive cooling) to validate the pipeline, then migrated to NVIDIA H100 NVL on RunPod for production training.
| Metric | M4 MacBook Air (MPS) | H100 NVL (CUDA) | Speedup |
|---|---|---|---|
| GPU | Apple M4 (integrated) | NVIDIA H100 NVL 95 GB | — |
| Precision | float32 | bfloat16 (autocast) | 2× throughput |
| torch.compile | Not supported | Enabled | ~30-50% speedup |
| Batch size | 4 × 8 accum = 32 | 128 × 1 = 128 | 4× tokens/step |
| Step time | ~3,500 ms | ~140 ms | 25× |
| Tokens/sec | ~5,000 | ~400,000 | 80× |
| VRAM used | ~8 GB (shared) | 72 GB / 95 GB | — |
| GPU utilization | ~100% (throttled to 80°C) | 97% at 53°C | — |
| Estimated total time | ~48 hours | ~2 hours | 24× |
| Estimated cost | Electricity only | ~$4.76 (RunPod, $2.38/hr) | — |
Migration changes
Three modifications were required to move from MPS to CUDA:
- Precision: float32 → bfloat16 via torch.amp.autocast. bfloat16 has the dynamic range of float32 (8 exponent bits) with the memory footprint of float16 (16 bits total), eliminating the need for loss scaling.
- Compilation: torch.compile(model) JIT-compiles the model graph into fused CUDA kernels, eliminating Python overhead and enabling kernel fusion across operations.
- Batch scaling: physical batch increased from 4 to 128; gradient accumulation removed. The H100’s 95 GB VRAM accommodates the full effective batch in a single forward pass.
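The resulting training step can be sketched as follows. This is an illustrative pattern, not the project's exact train.py; device_type="cpu" also works for local smoke tests, since bfloat16 autocast needs no GradScaler on either backend.

```python
import torch
import torch.nn.functional as F

def train_step(model, x, y, optimizer, device_type: str = "cuda") -> float:
    """One mixed-precision step: bf16 autocast, global-norm clipping, AdamW update."""
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at global norm 1.0
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()

# On CUDA, the model would additionally be wrapped once at startup:
# model = torch.compile(model)   # kernel fusion, ~30-50% faster
```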
13. V1 ROUND 1 RESULTS (50K STEPS, 440 MB)
COMPLETE — 50,000 steps, 2.0 hours, final loss 2.62. This section documents Round 1 on the ~440 MB curated subset. Round 2 on the full 22 GB corpus is documented in Section 14.
Loss curve
| Step | Loss | Perplexity | LR | Tokens/sec | Phase |
|---|---|---|---|---|---|
| 10 | 11.07 | ~64,000 | 6.00e-06 | 15,138 | Warmup (random guessing, loss ≈ ln(64000)) |
| 100 | 9.61 | ~15,000 | 6.00e-05 | 96K | Warmup (learning token frequencies) |
| 500 | 6.00 | ~403 | 3.00e-04 | 290K | Warmup complete; peak LR reached |
| 1,000 | 4.82 | ~124 | 3.00e-04 | 355K | Word boundaries, basic grammar |
| 2,000 | 3.93 | ~51 | 2.99e-04 | 399K | Common phrases, suffixes |
| 3,000 | 3.60 | ~37 | 2.98e-04 | 415K | Sentence fragments forming |
| 8,000 | 3.15 | ~23 | 2.85e-04 | 438K | Mathematical notation, equations |
| 10,000 | 3.12 | ~23 | 2.76e-04 | 441K | Wikipedia articles, proper nouns |
| 12,000 | 3.05 | ~21 | 2.66e-04 | 443K | Dictionary format, subordinate clauses |
| 19,000 | 2.94 | ~19 | 2.17e-04 | 446K | Encyclopedic titles, date suffixes |
| 30,000 | 2.78 | ~16 | 1.48e-04 | 449K | Cosine decay phase; diminishing returns |
| 50,000 | 2.62 | ~14 | 3.00e-05 | 451K | Final: filmographies, cultural references |
Generated samples across training
Three hardcoded prompts (“Merhaba”, “Türkiye”, “Stok”) were defined; every 1,000 steps one was selected at random and sampled at temperature 0.8 with top-k 40. No rules were programmed; all linguistic structure was learned from next-token prediction alone.
| Step | Prompt | Output | Observation |
|---|---|---|---|
| 1,000 | Stok | Stokon | Single fragment; knows token boundaries |
| 2,000 | Merhaba | Merhaba | Recognizes greeting; stops at EOS |
| 3,000 | Merhaba | Merhaba, 6. sezon. | First multi-token output; correct grammar and punctuation |
| 8,000 | Stok | Stokes) = 5.000, K + J - 3J = 15.000 | Mathematical notation from orca_math_tr.txt (28.8% of corpus) |
| 10,000 | Merhaba | Merhaba Dünya Kızı, Istanbul’a Gidiyor | Folk song register; correct apostrophe + dative suffix |
| 12,000 | Merhaba | Örnek: Gözlerini bu kadar beğenip, iyi bir şey sevdiğine de biraz daha âşık ol | TDK dictionary format; subordinate clauses with -dik participles |
| 19,000 | Türkiye | Türkiye’deki etnik gruplar, Moğolistan’ın Yahudi tarihi, 1901’de Türkiye | Wikipedia article titles; correct locative/possessive suffixes |
| 50,000 | Merhaba | Cary, “Deli Gömülü” (2001), Hymnogy, “İman İçin Bir Şey” (2002) | Filmography/discography entries with quoted titles and years |
14. V1 ROUND 2: FULL CORPUS (228K STEPS, 22 GB)
Round 1 demonstrated the pipeline worked and the model could learn Turkish from a ~440 MB subset. Round 2 scaled to the full 22 GB corpus — the same corpus used to train the 64K tokenizer in Phase 1 — across all 11 domains: general knowledge, academic, legal, medical, financial, education, news, code, literary, reasoning, and instructions.
Why Round 2?
Round 1 trained on 99.6M tokens (441 MB subset, 67.7% Wikipedia). The model learned grammar and basic vocabulary but lacked domain diversity. Round 2 exposed the model to legal Turkish (court decisions), medical terminology, financial reporting, news journalism, academic writing, and instruction-following patterns — preparing a more robust base for SFT specialization.
Training data: full corpus
| Domain | Sources | Size |
|---|---|---|
| General Knowledge | Wikipedia TR (520K articles, uncapped) | 866 MB |
| Academic/Thesis | BellaTurca AkademikDerlem (668K papers) | 3.5 GB |
| Cultural/Literary Web | BellaTurca ÖzenliDerlem (1.4M curated docs) | 4.4 GB |
| News/Journalism | 1.8M news articles + summarization corpus | 4.5 GB |
| Legal/Law | 700K court decisions + Constitutional Court | 3.7 GB |
| Instructions | 2.5M instruction-answer pairs | 3.7 GB |
| Code | Python corpus | 569 MB |
| Financial | KAP announcements, capital markets | 425 MB |
| Reasoning | Math problems, RAG, chain-of-thought | 221 MB |
| Medical | Medical reasoning + hospital articles | 108 MB |
| Education & Vocabulary | QA, MMLU exams, TDK dictionary, literature | ~100 MB |
| Total | 27 files, 11 domains | 22 GB |
Round 2 configuration changes
| Parameter | Round 1 | Round 2 | Rationale |
|---|---|---|---|
| Training data | 441 MB (8 files) | 22 GB (27 files) | Full corpus for maximum domain coverage |
| Max steps | 50,000 | 228,000 | Scaled proportionally to data volume |
| Batch size | 128 | 128 | Unchanged |
| Starting point | Random init | step_050000.pt | Resume from Round 1 checkpoint |
| Learning rate | 3e-4 → 3e-5 | 3e-4 → 3e-5 | Fresh cosine schedule from peak |
Loss curve: Round 2
| Step | Loss | LR | Tok/s | Sample Quality |
|---|---|---|---|---|
| 50,000 (R1 end) | 2.62 | 3.0e-05 | 451K | Filmographies, encyclopedic entries |
| 95,000 | ~3.60 | 2.0e-04 | 399K | Simple sentences, some repetition |
| 145,000 | ~3.50 | 1.1e-04 | 401K | Coherent multi-clause sentences |
| 195,000 | ~3.47 | 4.4e-05 | 402K | Factual: “Kocaeli’ndeyiz” |
| 200,000 | 3.39 | 4.0e-05 | 403K | Best loss — human rights discussion |
| 228,000 | 3.46 | 3.0e-05 | 403K | Final: real-world knowledge, correct grammar |
Sample evolution: Round 2
| Step | Sample Output | Quality |
|---|---|---|
| 95,000 | “Bu e-postayı seviyorum! Bu e-postanın amacı, bu e-postanın ana no…” | Repetitive, generic |
| 145,000 | “Türkiye’de ve dünyada ekonomik gelişmeler açısından büyük önem taşımaktadır” | Coherent, meaningful |
| 195,000 | “Türkiye’nin en büyük ikinci sanayi şehri konumundaki Kocaeli’ndeyiz” | Factual, specific, correct suffixes |
| 200,000 | “Türkiye’de tüm dünyada ‘insan hakları’ndan söz edildiği gibi…” | Complex topic, proper structure |
15. ROUND 2.5: 2048-CONTEXT FOR RAG (228K STEPS)
Round 2 trained with max_seq_len=512, inherited from the initial architecture.
However, the RAG use case requires processing system prompt + retrieved context chunk + user question
+ generated answer in a single sequence. Typical RAG prompts consume 600–1,300 tokens. A 512-token
model cannot serve RAG — so Round 2.5 retrained the same 24.7M model with 2048-token context.
Why Round 2.5?
Round 2’s 512-token context is a hard ceiling for downstream tasks. The SFT phase needs to fit:
- System prompt (~38 tokens): ERP sistemi asistanısın. Verilen bağlam bilgilerini kullanarak soruyu yanıtla... (“You are an ERP system assistant. Answer the question using the provided context...”)
- Retrieved context chunk (200–800 tokens): ERP documentation from the RAG retriever
- User question (20–60 tokens): natural Turkish query
- Generated answer (50–300 tokens): model’s response
Total: 308–1,198 tokens per turn. A 512-token model would truncate most inputs. ALiBi’s extrapolation property helps, but training at the target context length yields far better attention patterns. 2048 tokens provides comfortable headroom for even the longest multi-chunk RAG prompts.
Configuration changes from Round 2
| Parameter | Round 2 | Round 2.5 | Rationale |
|---|---|---|---|
| max_seq_len | 512 | 2048 | 4× context for RAG prompt fitting |
| dropout | 0.0 | 0.02 | Mild regularization; reduces repetitive generation |
| batch_size | 128 | 8 | 4× longer sequences = 4× more memory per sample |
| grad_accum_steps | 1 | 4 | Effective batch = 32 (8 × 4). Preserves batch scale. |
| Starting point | step_050000.pt | step_228000.pt | Resume from Round 2 final checkpoint |
| Learning rate | 3e-4 → 3e-5 | 3e-4 → 3e-5 | Fresh cosine schedule for context adaptation |
| Training data | 22 GB (27 files) | 22 GB (27 files) | Same corpus, now with 4× longer windows |
An initial attempt with batch_size=32 (the same effective batch as R2) triggered CUDA out-of-memory on the H100: the 4× sequence length quadrupled attention memory. Solution: reduce the per-GPU batch to 8 and compensate with 4-step gradient accumulation. Effective batch size stays at 32, but each step takes ~2.5× longer due to the longer sequences and accumulation overhead: 418 ms/step vs ~165 ms/step in Round 2.
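The accumulation loop can be sketched as follows (illustrative; the key detail is dividing each micro-batch loss by accum_steps so the summed gradients equal the full-batch mean gradient, which is exactly what test #60 of the suite validates):

```python
import torch
import torch.nn as nn

def accumulated_step(model, batches, optimizer, accum_steps: int = 4) -> float:
    """One optimizer step built from accum_steps micro-batches (e.g. 4 × 8 = 32)."""
    optimizer.zero_grad(set_to_none=True)
    total = 0.0
    for x, y in batches:
        # dividing by accum_steps makes the accumulated .grad equal the
        # gradient of the mean loss over the full effective batch
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()   # gradients accumulate in .grad across micro-batches
        total += loss.item()
    optimizer.step()
    return total
```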
Loss curve: Round 2.5
| Step | Loss | LR | Tok/s | Notes |
|---|---|---|---|---|
| 0 (R2 checkpoint) | ~3.46 | 3.0e-04 | 158K | Starting from R2 final, fresh LR schedule |
| ~50,000 | ~3.40 | 2.6e-04 | 158K | Model adapting to 4× context windows |
| ~100,000 | ~3.35 | 1.9e-04 | 158K | Steady improvement from longer-range attention |
| ~150,000 | ~3.28 | 1.2e-04 | 158K | Cross-sentence coherence improving |
| ~200,000 | 3.22 | 5.0e-05 | 158K | Best loss — long-range dependencies learned |
| 228,000 | 3.33 | 2.0e-05 | 158K | Final: LR at minimum, slight loss uptick |
The dropout=0.02 setting was motivated by repetitive generation patterns observed during R2 sampling. While the primary goal of this round was RAG context extension, the mild dropout also provides regularization for the subsequent SFT phase, where the small training set (thousands, not billions, of examples) creates overfitting risk. The 0.02 value was chosen conservatively: enough to break repetition loops without degrading pretraining quality.
16. V2 ARCHITECTURE: 67.6M RAG-OPTIMIZED MODEL
The v1 model (24.7M params) proved that the training pipeline, tokenizer, and infrastructure work. But 24.7M parameters is severely capacity-limited for a RAG assistant that must read context, understand questions, and generate coherent answers. The v2 architecture was designed from scratch with a single principle: this model is a context converter, not a knowledge base.
The design thesis: for context comprehension, width (larger d_model) matters more than depth (more layers).
Architecture comparison: v1 vs v2
| Parameter | v1 (24.7M) | v2 (67.6M) | Change |
|---|---|---|---|
| d_model | 256 | 512 | 2× representation width |
| n_layers | 12 | 12 | Same depth |
| n_heads | 8 | 8 | Same query heads |
| n_kv_heads | 2 | 4 | 4:1 → 2:1 GQA, richer attention diversity |
| head_dim | 32 | 64 | 2× per-head capacity (matches GPT-2, LLaMA standard) |
| d_ff | 688 | 1376 | 2× FFN capacity |
| max_seq_len | 512 | 2048 | Native RAG context length |
| dropout | 0.0 | 0.02 | Proven in R2.5 |
| Embedding params | 16.4M (66.3%) | 32.8M (48.5%) | Balanced, not vocabulary-heavy |
| Transformer params | 8.3M (33.7%) | 34.8M (51.5%) | 4.2× more compute capacity |
| Total params | 24.7M | 67.6M | 2.7× total, 4.2× transformer |
Parameter budget: v2
Per-layer breakdown (2,900,992 params)
| Component | Computation | Parameters |
|---|---|---|
| Q projection | 512 × 512 | 262,144 |
| K projection | 512 × 256 (4 KV heads × 64) | 131,072 |
| V projection | 512 × 256 | 131,072 |
| O projection | 512 × 512 | 262,144 |
| Attention subtotal | — | 786,432 |
| Gate (SwiGLU) | 512 × 1376 | 704,512 |
| Up (SwiGLU) | 512 × 1376 | 704,512 |
| Down (SwiGLU) | 1376 × 512 | 704,512 |
| FFN subtotal | — | 2,113,536 |
| Norms (attn + ffn) | 512 + 512 | 1,024 |
| Layer total | — | 2,900,992 |
The doubling of d_model from 256 to 512 was deliberate. For a context-conversion task, each layer needs enough representational capacity to attend over long RAG contexts (2048 tokens) and capture the relationship between question tokens and answer tokens scattered across the context. Wider layers with 64-dim attention heads (matching the LLaMA/GPT-2 standard) provide this. Adding more thin layers would increase depth but not per-layer comprehension — the wrong tradeoff for RAG.
V2 training configuration
| Parameter | v1 (R2/R2.5) | v2 | Rationale |
|---|---|---|---|
| Learning rate | 3e-4 → 3e-5 | 1.5e-4 → 1.5e-5 | Lower peak for larger model stability |
| Warmup steps | 500 | 2,000 | Longer warmup for 2.7× more parameters |
| Max steps | 228,000 | 228,000 | Same token budget (~14.9B tokens) |
| Batch size | 8 × 4 | 8 × 4 | Effective 32, same as R2.5 |
| Training data | 22 GB (27 files) | 22 GB (27 files) | Same corpus |
| Precision | bfloat16 | bfloat16 | H100 native |
| Compile | torch.compile | torch.compile | Kernel fusion for speed |
| Checkpoint dir | checkpoints_2048/ | checkpoints_v2/ | Separate from v1 |
17. REPRODUCIBILITY & EXPERIMENT LOG
Every file, command, and decision is documented below to enable exact reproduction and — equally important — to prevent re-running experiments that were already tried.
17.1 File inventory
| File | Size | Purpose |
|---|---|---|
| tiny_llm/config.py | 127 lines | V1 ModelConfig (24.7M) + TrainConfig (hyperparameters) |
| tiny_llm/config_v2.py | 130 lines | V2 ModelConfig (67.6M) + TrainConfig (RAG-optimized) |
| tiny_llm/model.py | 266 lines | V1 Transformer: ALiBi, GQA, SwiGLU, RMSNorm, weight tying |
| tiny_llm/model_v2.py | ~300 lines | V2 Transformer: same architecture, larger dimensions |
| tiny_llm/train.py | 283 lines | V1 pretraining loop (R1, R2, R2.5) |
| tiny_llm/train_v2.py | ~400 lines | V2 pretraining loop with bfloat16 + torch.compile |
| tiny_llm/train_sft_rag.py | ~500 lines | RAG-grounded SFT training (reads sft_raw_pairs.json) |
| tiny_llm/sft_data.py | ~130 lines | SFT dataset + assistant-only loss masking |
| tiny_llm/data.py | 173 lines | Data pipeline for R1: tokenize 8 curated files → 190 MB |
| tiny_llm/data_full.py | 183 lines | Streaming pipeline for R2+: tokenize 27 files (22 GB) |
| tiny_llm/test_everything.py | 764 lines | 60-test validation suite covering all v1 components |
| tiny_llm/generate.py | 108 lines | Text generation from trained checkpoint |
| erp_rag/generate/sft_generate.py | 466 lines | API-based SFT data generation (Claude/GPT) |
| erp_rag/data/sft_chunk_groups.json | 9.6K lines | Master grouping blueprint (707 groups, 11 rules) |
| tokenizers/turkish_bpe_64k/tokenizer.json | 4.7 MB | 64K BPE tokenizer (Phase 1 output) |
| data/processed/*.txt | 22 GB | 27 raw text files across 11 domains |
17.2 Checkpoint inventory
| Checkpoint | Size | Step | Loss | Round | Location |
|---|---|---|---|---|---|
| step_050000.pt | 283 MB | 50,000 | 2.62 | R1 final | runpod_backup/ |
| step_228000.pt | 283 MB | 228,000 | 3.46 | R2 final | runpod_backup/round2_checkpoints/ |
| step_228000.pt | 283 MB | 228,000 | 3.33 | R2.5 final | checkpoints_2048/ |
Note: all checkpoints were saved from a model wrapped in torch.compile(), which prefixes all state dict keys with _orig_mod.. When loading on a non-compiled model, strip this prefix: cleaned = {k.replace("_orig_mod.", ""): v for k, v in state_dict.items()}. This cost ~1 hour of debugging during SFT; do not repeat this mistake.
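A loading helper along these lines avoids the trap. This is a sketch: the {"model": state_dict} checkpoint layout is an assumption, so adjust the key to the actual saved format.

```python
import torch

def load_checkpoint(model: torch.nn.Module, path: str) -> torch.nn.Module:
    """Load a checkpoint saved from a torch.compile-wrapped model."""
    # Assumed layout: {"model": state_dict, ...}; adjust if the real format differs.
    state = torch.load(path, map_location="cpu")["model"]
    # torch.compile prefixes every key with "_orig_mod."; strip it so a plain,
    # non-compiled model accepts the weights
    cleaned = {k.removeprefix("_orig_mod."): v for k, v in state.items()}
    model.load_state_dict(cleaned)
    return model
```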
17.3 Reproduction commands
Step 1: Prepare Round 1 data (local)
python -m tiny_llm.data # tokenizes 8 files → tiny_llm/data/train.bin (190 MB, ~2.5 min)
Step 2: Run validation suite
python -m tiny_llm.test_everything # 60 tests, all must pass before training
Step 3: Round 1 training (RunPod H100)
# Upload project to RunPod, then:
python -m tiny_llm.train --resume None
# 50K steps, ~2.0 hours, loss → 2.62
# Config: batch=128, lr=3e-4→3e-5, bfloat16, torch.compile
Step 4: Prepare Round 2 data
python -m tiny_llm.data_full # tokenizes all 27 files (22 GB) → train_full.bin (streaming, ~15 min)
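The streaming idea behind this step can be sketched as follows: tokenize line by line and append ids to the .bin as unsigned 16-bit integers, which a 64K vocabulary fits exactly (ids 0–65,535). The toy whitespace tokenizer below is a stand-in for the real 64K BPE tokenizer, and the function name is illustrative, not data_full.py's actual API:

```python
from array import array

def stream_tokenize(files, encode, out_path, flush_every=1_000_000):
    """Append token ids from many text files into one .bin, flushing the
    in-memory buffer every `flush_every` tokens so RAM stays bounded."""
    buf = array("H")  # uint16: a 64K vocab (ids 0..65535) fits exactly
    with open(out_path, "wb") as out:
        for path in files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    buf.extend(encode(line))
                    if len(buf) >= flush_every:
                        buf.tofile(out)
                        del buf[:]
        buf.tofile(out)  # final partial buffer

# Stand-in tokenizer (hypothetical): whitespace split, ids assigned on sight.
_vocab = {}
def toy_encode(text):
    return [_vocab.setdefault(w, len(_vocab)) for w in text.split()]
```

Reading the result back is symmetric: array("H") plus frombytes(), or the usual np.memmap with dtype=np.uint16 that .bin training pipelines tend to use.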
Step 5: Round 2 training (RunPod H100)
nohup python -m tiny_llm.train \
--resume tiny_llm/checkpoints/step_050000.pt \
--max-steps 228000 \
--data tiny_llm/data/train_full.bin \
> training_round2.log 2>&1 & # 228K steps, ~10.5 hours, loss → 3.46
Step 6: Round 2.5 training — 2048 context (RunPod H100)
# Modify config: max_seq_len=2048, dropout=0.02, batch_size=8, grad_accum=4
nohup python -m tiny_llm.train \
    --resume tiny_llm/checkpoints/step_228000.pt \
    --max-steps 228000 \
    --data tiny_llm/data/train_full.bin \
    --checkpoint-dir tiny_llm/checkpoints_2048 \
    > training_r25.log 2>&1 &
# 228K steps, ~26.5 hours, loss → 3.33
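The OOM fix in Step 6 relies on gradient accumulation: 8 sequences per micro-batch × 4 accumulation steps reproduces the effective batch of 32 at a quarter of the activation memory. A framework-free sketch of the pattern, using a toy scalar model (everything here is illustrative, not the trainer's code):

```python
def accum_step(w, micro_batches, lr=0.1):
    """One optimizer update over several micro-batches: per-micro-batch
    gradients are summed, then scaled by 1/grad_accum, so the update equals
    a single step on one large batch of equal-sized micro-batches."""
    grad_accum = len(micro_batches)          # e.g. 4 in Round 2.5
    g = 0.0
    for batch in micro_batches:              # batch: list of (x, y) pairs
        # gradient of mean squared error (w*x - y)^2 over this micro-batch
        g += sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * g / grad_accum           # update once per cycle

# 2 micro-batches of 1 give the same update as 1 batch of 2:
small = accum_step(0.0, [[(1.0, 2.0)], [(2.0, 4.0)]])
big = accum_step(0.0, [[(1.0, 2.0), (2.0, 4.0)]])
print(small, big)
```

The equivalence holds only when micro-batches are equal-sized, which is why the trainer keeps batch_size fixed within an accumulation cycle.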
Step 7: V2 pretraining (RunPod H100)
nohup python -u -m tiny_llm.train_v2 \
> training_v2.log 2>&1 & # 228K steps, uses config_v2.py + model_v2.py
Step 8: Generate SFT data (local, API-based)
python -m erp_rag.generate.sft_generate \
--provider anthropic \
--model claude-sonnet-4-6 # 707 groups → ~8K-10K Q&A pairs, saves to sft_raw_pairs.json
17.4 Experiments tried & decisions locked
| Experiment | What Was Tried | Result | Decision |
|---|---|---|---|
| RoPE vs ALiBi | RoPE implemented and tested before ALiBi | Unstable long-context; requires scaling hacks | ALiBi — locked |
| MHA (8:8) vs GQA (8:2) | Both tested for parameter budget | GQA saves 1.18M params (4.8%) with near-MHA quality | GQA 4:1 — locked |
| Local training (MPS) | M4 MacBook, batch 4×8 accum, float32 | 3,500 ms/step; CPU hit 93°C; passive cooling throttle | RunPod H100 only — locked |
| Round 1: 50K steps, 440 MB | Curated subset (8 files, 67.7% Wikipedia) | Loss 2.62; grammar + basic vocabulary learned | Sufficient for pipeline validation; do not extend |
| Round 2: 228K steps, 22 GB | Full corpus, 27 files, 11 domains | Loss 3.46 (best 3.39 at 200K); loss still decreasing | Capacity-limited at 26M params; more steps = diminishing returns |
| Loss plateau observation | R2 loss ~3.60 at step 95K, ~3.46 at 228K | Only 0.14 improvement over 133K steps; 14.9B tokens ≈ 573 tokens per parameter, far past the Chinchilla-optimal ~20/param | Model is saturated; proceed to SFT, not more pretraining |
| Wikipedia cap (Round 1) | Capped Wikipedia at 300 MB (of 866 MB) | Prevented encyclopedic bias; 67.7% is already dominant | Uncapped in Round 2 (full corpus diversity dilutes bias) |
| Dropout during pretraining | dropout = 0.0 for both rounds | Model is underfitting (capacity-limited), not overfitting | dropout 0.0 for pretraining — locked. Dropout 0.05 for SFT only. |
| LR schedule fresh restart for R2 | Fresh cosine 3e-4 → 3e-5 from step 50K checkpoint | Loss continued decreasing; good decision | Do not resume with decayed LR; always fresh schedule for new data |
| Tokenizer versions | 8 tokenizer variants tested: 16K, 32K (v1/v2), 48K (v1/v2), 64K (v1/v3) | 64K v3 won (~14% fewer tokens than alternatives) | 64K v3 — locked. Do not retrain tokenizer. |
| Round 2.5: 2048 context | Same v1 model, max_seq_len 512→2048, dropout 0.02, batch 8×4 | Loss 3.33 (best 3.22), 0.17 improvement over R2 | Required for RAG. OOM fixed with batch reduction + grad accum. |
| V2 architecture: 67.6M | d_model 256→512, n_kv_heads 2→4, d_ff 688→1376 | 4.2× more transformer params, balanced embed ratio | Width over depth — locked. RAG context converter design. |
| SFT API model comparison | Pilot: Sonnet 4, GPT-5.2, Opus 4.6, Sonnet 4.6 on same 10 groups | Sonnet 4.6 best: diverse questions, deep inference, minimal repetition | Claude Sonnet 4.6 for all SFT data generation — locked. |
| SFT data: intentional wrong answers | Earlier experiment: inject incorrect answers mid-response for self-correction | Terrible results — overall quality degraded significantly | Never inject bad answers into SFT. Use DPO preference pairs if needed. |
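The 1.18M figure in the GQA row can be reproduced with back-of-the-envelope arithmetic over the K/V projection matrices alone. The sketch below assumes biasless projections, d_head = d_model / n_heads, and 12 transformer layers; the layer count is our inference (the value consistent with the table's numbers), not stated in this section:

```python
def kv_proj_params(d_model, n_heads, n_kv_heads):
    """Parameter count of one layer's K and V projection matrices
    (biasless, with d_head = d_model // n_heads)."""
    d_head = d_model // n_heads
    return 2 * d_model * (n_kv_heads * d_head)  # K + V

# v1 dims: d_model=256, 8 query heads; n_layers=12 is an assumption.
d_model, n_heads, n_layers = 256, 8, 12
mha = kv_proj_params(d_model, n_heads, n_heads)  # MHA 8:8
gqa = kv_proj_params(d_model, n_heads, 2)        # GQA 8:2 (4:1 ratio)
saved = (mha - gqa) * n_layers
print(f"{saved:,}")                   # 1,179,648 ~ the table's 1.18M
print(f"{saved / 24_697_088:.1%}")    # 4.8% of the 24.7M total
```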
17.5 Training logs
| Log File | Size | Contents |
|---|---|---|
| runpod_backup/training.log | 454 KB | Round 1: 5,422 lines, steps 0–50,000 |
| runpod_backup/training_round2.log | 2.0 MB | Round 2: steps 50,001–228,000, 10.5 hours |
| training_r25.log | ~4 MB | Round 2.5: steps 0–228,000, 26.5 hours, 2048 context |
| sft_generation.log | ~2 MB | SFT data generation: 707 API calls, pair counts, errors |
18. PROJECT STATUS
| Phase | Status | Key Result |
|---|---|---|
| Phase 1: Tokenizer | COMPLETE | 64K vocab, ~14% fewer tokens than Kumru/TabiBERT, ~2.7× vs GPT-4 |
| Phase 2: Architecture (v1) | COMPLETE | 24.7M params, ALiBi, GQA, SwiGLU, RMSNorm, weight tying; 60/60 tests |
| Phase 3a: Pretrain R1 | COMPLETE | 50K steps, 440 MB subset, loss 2.62, 2.0 hours, $4.76 |
| Phase 3b: Pretrain R2 | COMPLETE | 228K steps, 22 GB full corpus, 14.9B tokens, loss 3.46, 10.5 hours, $24.99 |
| Phase 3c: Pretrain R2.5 | COMPLETE | 228K steps, 2048 context, loss 3.33 (best 3.22), 26.5 hours, $63.08 |
| Phase 2b: Architecture (v2) | COMPLETE | 67.6M params, d_model=512, 2:1 GQA, 4.2× transformer capacity |
| Phase 3d: Pretrain V2 | IN PROGRESS | 228K steps on same 22 GB corpus, ~14.9B tokens |
| Phase 4a: SFT v1 (initial) | COMPLETE | 3,790 QA pairs, val loss 5.12 → 3.10, 100 seconds |
| Phase 4b: SFT data gen (v2) | COMPLETE | 707 groups, 11 rules, Claude Sonnet 4.6 — 7,595 QA pairs |
| Phase 4c: SFT training (v2) | IN PROGRESS | RAG-grounded SFT with 7,595 pairs on v2 model |
| Phase 5: RL (optional) | FUTURE | DPO or RLVR if SFT alone is insufficient |
The v2 model uses d_model=512, 2:1 GQA, and 4.2× more transformer parameters than v1, with width prioritized
over depth for context comprehension. A parallel SFT data generation pipeline using Claude Sonnet 4.6
produced 7,595 high-quality Turkish Q&A pairs from 532 ERP documentation chunks via 11 grouping
strategies. Total v1 pretraining cost: $92.83 (39 hours on H100 at $2.38/hour).
This report documents Phases 2 and 3 of an independent effort to build a Turkish language-native LLM from scratch. Phase 1 (Tokenizer) and Phase 4 (SFT) are documented separately.
© 2026 • Independent Research