
ARCHITECTURE & PRETRAINING

Phase 2 & 3 — Two Generations of Turkish Transformers: 24.7M (v1) & 67.6M (v2) from Scratch

February 2026 • Independent Research • IN PROGRESS

24.7M v1 parameters · 67.6M v2 parameters · 3 v1 training rounds · 3.22 best loss (v1, R2.5) · ~44.7B tokens (v1 total) · $92.83 v1 training cost
Abstract. This report documents the design, implementation, and pretraining of a 24.7M-parameter decoder-only Transformer for Turkish — built entirely from scratch in PyTorch without any pretrained weights or HuggingFace model code. The architecture incorporates five deliberate departures from convention: ALiBi positional encoding (rejecting RoPE), Grouped Query Attention at 4:1 ratio, SwiGLU activation, RMSNorm with pre-norm placement, and weight tying between embedding and output projection. Each decision is justified against specific alternatives with parameter-budget and stability considerations. A 60-test validation suite uncovered a critical causal mask bug in the ALiBi implementation prior to training — demonstrating the necessity of component-level verification in from-scratch implementations. Training proceeded in three rounds: Round 1 on a ~440 MB curated subset (50K steps, loss 2.62, 2.0 hours) established basic Turkish fluency; Round 2 on the full 22 GB corpus across 11 domains (228K steps, 14.9B token-reads, 10.5 hours) pushed the model to factual, correct Turkish at loss 3.46 (best 3.39); Round 2.5 retrained with 2048-token context for RAG compatibility (228K steps, loss 3.33, best 3.22, 26.5 hours). A second-generation architecture (v2, 67.6M params) was designed specifically for RAG: d_model=512, 2:1 GQA, 4.2× more transformer parameters, optimized as a context converter rather than a knowledge base. All rounds ran on NVIDIA H100 ($2.38/hour) with bfloat16 mixed precision and torch.compile. Total v1 pretraining cost: approximately $92.83.
Document structure. Sections 1–14 cover the v1 architecture (24.7M params) and its two initial training rounds. Section 15 documents the 2048-context extension (Round 2.5). Section 16 introduces the v2 architecture (67.6M params), a purpose-built RAG model. All architectural decisions in Sections 3–6 (ALiBi, GQA, SwiGLU, RMSNorm) carry forward to v2 — v2 scales the dimensions, not the design.

TABLE OF CONTENTS

1. Motivation: From Tokenizer to Model
2. V1 Architecture Overview
3. Positional Encoding: ALiBi (Not RoPE)
4. Grouped Query Attention
5. SwiGLU Feed-Forward Network
6. Normalization & Residual Design
7. V1 Parameter Budget Analysis
8. Training Corpus Curation
9. Data Pipeline
10. V1 Training Configuration
11. Validation Suite: 60 Paranoid Tests
12. Infrastructure: MPS to H100
13. V1 Round 1 Results
14. V1 Round 2: Full Corpus (22 GB)
15. Round 2.5: 2048-Context for RAG
16. V2 Architecture: 67.6M RAG-Optimized
17. Reproducibility & Experiment Log
18. Project Status

1. MOTIVATION: FROM TOKENIZER TO MODEL

Phase 1 produced a 64K Turkish BPE tokenizer that is ~14% more efficient than existing Turkish tokenizers and ~2.7× more efficient than GPT-4 on Turkish text. The natural question follows: can this tokenizer serve as the foundation for a purpose-built Turkish language model?

The objective is not to compete with production LLMs. The objective is threefold: (1) validate the tokenizer in an end-to-end training pipeline, (2) establish every component of the training infrastructure from scratch, and (3) produce models that demonstrably learn Turkish language patterns and can serve as domain-specific RAG assistants. This led to two generations: v1 (24.7M params) to prove the pipeline works, and v2 (67.6M params) purpose-built for RAG context comprehension.

64K TOKENIZER → V1: 24.7M (R1 + R2 + R2.5) → V1 SFT → V2: 67.6M PRETRAIN → V2 SFT → RLVR

2. V1 ARCHITECTURE OVERVIEW

The v1 model is a decoder-only Transformer with 24,697,088 parameters. Every component was selected against specific alternatives; no default was accepted without justification. All architectural decisions below (ALiBi, GQA, SwiGLU, RMSNorm, weight tying) carry forward to v2 — only the dimensions change.

| Component | Choice | Rejected Alternative | Rationale |
|---|---|---|---|
| Architecture | Decoder-only | Encoder-decoder, encoder-only | Autoregressive generation is the goal; an encoder stack adds unnecessary cross-attention |
| Position encoding | ALiBi | RoPE, learned, sinusoidal | Zero learned params; train-short-test-long generalization; RoPE unreliable at extrapolation |
| Attention | GQA (4:1) | MHA (8:8), MQA (8:1) | 75% KV parameter reduction vs MHA; retains multi-view capacity unlike MQA |
| FFN activation | SwiGLU | ReLU, GELU, GeGLU | Gated mechanism outperforms per parameter; 3 projections at (8/3)×d maintain the budget |
| Normalization | RMSNorm | LayerNorm, BatchNorm | ~10-15% faster per layer; mean subtraction is redundant in a pre-norm residual stack |
| Norm placement | Pre-norm | Post-norm | Unobstructed residual gradient path; stable training without careful LR tuning |
| Output projection | Weight tying | Separate lm_head | Saves 16.4M parameters (66% of model); embedding serves dual purpose |
| Linear layers | No bias | With bias | Bias terms add little alongside normalization; simplifies weight decay |
| Dropout | 0.0 | 0.1–0.3 | Pretraining goal is to absorb data, not regularize; underfitting is the risk |

Configuration

| Parameter | Value | Derivation |
|---|---|---|
| vocab_size | 64,000 | Phase 1 tokenizer; 64K × 256 = 16.4M embedding params |
| d_model | 256 | Minimum for head_dim=32 with 8 heads |
| n_layers | 12 | Depth over width at small scale (SmolLM2 finding) |
| n_heads | 8 | 8 attention patterns; head_dim = 256/8 = 32 |
| n_kv_heads | 2 | GQA 4:1 ratio; 4 query heads share each KV head |
| d_ff | 688 | ≈ (8/3) × 256; preserves FFN param budget with SwiGLU’s 3 matrices |
| max_seq_len | 512 | ALiBi generalizes beyond training length; 512 is memory-safe on consumer GPUs |
| dropout | 0.0 | No regularization during pretraining |
| weight_tying | true | Embedding = LM head; saves 16.4M params |

3. POSITIONAL ENCODING: ALiBi (NOT ROPE)

Positional encoding informs the model where tokens are in the sequence. The dominant approach in 2024–2026 is Rotary Position Embeddings (RoPE), used by Llama, Mistral, and Qwen. This work rejects RoPE in favor of ALiBi (Attention with Linear Biases, Press et al. 2022).

Why not RoPE

RoPE applies rotation matrices to query and key vectors based on position. While mathematically elegant, it exhibits two practical problems: (1) poor extrapolation beyond training length without ad-hoc scaling hacks (NTK-aware, YaRN, etc.), and (2) position-dependent rotations entangled with the attention computation, which are difficult to debug at small scale. Prior empirical observation on this project confirmed unstable long-context behavior with RoPE.

ALiBi mechanism

ALiBi adds a linear distance penalty directly to attention scores. No parameters are learned. Each attention head receives a different slope (geometric sequence), creating a spectrum from sharp local attention to broad distant attention.

| Property | ALiBi | RoPE | Learned | Sinusoidal |
|---|---|---|---|---|
| Learned parameters | 0 | 0 (fixed rotations) | seq_len × d_model | 0 |
| Length generalization | Train short, test long | Requires scaling hacks | Hard-capped | Degrades |
| Implementation complexity | Low (bias matrix) | Medium (rotations) | Low | Low |
| Industry adoption | MPT, BLOOM | Llama, Mistral, Qwen | GPT-2 | Original Transformer |

Slope computation

For n attention heads, slopes form a geometric sequence: slope_i = 2^(−8i/n) for i ∈ {1, ..., n}. With 8 heads, slopes range from 2^(−1) = 0.5 (sharp, local focus) to 2^(−8) ≈ 0.0039 (broad, distant attention).

| Head | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Slope | 0.5 | 0.25 | 0.125 | 0.0625 | 0.0313 | 0.0156 | 0.0078 | 0.0039 |
| Behavior | Strong local focus | | | Medium range | | | | Broad / distant |
BUG FOUND DURING VALIDATION: The initial ALiBi implementation computed relative distances with transposed indices, causing future positions to receive finite penalty values instead of hard -inf. The causal mask was effectively leaking — the model could attend to future tokens with a soft distance penalty rather than being blocked entirely. This would have produced a model that appears to train normally but fails catastrophically at inference (where future tokens are unavailable). The bug was caught by test #22 of the validation suite (Section 11) and corrected before training began.
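The corrected construction can be sketched as follows; the function names and tensor layout are illustrative, not the project's exact code. The key point is that the hard causal mask is applied after, and independently of, the soft distance penalty:

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence: slope_i = 2^(-8i/n) for i = 1..n
    return torch.tensor([2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)])

def build_alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    # distance[q, k] = q - k: how far key k lies behind query q (>= 0 for the past)
    dist = (pos.unsqueeze(1) - pos.unsqueeze(0)).clamp(min=0)      # (S, S)
    # Soft penalty for past tokens, one slope per head
    bias = -alibi_slopes(n_heads).view(n_heads, 1, 1) * dist       # (H, S, S)
    # Hard causal mask: future positions (k > q) get -inf, never a finite penalty
    future = pos.unsqueeze(0) > pos.unsqueeze(1)                   # True above the diagonal
    return bias.masked_fill(future.unsqueeze(0), float("-inf"))
```

The buggy version computed `pos.unsqueeze(0) - pos.unsqueeze(1)`, which swaps the query/key axes and turns the hard mask into a soft penalty on future tokens.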

4. GROUPED QUERY ATTENTION

Standard Multi-Head Attention (MHA) assigns independent Key and Value projections to each attention head. Grouped Query Attention (GQA, Ainslie et al. 2023) shares KV projections across groups of query heads, reducing memory and parameter cost without proportional quality loss.

| Variant | Q Heads | KV Heads | KV Params/Layer | Quality |
|---|---|---|---|---|
| MHA (8:8) | 8 | 8 | 131,072 | Maximum |
| GQA (8:2) | 8 | 2 | 32,768 | Near-MHA |
| MQA (8:1) | 8 | 1 | 16,384 | Degraded |

The 4:1 ratio saves ~98K parameters per layer × 12 layers = 1.18M parameters total versus MHA. At 24.7M total, this represents a 4.8% budget reallocation. Four query heads share each KV head via tensor repetition (_repeat_kv): the KV tensor is expanded along the head dimension without copying data.

Projection dimensions

| Projection | Shape | Parameters |
|---|---|---|
| Q (query) | 256 × 256 | 65,536 |
| K (key) | 256 × 64 | 16,384 |
| V (value) | 256 × 64 | 16,384 |
| O (output) | 256 × 256 | 65,536 |
| Total attention/layer | | 163,840 |

5. SwiGLU FEED-FORWARD NETWORK

The standard Transformer FFN uses two projections with a nonlinear activation: FFN(x) = W2 · σ(W1 · x). SwiGLU (Shazeer 2020) replaces this with a gated variant using three projections: FFN(x) = Wdown · (SiLU(Wgate · x) ⊙ Wup · x).

Parameter equivalence

Standard FFN with expansion factor 4 has 2 × d × 4d = 8d² parameters. SwiGLU uses 3 matrices: 3 × d × d_ff. Setting 3 × d × d_ff = 8d² gives d_ff = (8/3)d ≈ 683 for d = 256, rounded up to 688. Total FFN parameters per layer:

| FFN Type | Matrices | d_ff | Params/Layer |
|---|---|---|---|
| Standard (ReLU/GELU) | 2 | 1,024 | 524,288 |
| SwiGLU | 3 | 688 | 528,384 |

Nearly identical parameter budget (+0.8%), empirically superior activation function. The gating mechanism allows the network to selectively amplify or suppress information — a capability that standard FFNs lack.
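A minimal SwiGLU module at the v1 dimensions (d_model=256, d_ff=688); the LLaMA-style projection names are an assumption, not the project's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN: down(SiLU(gate(x)) * up(x)). Names are illustrative."""

    def __init__(self, d_model: int = 256, d_ff: int = 688):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)  # W_gate
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)    # W_up
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate) acts as a learned per-feature gate on the up projection
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```

At these dimensions the module has 3 × 256 × 688 = 528,384 parameters, matching the table above.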

6. NORMALIZATION & RESIDUAL DESIGN

RMSNorm over LayerNorm

LayerNorm (Ba et al. 2016) performs two operations: mean subtraction and variance normalization. RMSNorm (Zhang & Sennrich 2019) drops the mean subtraction entirely, computing only the root mean square. With 12 layers × 2 norms per layer = 24 norm operations per forward pass, the cumulative speedup is measurable. Each RMSNorm has exactly d_model = 256 learnable parameters (the scale vector).

Pre-norm residual structure

Each Transformer block follows the pattern:

x = x + Attention(RMSNorm(x))     # residual path is unobstructed
x = x + FFN(RMSNorm(x))           # gradient flows directly through addition

The alternative — post-norm — places the normalization after the residual addition: x = RMSNorm(x + Attention(x)). This creates a gradient bottleneck through the norm layer. Pre-norm eliminates this bottleneck, producing more stable training in deep networks without requiring careful learning rate tuning.
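The two pieces together can be sketched as below, with stand-in sublayers; class and attribute names are illustrative:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: no mean subtraction, one learnable scale vector."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # exactly `dim` parameters
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class Block(nn.Module):
    """Pre-norm residual block; attn/ffn stand in for the real sublayers."""

    def __init__(self, d_model: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn_norm, self.ffn_norm = RMSNorm(d_model), RMSNorm(d_model)
        self.attn, self.ffn = attn, ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x))  # residual path stays unnormalized
        x = x + self.ffn(self.ffn_norm(x))
        return x
```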

7. V1 PARAMETER BUDGET ANALYSIS

At 24.7M parameters, every allocation decision is visible. The embedding layer dominates — a structural consequence of pairing a large vocabulary (64K) with a small hidden dimension (256).

| Component | Parameters | % of Total | Notes |
|---|---|---|---|
| Token embedding (tied) | 16,384,000 | 66.3% | 64,000 × 256; shared with output projection |
| Attention (all layers) | 1,966,080 | 8.0% | 163,840 per layer × 12 |
| FFN / SwiGLU (all layers) | 6,340,608 | 25.7% | 528,384 per layer × 12 |
| RMSNorm (all layers + final) | 6,400 | <0.1% | 256 per norm × 25 norms |
| LM head | 0 | 0% | Weight tying: reuses embedding |
| Total | 24,697,088 | 100% | |
66.3% embedding (tied) · 25.7% FFN (SwiGLU) · 8.0% attention (GQA)
Observation: Embedding dominance (66.3%) is a known property of small language models. Without weight tying, the model would be 41.1M parameters — with 79.7% consumed by two embedding matrices. Weight tying halves this cost while enforcing a useful symmetry: the input representation and output prediction share the same vector space.
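The budget can be reproduced arithmetically; this helper is illustrative, not project code:

```python
def v1_param_budget(vocab: int = 64_000, d: int = 256, n_layers: int = 12,
                    n_heads: int = 8, n_kv: int = 2, d_ff: int = 688) -> int:
    """Reproduce the v1 parameter budget; weight tying makes the LM head free."""
    head_dim = d // n_heads                            # 32
    emb = vocab * d                                    # tied with the output projection
    attn = d * d + 2 * d * (n_kv * head_dim) + d * d   # Q, K, V, O (no biases)
    ffn = 3 * d * d_ff                                 # gate, up, down
    norms = (2 * n_layers + 1) * d                     # 2 per layer + final norm
    return emb + n_layers * (attn + ffn) + norms
```

Without tying, a separate lm_head adds another 64,000 × 256 matrix, bringing the total to ~41.1M.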

Weight initialization

All 2D weight matrices are initialized with Normal(0, 0.02). Output projections (o_proj, down_proj) are additionally scaled by 1/√(2 × n_layers) = 1/√24 ≈ 0.204 to prevent residual stream explosion through 12 layers of additive contributions. This follows the GPT-2 initialization scheme.
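A sketch of this initialization, assuming LLaMA-style projection names (`o_proj`, `down_proj`) as used elsewhere in this write-up; the project's actual loop may differ:

```python
import math
import torch
import torch.nn as nn

def init_weights(model: nn.Module, n_layers: int = 12) -> None:
    """Normal(0, 0.02) for 2D matrices; output projections scaled by 1/sqrt(2*n_layers)."""
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue  # norm scales (1D) keep their default of 1.0
        std = 0.02
        if "o_proj" in name or "down_proj" in name:
            # residual-stream scaling: 0.02 * 1/sqrt(24) ≈ 0.02 * 0.204 for 12 layers
            std *= 1.0 / math.sqrt(2 * n_layers)
        nn.init.normal_(p, mean=0.0, std=std)
```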

8. TRAINING CORPUS CURATION

The full Phase 1 corpus spans 22 GB across 11 domains. For pretraining a 24.7M model, a curated subset of ~440 MB (raw text) was selected. The selection criteria: (1) prioritize raw text over structured data, (2) include reasoning-oriented material, (3) exclude instruction-tuning data (reserved for SFT in Phase 4), (4) cap Wikipedia to prevent encyclopedic dominance.

| File | Raw Size | Tokens | % of Corpus | Selection Rationale |
|---|---|---|---|---|
| wikipedia_tr.txt | 310 MB (capped from 866 MB) | 67,407,762 | 67.7% | General knowledge, encyclopedic Turkish |
| orca_math_tr.txt | 117 MB | 28,648,305 | 28.8% | Mathematical reasoning (north-star objective) |
| tdk_full.txt | 9.5 MB | 2,398,611 | 2.4% | Official dictionary; proper word usage |
| turkish_folk_songs.txt | 2.3 MB | 653,972 | 0.7% | Colloquial/emotional Turkish |
| turkish_idioms_proverbs.txt | 1.6 MB | 369,462 | 0.4% | Idiomatic language, cultural knowledge |
| turkish_poems.txt | 0.4 MB | 100,141 | 0.1% | Literary Turkish, complex grammar |
| turkish_mmlu_exams.txt | 0.2 MB | 40,187 | <0.1% | Academic/exam format, broad vocabulary |
| literary_short_stories.txt | 0.1 MB | 18,596 | <0.1% | Narrative structure, dialogue |
| Total | ~441 MB | 99,637,036 | 100% | |

Excluded data in Round 1 (later used in Round 2)

| File | Size | R1 Exclusion Rationale | R2 Status |
|---|---|---|---|
| instruc_turca.txt | 3.7 GB | Instruction data; format learning before language learning is counterproductive | Included |
| rag_dataset_tr.txt | 31 MB | ERP-domain-specific; too narrow for general pretraining | Included |
| Wikipedia (remaining) | 556 MB | Capped at 310 MB to prevent encyclopedic bias | Uncapped |
| Academic, legal, medical… | ~16 GB | Not yet collected at time of Round 1 | Included |
Round 2 used the full corpus. All 22 GB / 27 files across 11 domains were included in Round 2 (Section 14). The curated subset approach worked well for Round 1 to validate the pipeline; Round 2 removed all data restrictions to maximize the model’s exposure to diverse Turkish text.
Data composition principle: Chinchilla scaling (Hoffmann et al. 2022) suggests ~20N optimal training tokens for an N-parameter model. For 24.7M parameters, optimal is ~500M tokens. The corpus contains 99.6M tokens; with 50K steps × 65,536 tokens/step = 3.27B token-reads, each token is seen ~33 times. This level of repetition is acceptable for a model at this scale given the diversity of 8 source types.

9. DATA PIPELINE

The pipeline converts raw text to training batches in two offline steps, followed by real-time random sampling during training.

Step 1: Tokenization (offline, one-time)

Raw .txt files are split into documents by double newline, filtered (minimum 20 characters, 5 tokens), wrapped with BOS/EOS markers, and tokenized with the 64K BPE tokenizer. The resulting integer sequence is stored as a contiguous uint16 binary file (190 MB). The choice of uint16 is exact: all 64,000 token IDs fit in the uint16 range (0–65,535), using exactly 2 bytes per token with no waste.

Step 2: Memory-mapped random sampling

The binary file is memory-mapped via numpy.memmap. Each training sample is drawn by selecting a random starting position and extracting a contiguous window of seq_len + 1 = 513 tokens. The first 512 tokens serve as input; the last 512 (shifted by one) serve as targets. This approach has no concept of epochs: with 99.6M tokens, nearly every offset is a valid window start, so the pool of distinct samples (~99.6M overlapping windows, or ~194K non-overlapping ones) vastly exceeds any practical training run.
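A minimal version of this sampling step; the file path, helper name, and defaults are illustrative:

```python
import numpy as np
import torch

def get_batch(bin_path: str, batch_size: int = 128, seq_len: int = 512):
    """Draw a random batch from a tokenized uint16 binary file."""
    # memmap reads lazily through the OS page cache; no full copy into RAM
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    starts = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    # each sample is a contiguous window of seq_len + 1 tokens
    windows = np.stack([data[s : s + seq_len + 1].astype(np.int64) for s in starts])
    x = torch.from_numpy(windows[:, :-1])  # input: first seq_len tokens
    y = torch.from_numpy(windows[:, 1:])   # target: same window shifted by one
    return x, y
```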

| Pipeline Stage | Input | Output | Time |
|---|---|---|---|
| Tokenize (64K BPE) | 441 MB raw text (8 files) | 190 MB uint16 binary | ~2.5 min |
| Memory map | 190 MB binary | Random access via OS page cache | Instant |
| Sample | Random offset | (512 input, 512 target) tensor pair | <1 ms |

10. V1 TRAINING CONFIGURATION

| Hyperparameter | Value | Justification |
|---|---|---|
| Optimizer | AdamW | Decoupled weight decay; industry standard for Transformers |
| β1, β2 | 0.9, 0.95 | β2=0.95 (not 0.999): faster adaptation to a changing gradient landscape |
| Learning rate | 3e-4 | Scaling law: smaller models tolerate higher LR |
| Min learning rate | 3e-5 | 10:1 ratio; prevents complete learning cessation in final steps |
| LR schedule | Cosine decay | More time at peak LR than linear; Chinchilla standard |
| Warmup | 500 steps | ~1% of training; stabilizes optimizer momentum estimates |
| Weight decay | 0.1 | Applied to 2D+ matrices only; norms exempt (their target is 1.0, not 0.0) |
| Gradient clipping | 1.0 (global norm) | Prevents catastrophic updates from gradient spikes; direction preserved |
| Max steps | 50,000 | 3.27B token-reads at batch=128; ~33 passes over corpus |
| Precision | bfloat16 (CUDA) / float32 (MPS) | bfloat16 has float32 range at float16 size; MPS lacks bfloat16 support |

Effective batch size

| Setting | MPS (MacBook M4) | CUDA (H100 NVL), Round 1 | CUDA (H100 NVL), Round 2 |
|---|---|---|---|
| Physical batch | 4 | 128 | 128 |
| Gradient accumulation | 8 | 1 | 1 |
| Effective batch | 32 | 128 | 128 |
| Tokens per step | 16,384 | 65,536 | 65,536 |
| Total steps | — | 50,000 | 228,000 |
| Total token-reads | — | 3.27B | 14.94B |

11. VALIDATION SUITE: 60 PARANOID TESTS

Before committing compute to training, every component of the pipeline was validated by a 60-test suite covering 13 categories. The suite was designed to catch the class of bugs that produce models which appear to train normally but fail silently.

| Category | Tests | What It Validates |
|---|---|---|
| 1. Tokenizer | 8 | Loading, vocab size, special tokens, Turkish encoding, roundtrip, uint16 safety |
| 2. Model Architecture | 6 | Parameter count, weight tying on/off, layer count, hidden dimension |
| 3. RMSNorm | 4 | Shape preservation, unit scale, parameter count, zero-input stability |
| 4. ALiBi | 7 | Geometric slopes, exact values, shape, causal mask integrity, diagonal, distance penalty, generalization |
| 5. GQA | 4 | Output shape, 4:1 ratio, projection shapes, causal independence |
| 6. SwiGLU | 3 | Shape, 3 projections, per-layer parameter count |
| 7. Transformer Block | 3 | Shape, residual connections, pre-norm ordering |
| 8. Full Model | 7 | Logits shape, loss shape, initial loss sanity, gradient flow, NaN detection, learning verification |
| 9. Generation | 4 | Token production, max_tokens, determinism, valid ID range |
| 10. Data Pipeline | 5 | File existence, sizes, tokenizer path, directory writability |
| 11. LR Schedule | 3 | Warmup ramp, cosine decay endpoint, monotonic decrease |
| 12. Device | 3 | MPS/CUDA availability, model execution, ALiBi transfer |
| 13. Numerical Stability | 3 | Edge-case IDs, full-length sequences, gradient accumulation equivalence |
| Total | 60 | All passed prior to training |
Critical bug caught by test #22 (ALiBi causal mask). The original implementation of build_alibi_bias computed relative distances as positions.unsqueeze(0) - positions.unsqueeze(1), which transposed the query/key axes. Future positions received the ALiBi distance penalty (e.g., -0.5) instead of hard -inf. This meant the model could attend to future tokens with a softened penalty rather than being fully masked. Training would appear normal (loss decreasing, gradients stable) but the model would learn to rely on information that is unavailable during autoregressive generation. The fix: separate the causal mask (hard -inf for future) from the distance penalty (soft negative values for past), using positions.unsqueeze(1) - positions.unsqueeze(0) with explicit clamp(min=0) on distances before applying masked_fill_.
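The invariant that test #22 enforces can be sketched as a standalone check; the helper name and (n_heads, S, S) tensor layout are assumptions, not the project's exact test:

```python
import torch

def check_causal_mask_integrity(bias: torch.Tensor) -> None:
    """Every future position must be exactly -inf (not a finite penalty);
    past and diagonal positions must stay finite. bias: (n_heads, S, S)."""
    n_heads, seq_len, _ = bias.shape
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    fut_vals = bias[:, future]  # boolean mask over the last two dims
    assert torch.isinf(fut_vals).all() and (fut_vals < 0).all(), \
        "future positions leaked a finite penalty"
    assert torch.isfinite(bias[:, ~future]).all(), "past/diagonal must stay finite"
```

Against the transposed-index bug, the first assertion fires immediately: future entries hold finite values like -0.5 instead of -inf.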

12. INFRASTRUCTURE: MPS TO H100

Training was initiated on Apple M4 (MacBook Air, passive cooling) to validate the pipeline, then migrated to NVIDIA H100 NVL on RunPod for production training.

| Metric | M4 MacBook Air (MPS) | H100 NVL (CUDA) | Speedup |
|---|---|---|---|
| GPU | Apple M4 (integrated) | NVIDIA H100 NVL 95 GB | |
| Precision | float32 | bfloat16 (autocast) | 2× throughput |
| torch.compile | Not supported | Enabled | ~30-50% speedup |
| Batch size | 4 × 8 accum = 32 | 128 × 1 = 128 | 4× tokens/step |
| Step time | ~3,500 ms | ~140 ms | 25× |
| Tokens/sec | ~5,000 | ~400,000 | 80× |
| VRAM used | ~8 GB (shared) | 72 GB / 95 GB | |
| GPU utilization | ~100% (throttled at 80°C) | 97% at 53°C | |
| Estimated total time | ~48 hours | ~2 hours | 24× |
| Estimated cost | Electricity only | ~$4.76 (RunPod, $2.38/hr) | |
25× step-time speedup · 400K tokens/sec (H100) · $4.76 Round 1 training cost

Migration changes

Three modifications were required to move from MPS to CUDA: (1) switching autocast precision from float32 to bfloat16, (2) enabling torch.compile, which MPS does not support, and (3) replacing the 4 × 8 gradient-accumulation batch with a physical batch of 128.
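A sketch of the CUDA-side training step combining these settings (bfloat16 autocast, global-norm clipping at 1.0, torch.compile); the model interface and names are illustrative, not the project's script:

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()  # bfloat16 autocast on CUDA; float32 elsewhere

def train_step(model: nn.Module, x: torch.Tensor, y: torch.Tensor, optimizer) -> float:
    """One optimizer step; assumes model(x) returns (B, S, V) logits."""
    ctx = (torch.autocast("cuda", dtype=torch.bfloat16) if use_cuda
           else torch.enable_grad())  # plain float32 path on MPS/CPU
    with ctx:
        logits = model(x)
        loss = nn.functional.cross_entropy(logits.flatten(0, 1), y.flatten())
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global-norm clip
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()

# On CUDA the model would additionally be wrapped once, before training:
# model = torch.compile(model)   # kernel fusion; not supported on MPS
```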

13. V1 ROUND 1 RESULTS (50K STEPS, 440 MB)

COMPLETE — 50,000 steps, 2.0 hours, final loss 2.62. This section documents Round 1 on the ~440 MB curated subset. Round 2 on the full 22 GB corpus is documented in Section 14.

Loss curve

| Step | Loss | Perplexity | LR | Tokens/sec | Phase |
|---|---|---|---|---|---|
| 10 | 11.07 | ~64,000 | 6.00e-06 | 15,138 | Warmup (random guessing, loss ≈ ln 64,000) |
| 100 | 9.61 | ~15,000 | 6.00e-05 | 96K | Warmup (learning token frequencies) |
| 500 | 6.00 | ~403 | 3.00e-04 | 290K | Warmup complete; peak LR reached |
| 1,000 | 4.82 | ~124 | 3.00e-04 | 355K | Word boundaries, basic grammar |
| 2,000 | 3.93 | ~51 | 2.99e-04 | 399K | Common phrases, suffixes |
| 3,000 | 3.60 | ~37 | 2.98e-04 | 415K | Sentence fragments forming |
| 8,000 | 3.15 | ~23 | 2.85e-04 | 438K | Mathematical notation, equations |
| 10,000 | 3.12 | ~23 | 2.76e-04 | 441K | Wikipedia articles, proper nouns |
| 12,000 | 3.05 | ~21 | 2.66e-04 | 443K | Dictionary format, subordinate clauses |
| 19,000 | 2.94 | ~19 | 2.17e-04 | 446K | Encyclopedic titles, date suffixes |
| 30,000 | 2.78 | ~16 | 1.48e-04 | 449K | Cosine decay phase; diminishing returns |
| 50,000 | 2.62 | ~14 | 3.00e-05 | 451K | Final: filmographies, cultural references |
2.62 final loss · ~14 final perplexity · 3.27B tokens processed · 2.0 hrs total training time
Observation: Loss decreased monotonically from 11.07 to 2.62; perplexity fell from ~64,000 (uniform guessing over the vocabulary) to ~14, meaning the model narrows 64,000 vocabulary tokens to ~14 plausible candidates at each position. The loss curve exhibits three distinct phases: rapid descent (steps 0–5K, dominated by frequency learning), steady descent (steps 5K–25K, grammatical and structural learning), and plateau (steps 25K–50K, diminishing returns from cosine decay and corpus repetition).

Generated samples across training

Outputs were sampled every 1,000 steps at temperature 0.8 with top-k 40, using one of three hardcoded prompts chosen at random each time: “Merhaba” (hello), “Türkiye”, and “Stok” (stock). No rules were programmed; all linguistic structure was learned from next-token prediction alone.

| Step | Prompt | Output | Observation |
|---|---|---|---|
| 1,000 | Stok | Stokon | Single fragment; knows token boundaries |
| 2,000 | Merhaba | Merhaba | Recognizes greeting; stops at EOS |
| 3,000 | Merhaba | Merhaba, 6. sezon. | First multi-token output; correct grammar and punctuation |
| 8,000 | Stok | Stokes) = 5.000, K + J - 3J = 15.000 | Mathematical notation from orca_math_tr.txt (28.8% of corpus) |
| 10,000 | Merhaba | Merhaba Dünya Kızı, Istanbul’a Gidiyor | Folk-song register; correct apostrophe + dative suffix |
| 12,000 | Merhaba | Örnek: Gözlerini bu kadar beğenip, iyi bir şey sevdiğine de biraz daha âşık ol | TDK dictionary format; subordinate clauses with -duğunu |
| 19,000 | Türkiye | Türkiye’deki etnik gruplar, Moğolistan’ın Yahudi tarihi, 1901’de Türkiye | Wikipedia article titles; correct locative/possessive suffixes |
| 50,000 | Merhaba | Cary, “Deli Gömülü” (2001), Hymnogy, “İman İçin Bir Şey” (2002) | Filmography/discography entries with quoted titles and years |
Emergent domain switching. The model produces qualitatively different output depending on (a) the prompt token and (b) the training stage. The same prompt “Merhaba” yields a greeting at step 2,000, a folk song at step 10,000, a dictionary entry at step 12,000, and a filmography at step 50,000. No explicit domain labels or curriculum were used. The behavior emerges purely from the statistical structure of the training data as encoded by the 64K tokenizer.

14. V1 ROUND 2: FULL CORPUS (228K STEPS, 22 GB)

Round 1 demonstrated the pipeline worked and the model could learn Turkish from a ~440 MB subset. Round 2 scaled to the full 22 GB corpus — the same corpus used to train the 64K tokenizer in Phase 1 — across all 11 domains: general knowledge, academic, legal, medical, financial, education, news, code, literary, reasoning, and instructions.

228K total steps · 14.9B tokens processed · 10.5h training time · 3.46 final loss · 3.39 best loss

Why Round 2?

Round 1 trained on 99.6M tokens (441 MB subset, 67.7% Wikipedia). The model learned grammar and basic vocabulary but lacked domain diversity. Round 2 exposed the model to legal Turkish (court decisions), medical terminology, financial reporting, news journalism, academic writing, and instruction-following patterns — preparing a more robust base for SFT specialization.

Training data: full corpus

| Domain | Sources | Size |
|---|---|---|
| General Knowledge | Wikipedia TR (520K articles, uncapped) | 866 MB |
| Academic/Thesis | BellaTurca AkademikDerlem (668K papers) | 3.5 GB |
| Cultural/Literary Web | BellaTurca ÖzenliDerlem (1.4M curated docs) | 4.4 GB |
| News/Journalism | 1.8M news articles + summarization corpus | 4.5 GB |
| Legal/Law | 700K court decisions + Constitutional Court | 3.7 GB |
| Instructions | 2.5M instruction-answer pairs | 3.7 GB |
| Code | Python corpus | 569 MB |
| Financial | KAP announcements, capital markets | 425 MB |
| Reasoning | Math problems, RAG, chain-of-thought | 221 MB |
| Medical | Medical reasoning + hospital articles | 108 MB |
| Education & Vocabulary | QA, MMLU exams, TDK dictionary, literature | ~100 MB |
| Total | 27 files, 11 domains | 22 GB |

Round 2 configuration changes

| Parameter | Round 1 | Round 2 | Rationale |
|---|---|---|---|
| Training data | 441 MB (8 files) | 22 GB (27 files) | Full corpus for maximum domain coverage |
| Max steps | 50,000 | 228,000 | Scaled proportionally to data volume |
| Batch size | 128 | 128 | Unchanged |
| Starting point | Random init | step_050000.pt | Resume from Round 1 checkpoint |
| Learning rate | 3e-4 → 3e-5 | 3e-4 → 3e-5 | Fresh cosine schedule from peak |

Loss curve: Round 2

| Step | Loss | LR | Tok/s | Sample Quality |
|---|---|---|---|---|
| 50,000 (R1 end) | 2.62 | 3.0e-05 | 451K | Filmographies, encyclopedic entries |
| 95,000 | ~3.60 | 2.0e-04 | 399K | Simple sentences, some repetition |
| 145,000 | ~3.50 | 1.1e-04 | 401K | Coherent multi-clause sentences |
| 195,000 | ~3.47 | 4.4e-05 | 402K | Factual: “Kocaeli’ndeyiz” |
| 200,000 | 3.39 | 4.0e-05 | 403K | Best loss; human-rights discussion |
| 228,000 | 3.46 | 3.0e-05 | 403K | Final: real-world knowledge, correct grammar |
Note on loss difference. Round 1 loss (2.62) and Round 2 loss (3.46) are not directly comparable. Round 1 trained on a ~440 MB curated subset dominated by Wikipedia; Round 2 trained on 22 GB spanning 11 domains including legal, medical, and financial text. The higher absolute loss reflects the much harder prediction task across diverse registers — not a regression. Sample outputs confirm dramatically improved language quality.

Sample evolution: Round 2

| Step | Sample Output | Quality |
|---|---|---|
| 95,000 | “Bu e-postayı seviyorum! Bu e-postanın amacı, bu e-postanın ana no…” | Repetitive, generic |
| 145,000 | “Türkiye’de ve dünyada ekonomik gelişmeler açısından büyük önem taşımaktadır” | Coherent, meaningful |
| 195,000 | “Türkiye’nin en büyük ikinci sanayi şehri konumundaki Kocaeli’ndeyiz” | Factual, specific, correct suffixes |
| 200,000 | “Türkiye’de tüm dünyada ‘insan hakları’ndan söz edildiği gibi…” | Complex topic, proper structure |
Scaling law observation. At 24.7M parameters, the v1 model is capacity-limited, not data-limited. Chinchilla-optimal training for 24.7M params is ~500M tokens (20× params); this model processed 14.9B tokens (~600× params). Loss was still decreasing at step 228K but with diminishing returns: the model had exhausted most of its representational capacity. This capacity ceiling directly motivated the v2 architecture (67.6M, Section 16): 4.2× more transformer parameters to break through the v1’s representational limit.
$4.76 Round 1 (2.0h × $2.38) · $24.99 Round 2 (10.5h × $2.38) · $29.75 R1+R2 subtotal
Not the end. Round 2.5 (2048-context extension, $63.08) and the v2 architecture (67.6M) follow. See Sections 15–16 for the continuation.

15. ROUND 2.5: 2048-CONTEXT FOR RAG (228K STEPS)

Round 2 trained with max_seq_len=512, inherited from the initial architecture. However, the RAG use case requires processing system prompt + retrieved context chunk + user question + generated answer in a single sequence. Typical RAG prompts consume 600–1,300 tokens. A 512-token model cannot serve RAG — so Round 2.5 retrained the same 24.7M model with 2048-token context.

228K total steps · ~14.9B tokens processed · 26.5h training time · 3.33 final loss · 3.22 best loss

Why Round 2.5?

Round 2’s 512-token context is a hard ceiling for downstream tasks. The SFT phase needs to fit the system prompt, the retrieved context chunk, the user question, and the generated answer in a single sequence, totaling 308–1,198 tokens per turn. A 512-token model would truncate most inputs. ALiBi’s extrapolation property helps, but training at the target context length yields far better attention patterns. 2048 tokens provides comfortable headroom for even the longest multi-chunk RAG prompts.

Configuration changes from Round 2

| Parameter | Round 2 | Round 2.5 | Rationale |
|---|---|---|---|
| max_seq_len | 512 | 2048 | 4× context for RAG prompt fitting |
| dropout | 0.0 | 0.02 | Mild regularization; reduces repetitive generation |
| batch_size | 128 | 8 | 4× longer sequences need far more memory per sample |
| grad_accum_steps | 1 | 4 | Effective batch = 32 (8 × 4); preserves Round 2’s 65,536 tokens per step |
| Starting point | step_050000.pt | step_228000.pt | Resume from Round 2 final checkpoint |
| Learning rate | 3e-4 → 3e-5 | 3e-4 → 3e-5 | Fresh cosine schedule for context adaptation |
| Training data | 22 GB (27 files) | 22 GB (27 files) | Same corpus, now with 4× longer windows |
OOM resolution. The initial attempt used batch_size=32, matching Round 2’s 65,536 tokens per step at the new 2048-token length, and triggered CUDA out-of-memory on the H100. The 4× sequence length quadrupled attention memory. Solution: reduce the physical batch to 8 and compensate with 4-step gradient accumulation. Effective batch size stays at 32, but each step takes ~2.5× longer due to the longer sequences and accumulation overhead: 418 ms/step vs ~165 ms/step in Round 2.

Loss curve: Round 2.5

| Step | Loss | LR | Tok/s | Notes |
|---|---|---|---|---|
| 0 (R2 checkpoint) | ~3.46 | 3.0e-04 | 158K | Starting from R2 final, fresh LR schedule |
| ~50,000 | ~3.40 | 2.6e-04 | 158K | Model adapting to 4× context windows |
| ~100,000 | ~3.35 | 1.9e-04 | 158K | Steady improvement from longer-range attention |
| ~150,000 | ~3.28 | 1.2e-04 | 158K | Cross-sentence coherence improving |
| ~200,000 | 3.22 | 5.0e-05 | 158K | Best loss; long-range dependencies learned |
| 228,000 | 3.33 | 2.0e-05 | 158K | Final: LR at minimum, slight loss uptick |
Loss improvement vs Round 2. Best loss dropped from 3.39 (R2) to 3.22 (R2.5) — a 0.17 improvement despite using the identical corpus and model. The gain comes entirely from longer context: the model now sees 4× more tokens per sample, learning cross-sentence dependencies and paragraph-level coherence that were invisible in 512-token windows. This confirms that context length was a binding constraint for this architecture.
Dropout effect. Adding dropout=0.02 was motivated by observed repetitive generation patterns during R2 sampling. While the primary goal was RAG context extension, the mild dropout provides regularization during the subsequent SFT phase where the small training set (thousands, not billions, of examples) creates overfitting risk. The 0.02 value was chosen conservatively — enough to break repetition loops without degrading pretraining quality.
$63.08 Round 2.5 (26.5h × $2.38) · $92.83 total v1 cost (R1+R2+R2.5) · 39h total v1 training time

16. V2 ARCHITECTURE: 67.6M RAG-OPTIMIZED MODEL

The v1 model (24.7M params) proved that the training pipeline, tokenizer, and infrastructure work. But 24.7M parameters is severely capacity-limited for a RAG assistant that must read context, understand questions, and generate coherent answers. The v2 architecture was designed from scratch with a single principle: this model is a context converter, not a knowledge base.

Design philosophy. A RAG model doesn’t need to memorize facts — the retriever provides them. What it needs is the ability to read provided context and transform it into accurate answers. This means maximizing attention quality and representation width, not raw parameter count. Width (larger d_model) matters more than depth (more layers) for context comprehension.

Architecture comparison: v1 vs v2

| Parameter | v1 (24.7M) | v2 (67.6M) | Change |
|---|---|---|---|
| d_model | 256 | 512 | 2× representation width |
| n_layers | 12 | 12 | Same depth |
| n_heads | 8 | 8 | Same query heads |
| n_kv_heads | 2 | 4 | 4:1 → 2:1 GQA; richer attention diversity |
| head_dim | 32 | 64 | 2× per-head capacity (matches GPT-2/LLaMA convention) |
| d_ff | 688 | 1376 | 2× FFN capacity |
| max_seq_len | 512 | 2048 | Native RAG context length |
| dropout | 0.0 | 0.02 | Proven in R2.5 |
| Embedding params | 16.4M (66.3%) | 32.8M (48.5%) | Balanced, not vocabulary-heavy |
| Transformer params | 8.3M (33.7%) | 34.8M (51.5%) | 4.2× more compute capacity |
| Total params | 24.7M | 67.6M | 2.7× total, 4.2× transformer |

Parameter budget: v2

32.8M embedding (48.5%) · 34.8M transformer (51.5%) · 2.9M per layer · 512 final norm

Per-layer breakdown (2,900,992 params)

| Component | Computation | Parameters |
| Q projection | 512 × 512 | 262,144 |
| K projection | 512 × 256 (4 KV heads × 64) | 131,072 |
| V projection | 512 × 256 | 131,072 |
| O projection | 512 × 512 | 262,144 |
| Attention subtotal | | 786,432 |
| Gate (SwiGLU) | 512 × 1376 | 704,512 |
| Up (SwiGLU) | 512 × 1376 | 704,512 |
| Down (SwiGLU) | 1376 × 512 | 704,512 |
| FFN subtotal | | 2,113,536 |
| Norms (attn + ffn) | 512 + 512 | 1,024 |
| Layer total | | 2,900,992 |
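The table can be re-derived in a few lines; this sketch assumes bias-free projections and a 64,000-entry embedding table tied to the output head (consistent with the counts above):

```python
def v2_layer_params(d_model=512, n_heads=8, n_kv_heads=4, d_ff=1376):
    """Recompute the per-layer budget from the breakdown table."""
    head_dim = d_model // n_heads        # 64
    kv_dim = n_kv_heads * head_dim       # 256
    attn = 2 * d_model * d_model         # Q and O projections: 262,144 each
    attn += 2 * d_model * kv_dim         # K and V projections: 131,072 each
    ffn = 3 * d_model * d_ff             # gate, up, down (down is d_ff x d_model, same count)
    norms = 2 * d_model                  # pre-attention and pre-FFN RMSNorm
    return attn + ffn + norms

per_layer = v2_layer_params()            # 2,900,992
transformer = 12 * per_layer + 512       # + final RMSNorm -> ~34.8M
embedding = 64_000 * 512                 # ~32.8M, tied with the output projection
total = transformer + embedding          # ~67.6M
```

Weight tying means the embedding table is counted once even though it serves both input lookup and output projection.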
Key design decision: width over depth. Keeping 12 layers (same as v1) while doubling d_model from 256 to 512 was deliberate. For a context-conversion task, each layer needs enough representational capacity to attend over long RAG contexts (2048 tokens) and capture the relationship between question tokens and answer tokens scattered across the context. Wider layers with 64-dim attention heads (matching LLaMA/GPT-2 standard) provide this. Adding more thin layers would increase depth but not per-layer comprehension — the wrong tradeoff for RAG.
GQA relaxation: 4:1 → 2:1. The v1 model used aggressive 4:1 GQA (8 query heads, 2 KV heads) to save parameters at 24.7M scale. With the v2 budget at 67.6M, this was relaxed to 2:1 (8 query heads, 4 KV heads). Each pair of query heads now shares its own KV head, enabling more diverse attention patterns. Doubling the KV heads costs an extra ~1.6M parameters across 12 layers (131,072 per layer for the wider K and V projections) but significantly improves the model’s ability to attend to different parts of the context simultaneously — critical for RAG, where the answer may depend on information scattered across the chunk.
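The sharing pattern can be made concrete: with G = n_heads / n_kv_heads query heads per group, query head q reads from KV head q // G. A minimal sketch (function name is illustrative):

```python
def kv_head_for_query(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Which KV head a given query head attends with under GQA."""
    group_size = n_heads // n_kv_heads
    return q_head // group_size

# v1 (4:1): 8 query heads share 2 KV heads -> groups of four
v1_map = [kv_head_for_query(q, 8, 2) for q in range(8)]  # [0, 0, 0, 0, 1, 1, 1, 1]
# v2 (2:1): 8 query heads share 4 KV heads -> pairs
v2_map = [kv_head_for_query(q, 8, 4) for q in range(8)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Going from groups of four to pairs halves the number of query heads forced to share one attention pattern over the 2048-token context.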

V2 training configuration

| Parameter | v1 (R2/R2.5) | v2 | Rationale |
| Learning rate | 3e-4 → 3e-5 | 1.5e-4 → 1.5e-5 | Lower peak for larger-model stability |
| Warmup steps | 500 | 2,000 | Longer warmup for 2.7× more parameters |
| Max steps | 228,000 | 228,000 | Same token budget (~14.9B tokens) |
| Batch size | 8 × 4 | 8 × 4 | Effective 32, same as R2.5 |
| Training data | 22 GB (27 files) | 22 GB (27 files) | Same corpus |
| Precision | bfloat16 | bfloat16 | H100 native |
| Compile | torch.compile | torch.compile | Kernel fusion for speed |
| Checkpoint dir | checkpoints_2048/ | checkpoints_v2/ | Separate from v1 |
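The schedule in the v2 column can be sketched as linear warmup followed by cosine decay to the floor; this is a standard shape and an assumption about train_v2.py, not a copy of it:

```python
import math

def lr_at(step, max_steps=228_000, warmup=2_000, peak=1.5e-4, floor=1.5e-5):
    """Linear warmup to peak, then cosine decay to floor (v2 values above)."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (max_steps - warmup)  # 0.0 -> 1.0
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

The curve hits the 1.5e-4 peak at step 2,000 and lands on the 1.5e-5 floor at step 228,000.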
Chinchilla analysis for v2. At 67.6M parameters, Chinchilla-optimal training is ~1.35B tokens (20× params). Our 228K-step schedule processes ~14.9B tokens (~220× params), far beyond Chinchilla-optimal. This is intentional: the model will be fine-tuned for a specific RAG task, so over-training on diverse Turkish text builds stronger linguistic foundations than stopping at the compute-optimal point. The full 22 GB corpus provides enough variety to prevent memorization even at 220× over-training.
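The arithmetic behind these figures, assuming the effective batch of 32 sequences at 2048 tokens from the configuration table:

```python
# Token budget: 228K steps x effective batch 32 (8 x 4 grad accum) x 2048-token sequences
tokens = 228_000 * 8 * 4 * 2048      # ~14.9B token-reads
chinchilla = 20 * 67_600_000         # ~1.35B compute-optimal tokens for 67.6M params
ratio = tokens / 67_600_000          # ~221x params, the ~220x over-training figure above
```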
V1: 24.7M → R1: 50K steps → R2: 228K steps → R2.5: 2048 ctx → V2: 67.6M pretrain → Deploy

17. REPRODUCIBILITY & EXPERIMENT LOG

Every file, command, and decision is documented below to enable exact reproduction and — equally important — to prevent re-running experiments that were already tried.

17.1 File inventory

| File | Size | Purpose |
| tiny_llm/config.py | 127 lines | V1 ModelConfig (24.7M) + TrainConfig (hyperparameters) |
| tiny_llm/config_v2.py | 130 lines | V2 ModelConfig (67.6M) + TrainConfig (RAG-optimized) |
| tiny_llm/model.py | 266 lines | V1 Transformer: ALiBi, GQA, SwiGLU, RMSNorm, weight tying |
| tiny_llm/model_v2.py | ~300 lines | V2 Transformer: same architecture, larger dimensions |
| tiny_llm/train.py | 283 lines | V1 pretraining loop (R1, R2, R2.5) |
| tiny_llm/train_v2.py | ~400 lines | V2 pretraining loop with bfloat16 + torch.compile |
| tiny_llm/train_sft_rag.py | ~500 lines | RAG-grounded SFT training (reads sft_raw_pairs.json) |
| tiny_llm/sft_data.py | ~130 lines | SFT dataset + assistant-only loss masking |
| tiny_llm/data.py | 173 lines | Data pipeline for R1: tokenize 8 curated files → 190 MB |
| tiny_llm/data_full.py | 183 lines | Streaming pipeline for R2+: tokenize 27 files (22 GB) |
| tiny_llm/test_everything.py | 764 lines | 60-test validation suite covering all v1 components |
| tiny_llm/generate.py | 108 lines | Text generation from a trained checkpoint |
| erp_rag/generate/sft_generate.py | 466 lines | API-based SFT data generation (Claude/GPT) |
| erp_rag/data/sft_chunk_groups.json | 9.6K lines | Master grouping blueprint (707 groups, 11 rules) |
| tokenizers/turkish_bpe_64k/tokenizer.json | 4.7 MB | 64K BPE tokenizer (Phase 1 output) |
| data/processed/*.txt | 22 GB | 27 raw text files across 11 domains |
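One entry above is worth expanding: the assistant-only loss masking in sft_data.py. A minimal sketch of the idea (function name and token IDs are illustrative; -100 is PyTorch's default ignore_index for cross-entropy):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy default ignore_index

def mask_prompt_tokens(input_ids, assistant_start):
    """Copy input_ids to labels, but suppress loss on everything
    before the assistant's answer (context + question tokens)."""
    labels = list(input_ids)
    for i in range(min(assistant_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy sequence: 5 prompt/context tokens followed by a 3-token answer
labels = mask_prompt_tokens([11, 12, 13, 14, 15, 21, 22, 23], assistant_start=5)
# -> [-100, -100, -100, -100, -100, 21, 22, 23]
```

Masking the prompt keeps gradient signal focused on answer generation, so the model is never trained to reproduce retrieved context verbatim.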

17.2 Checkpoint inventory

| Checkpoint | Size | Step | Loss | Round | Location |
| step_050000.pt | 283 MB | 50,000 | 2.62 | R1 final | runpod_backup/ |
| step_228000.pt | 283 MB | 228,000 | 3.46 | R2 final | runpod_backup/round2_checkpoints/ |
| step_228000.pt | 283 MB | 228,000 | 3.33 | R2.5 final | checkpoints_2048/ |
Checkpoint format note. RunPod checkpoints were saved under torch.compile(), which prefixes all state dict keys with _orig_mod.. When loading on a non-compiled model, strip this prefix: cleaned = {k.replace("_orig_mod.", ""): v for k, v in state_dict.items()}. This cost ~1 hour of debugging during SFT — do not repeat this mistake.
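A helper for the fix above (pure Python; the checkpoint layout in the commented usage is an assumption, as the exact dict structure is project-specific):

```python
def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Remove the torch.compile() key prefix so a compiled-model
    checkpoint loads into a non-compiled model."""
    return {k.removeprefix(prefix): v for k, v in state_dict.items()}

# Usage sketch (assumes torch; checkpoint layout is project-specific):
#   state_dict = torch.load("step_228000.pt", map_location="cpu")
#   model.load_state_dict(strip_compile_prefix(state_dict))
```

str.removeprefix only strips when the prefix is present, so the helper is safe on checkpoints saved without torch.compile().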

17.3 Reproduction commands

Step 1: Prepare Round 1 data (local)

python -m tiny_llm.data                    # tokenizes 8 files → tiny_llm/data/train.bin (190 MB, ~2.5 min)

Step 2: Run validation suite

python -m tiny_llm.test_everything          # 60 tests, all must pass before training

Step 3: Round 1 training (RunPod H100)

# Upload project to RunPod, then:
python -m tiny_llm.train --resume None      # 50K steps, ~2.0 hours, loss → 2.62
                                             # Config: batch=128, lr=3e-4→3e-5, bfloat16, torch.compile

Step 4: Prepare Round 2 data

python -m tiny_llm.data_full                # tokenizes all 27 files (22 GB) → train_full.bin (streaming, ~15 min)

Step 5: Round 2 training (RunPod H100)

nohup python -m tiny_llm.train \
    --resume tiny_llm/checkpoints/step_050000.pt \
    --max-steps 228000 \
    --data tiny_llm/data/train_full.bin \
    > training_round2.log 2>&1 &           # 228K steps, ~10.5 hours, loss → 3.46

Step 6: Round 2.5 training — 2048 context (RunPod H100)

# Modify config: max_seq_len=2048, dropout=0.02, batch_size=8, grad_accum=4
nohup python -m tiny_llm.train \
    --resume tiny_llm/checkpoints/step_228000.pt \
    --max-steps 228000 \
    --data tiny_llm/data/train_full.bin \
    --checkpoint-dir tiny_llm/checkpoints_2048 \
    > training_r25.log 2>&1 &              # 228K steps, ~26.5 hours, loss → 3.33

Step 7: V2 pretraining (RunPod H100)

nohup python -u -m tiny_llm.train_v2 \
    > training_v2.log 2>&1 &               # 228K steps, uses config_v2.py + model_v2.py

Step 8: Generate SFT data (local, API-based)

python -m erp_rag.generate.sft_generate \
    --provider anthropic \
    --model claude-sonnet-4-6                # 707 groups → ~8K-10K Q&A pairs, saves to sft_raw_pairs.json

17.4 Experiments tried & decisions locked

DO NOT RE-TRY: The following experiments and configurations were already evaluated. They are documented here to prevent wasting compute on repeating them.
| Experiment | What Was Tried | Result | Decision |
| RoPE vs ALiBi | RoPE implemented and tested before ALiBi | Unstable long-context; requires scaling hacks | ALiBi — locked |
| MHA (8:8) vs GQA (8:2) | Both tested for parameter budget | GQA saves 1.18M params (4.8%) with near-MHA quality | GQA 4:1 — locked |
| Local training (MPS) | M4 MacBook, batch 4×8 accum, float32 | 3,500 ms/step; CPU hit 93°C; passive-cooling throttle | RunPod H100 only — locked |
| Round 1: 50K steps, 440 MB | Curated subset (8 files, 67.7% Wikipedia) | Loss 2.62; grammar + basic vocabulary learned | Sufficient for pipeline validation; do not extend |
| Round 2: 228K steps, 22 GB | Full corpus, 27 files, 11 domains | Loss 3.46 (best 3.39 at 200K); loss still decreasing | Capacity-limited at 24.7M params; more steps = diminishing returns |
| Loss plateau observation | R2 loss ~3.60 at step 95K, ~3.46 at 228K | 0.14 improvement over 133K steps at 573× Chinchilla-optimal | Model is saturated; proceed to SFT, not more pretraining |
| Wikipedia cap (Round 1) | Capped Wikipedia at 300 MB (of 866 MB) | Prevented encyclopedic bias; 67.7% is already dominant | Uncapped in Round 2 (full-corpus diversity dilutes bias) |
| Dropout during pretraining | dropout = 0.0 for both rounds | Model is underfitting (capacity-limited), not overfitting | dropout 0.0 for pretraining — locked; dropout 0.05 for SFT only |
| LR schedule fresh restart for R2 | Fresh cosine 3e-4 → 3e-5 from the step-50K checkpoint | Loss continued decreasing; good decision | Do not resume with decayed LR; always use a fresh schedule for new data |
| Tokenizer versions | 8 tokenizer variants tested: 16K, 32K (v1/v2), 48K (v1/v2), 64K (v1/v3) | 64K v3 won (~14% fewer tokens than alternatives) | 64K v3 — locked; do not retrain tokenizer |
| Round 2.5: 2048 context | Same v1 model, max_seq_len 512→2048, dropout 0.02, batch 8×4 | Loss 3.33 (best 3.22), 0.17 improvement over R2 | Required for RAG; OOM fixed with batch reduction + grad accum |
| V2 architecture: 67.6M | d_model 256→512, n_kv_heads 2→4, d_ff 688→1376 | 4.2× more transformer params, balanced embedding ratio | Width over depth — locked; RAG context-converter design |
| SFT API model comparison | Pilot: Sonnet 4, GPT-5.2, Opus 4.6, Sonnet 4.6 on same 10 groups | Sonnet 4.6 best: diverse questions, deep inference, minimal repetition | Claude Sonnet 4.6 for all SFT data generation — locked |
| SFT data: intentional wrong answers | Injected incorrect answers mid-response to teach self-correction | Terrible results — overall quality degraded significantly | Never inject bad answers into SFT; use DPO preference pairs if needed |

17.5 Training logs

| Log File | Size | Contents |
| runpod_backup/training.log | 454 KB | Round 1: 5,422 lines, steps 0–50,000 |
| runpod_backup/training_round2.log | 2.0 MB | Round 2: steps 50,001–228,000, 10.5 hours |
| training_r25.log | ~4 MB | Round 2.5: steps 0–228,000, 26.5 hours, 2048 context |
| sft_generation.log | ~2 MB | SFT data generation: 707 API calls, pair counts, errors |
Scaling verdict. The v1 (24.7M) model confirmed it is capacity-bound: three training rounds (R1 + R2 + R2.5) on 22 GB of data pushed loss to 3.22 but with severely diminishing returns. The v2 architecture (67.6M) addresses this with 4.2× more transformer parameters while reusing the identical corpus and training pipeline — no data preparation changes required.

18. PROJECT STATUS

| Phase | Status | Key Result |
| Phase 1: Tokenizer | COMPLETE | 64K vocab, ~14% fewer tokens than Kumru/TabiBERT, ~2.7× vs GPT-4 |
| Phase 2: Architecture (v1) | COMPLETE | 24.7M params; ALiBi, GQA, SwiGLU, RMSNorm, weight tying; 60/60 tests |
| Phase 3a: Pretrain R1 | COMPLETE | 50K steps, 440 MB subset, loss 2.62, 2.0 hours, $4.76 |
| Phase 3b: Pretrain R2 | COMPLETE | 228K steps, 22 GB full corpus, 14.9B tokens, loss 3.46, 10.5 hours, $24.99 |
| Phase 3c: Pretrain R2.5 | COMPLETE | 228K steps, 2048 context, loss 3.33 (best 3.22), 26.5 hours, $63.08 |
| Phase 2b: Architecture (v2) | COMPLETE | 67.6M params, d_model=512, 2:1 GQA, 4.2× transformer capacity |
| Phase 3d: Pretrain V2 | IN PROGRESS | 228K steps on the same 22 GB corpus, ~14.9B tokens |
| Phase 4a: SFT v1 (initial) | COMPLETE | 3,790 QA pairs, val loss 5.12 → 3.10, 100 seconds |
| Phase 4b: SFT data gen (v2) | COMPLETE | 707 groups, 11 rules, Claude Sonnet 4.6 — 7,595 QA pairs |
| Phase 4c: SFT training (v2) | IN PROGRESS | RAG-grounded SFT with 7,595 pairs on the v2 model |
| Phase 5: RL (optional) | FUTURE | DPO or RLVR if SFT alone is insufficient |
$92.83
TOTAL V1 COST (3 ROUNDS)
39h
V1 TRAINING TIME
~44.7B
TOKENS PROCESSED (V1)
2
MODEL ARCHITECTURES
Conclusion. Two generations of Turkish Transformer have been designed, validated, and pretrained from scratch. The v1 architecture (24.7M params) was trained across three rounds: Round 1 (50K steps, 440 MB) established basic Turkish fluency at loss 2.62; Round 2 (228K steps, 22 GB, 11 domains) pushed to factual, correct Turkish at loss 3.46; Round 2.5 (228K steps, 2048-token context) adapted for RAG at loss 3.33 (best 3.22). A 60-test validation suite caught a critical causal mask bug before any training began. The v2 architecture (67.6M params) was designed as a purpose-built RAG context converter: d_model=512, 2:1 GQA, 4.2× more transformer parameters, with width prioritized over depth for context comprehension. A parallel SFT data generation pipeline using Claude Sonnet 4.6 produced 7,595 high-quality Turkish Q&A pairs from 532 ERP documentation chunks via 11 grouping strategies. Total v1 pretraining cost: $92.83 (39 hours on H100 at $2.38/hour).

This report documents Phases 2 and 3 of an independent effort to build a Turkish language-native LLM from scratch. Phase 1 (Tokenizer) and Phase 4 (SFT) are documented separately.

© 2026 • Independent Research