ARCHITECTURE & PRETRAINING
Phase 2 & 3 — Two Generations of Turkish Transformers: 24.7M (v1) & 67.6M (v2) from Scratch
d_model=512, 2:1 GQA, 4.2× more transformer parameters, optimized as a context
converter rather than a knowledge base. All rounds ran on NVIDIA H100 ($2.38/hour) with bfloat16
mixed precision and torch.compile. Total v1 pretraining cost: approximately $92.83.
1. MOTIVATION: FROM TOKENIZER TO MODEL
Phase 1 produced a 64K Turkish BPE tokenizer that is ~14% more efficient than existing Turkish tokenizers and ~2.7× more efficient than GPT-4 on Turkish text. The natural question follows: can this tokenizer serve as the foundation for a purpose-built Turkish language model?
The objective is not to compete with production LLMs. The objective is threefold: (1) validate the tokenizer in an end-to-end training pipeline, (2) establish every component of the training infrastructure from scratch, and (3) produce models that demonstrably learn Turkish language patterns and can serve as domain-specific RAG assistants. This led to two generations: v1 (24.7M params) to prove the pipeline works, and v2 (67.6M params) purpose-built for RAG context comprehension.
2. V1 ARCHITECTURE OVERVIEW
The v1 model is a decoder-only Transformer with 24,697,088 parameters. Every component was selected against specific alternatives; no default was accepted without justification. All architectural decisions below (ALiBi, GQA, SwiGLU, RMSNorm, weight tying) carry forward to v2 — only the dimensions change.
| Component | Choice | Rejected Alternative | Rationale |
|---|---|---|---|
| Architecture | Decoder-only | Encoder-decoder, Encoder-only | Autoregressive generation is the goal; encoder stack adds unnecessary cross-attention |
| Position encoding | ALiBi | RoPE, Learned, Sinusoidal | Zero learned params; train-short-test-long generalization; RoPE unreliable at extrapolation |
| Attention | GQA (4:1) | MHA (8:8), MQA (8:1) | 75% KV parameter reduction vs MHA; retains multi-view capacity unlike MQA |
| FFN activation | SwiGLU | ReLU, GELU, GeGLU | Gating outperforms on a per-parameter basis; 3 projections at (8/3)×d maintain the budget |
| Normalization | RMSNorm | LayerNorm, BatchNorm | ~10-15% faster per layer; mean subtraction is redundant with pre-norm residual |
| Norm placement | Pre-norm | Post-norm | Unobstructed residual gradient path; stable training without careful LR tuning |
| Output projection | Weight tying | Separate lm_head | Saves 16.4M parameters (66% of model); embedding serves dual purpose |
| Linear layers | No bias | With bias | Redundant with RMSNorm re-centering; simplifies weight decay |
| Dropout | 0.0 | 0.1–0.3 | Pretraining goal is to absorb data, not regularize; underfitting is the risk |
Configuration
| Parameter | Value | Derivation |
|---|---|---|
| vocab_size | 64,000 | Phase 1 tokenizer; 64K × 256 = 16.4M embedding params |
| d_model | 256 | Minimum for head_dim=32 with 8 heads |
| n_layers | 12 | Depth over width at small scale (SmolLM2 finding) |
| n_heads | 8 | 8 attention patterns; head_dim = 256/8 = 32 |
| n_kv_heads | 2 | GQA 4:1 ratio; 4 query heads share each KV head |
| d_ff | 688 | ≈ (8/3) × 256; preserves FFN param budget with SwiGLU’s 3 matrices |
| max_seq_len | 512 | ALiBi generalizes beyond training length; 512 is memory-safe on consumer GPU |
| dropout | 0.0 | No regularization during pretraining |
| weight_tying | true | Embedding = LM head; saves 16.4M params |
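The configuration and the resulting parameter count can be cross-checked with a short script. This is a minimal sketch (field names are illustrative; the actual tiny_llm/config.py may differ), reproducing the 24,697,088 total from the budget table in Section 7:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values from the v1 configuration table; field names are illustrative.
    vocab_size: int = 64_000
    d_model: int = 256
    n_layers: int = 12
    n_heads: int = 8
    n_kv_heads: int = 2
    d_ff: int = 688
    max_seq_len: int = 512

def count_params(cfg: ModelConfig) -> int:
    """Reproduce the 24,697,088 total from the component sizes."""
    head_dim = cfg.d_model // cfg.n_heads                  # 32
    embedding = cfg.vocab_size * cfg.d_model               # 16,384,000 (tied LM head adds nothing)
    attn = (cfg.d_model * cfg.d_model                      # Q projection
            + 2 * cfg.d_model * cfg.n_kv_heads * head_dim  # K and V (GQA-reduced)
            + cfg.d_model * cfg.d_model)                   # O projection
    ffn = 3 * cfg.d_model * cfg.d_ff                       # gate, up, down (SwiGLU)
    norms = 2 * cfg.d_model                                # pre-attn + pre-FFN RMSNorm
    # per-layer total × n_layers, plus embedding and the final norm
    return embedding + cfg.n_layers * (attn + ffn + norms) + cfg.d_model

print(count_params(ModelConfig()))  # → 24697088
```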
3. POSITIONAL ENCODING: ALiBi (NOT ROPE)
Positional encoding informs the model where tokens are in the sequence. The dominant approach in 2024–2026 is Rotary Position Embeddings (RoPE), used by Llama, Mistral, and Qwen. This work rejects RoPE in favor of ALiBi (Attention with Linear Biases, Press et al. 2022).
Why not RoPE
RoPE applies rotation matrices to query and key vectors based on position. While mathematically elegant, it exhibits two practical problems: (1) poor extrapolation beyond training length without ad-hoc scaling hacks (NTK-aware, YaRN, etc.), and (2) additional learned parameters that interact with the attention computation in ways that are difficult to debug at small scale. Prior empirical observation on this project confirmed unstable long-context behavior with RoPE.
ALiBi mechanism
ALiBi adds a linear distance penalty directly to attention scores. No parameters are learned. Each attention head receives a different slope (geometric sequence), creating a spectrum from sharp local attention to broad distant attention.
| Property | ALiBi | RoPE | Learned | Sinusoidal |
|---|---|---|---|---|
| Learned parameters | 0 | Implicit (rotations) | seq_len × d_model | 0 |
| Length generalization | Train short, test long | Requires scaling hacks | Hard-capped | Degrades |
| Implementation complexity | Low (bias matrix) | Medium (rotations) | Low | Low |
| Industry adoption | MPT, BLOOM | Llama, Mistral, Qwen | GPT-2 | Original Transformer |
Slope computation
For n attention heads, slopes form a geometric sequence: slope_i = 2^(−8i/n) for i ∈ {1, ..., n}. With 8 heads, slopes range from 2^(−1) = 0.5 (sharp, local focus) to 2^(−8) ≈ 0.0039 (broad, distant attention).
| Head | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Slope | 0.5 | 0.25 | 0.125 | 0.0625 | 0.0313 | 0.0156 | 0.0078 | 0.0039 |
Behavior ranges from strong local focus (steep slopes, first heads) through medium range to broad, distant attention (shallow slopes, last heads).
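The slope sequence is trivially reproducible; a sketch matching the table above:

```python
def alibi_slopes(n_heads: int) -> list[float]:
    # slope_i = 2^(-8i/n) for heads i = 1..n: a geometric sequence from
    # 0.5 (sharp local attention) down to ~0.0039 (broad distant attention)
    return [2 ** (-8 * i / n_heads) for i in range(1, n_heads + 1)]

print(alibi_slopes(8))
# → [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```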
A critical bug surfaced here during implementation: future positions initially received the soft ALiBi distance penalty instead of hard -inf. The causal mask was effectively leaking — the model could attend to future tokens with a soft distance penalty rather than being blocked entirely. This would have produced a model that appears to train normally but fails catastrophically at inference (where future tokens are unavailable). The bug was caught by test #22 of the validation suite (Section 11) and corrected before training began.
4. GROUPED QUERY ATTENTION
Standard Multi-Head Attention (MHA) assigns independent Key and Value projections to each attention head. Grouped Query Attention (GQA, Ainslie et al. 2023) shares KV projections across groups of query heads, reducing memory and parameter cost without proportional quality loss.
| Variant | Q Heads | KV Heads | KV Params/Layer | Quality |
|---|---|---|---|---|
| MHA (8:8) | 8 | 8 | 131,072 | Maximum |
| GQA (8:2) | 8 | 2 | 32,768 | Near-MHA |
| MQA (8:1) | 8 | 1 | 16,384 | Degraded |
The 4:1 ratio saves ~98K parameters per layer × 12 layers = 1.18M parameters total versus MHA. At 24.7M total, this is a 4.8% budget reallocation. Four query heads share each KV head via tensor repetition (_repeat_kv): the KV tensor is expanded along the head dimension without copying data.
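A minimal sketch of the _repeat_kv expansion described above (the project's actual helper may differ in detail): unsqueeze + expand create a stride-0 view along a new repeat axis, so no data is copied until a downstream op requires contiguous memory.

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand KV heads to match the query head count.
    kv: (batch, n_kv_heads, seq, head_dim) -> (batch, n_kv_heads * n_rep, seq, head_dim)
    """
    if n_rep == 1:
        return kv
    b, h_kv, s, d = kv.shape
    # expand is a free view (stride 0 on the repeat axis); the final reshape
    # only materializes a copy if the consumer needs contiguous memory
    return kv[:, :, None, :, :].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)
```

With 8 query heads and 2 KV heads (n_rep=4), output heads 0–3 all read KV head 0 and heads 4–7 read KV head 1.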
Projection dimensions
| Projection | Shape | Parameters |
|---|---|---|
| Q (query) | 256 × 256 | 65,536 |
| K (key) | 256 × 64 | 16,384 |
| V (value) | 256 × 64 | 16,384 |
| O (output) | 256 × 256 | 65,536 |
| Total attention/layer | — | 163,840 |
5. SwiGLU FEED-FORWARD NETWORK
The standard Transformer FFN uses two projections with a nonlinear activation: FFN(x) = W_2 · σ(W_1 · x). SwiGLU (Shazeer 2020) replaces this with a gated variant using three projections: FFN(x) = W_down · (SiLU(W_gate · x) ⊙ W_up · x).
Parameter equivalence
Standard FFN with expansion factor 4: 2 × d × 4d = 8d² parameters. SwiGLU uses 3 matrices: 3 × d × d_ff. Setting 3 × d × d_ff = 8d² gives d_ff = (8/3)d ≈ 683, rounded to 688 for d = 256. Total FFN parameters per layer:
| FFN Type | Matrices | d_ff | Params/Layer |
|---|---|---|---|
| Standard (ReLU/GELU) | 2 | 1,024 | 524,288 |
| SwiGLU | 3 | 688 | 528,384 |
Nearly identical parameter budget (+0.8%), empirically superior activation function. The gating mechanism allows the network to selectively amplify or suppress information — a capability that standard FFNs lack.
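A minimal sketch of the SwiGLU block as described (module and projection names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN: down(SiLU(gate(x)) * up(x)), all projections bias-free."""
    def __init__(self, d_model: int = 256, d_ff: int = 688):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)  # W_gate
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)    # W_up
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(W_gate x) gates W_up x elementwise, then projects back to d_model
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```

At d_model=256, d_ff=688 the module holds 3 × 256 × 688 = 528,384 parameters, matching the table.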
6. NORMALIZATION & RESIDUAL DESIGN
RMSNorm over LayerNorm
LayerNorm (Ba et al. 2016) performs two operations: mean subtraction and variance normalization.
RMSNorm (Zhang & Sennrich 2019) drops the mean subtraction entirely, computing only the root mean square.
With 12 layers × 2 norms per layer = 24 norm operations per forward pass, the cumulative speedup
is measurable. Each RMSNorm has exactly d_model = 256 learnable parameters (the scale vector).
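A minimal RMSNorm sketch matching the description (no mean subtraction, no bias, exactly d_model learnable scales):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale each vector by the reciprocal of its root mean square."""
    def __init__(self, dim: int = 256, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # the only learnable params
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt of the mean square along the feature dim; eps guards zero input
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```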
Pre-norm residual structure
Each Transformer block follows the pattern:
x = x + Attention(RMSNorm(x))  # residual path is unobstructed
x = x + FFN(RMSNorm(x))        # gradient flows directly through addition
The alternative — post-norm — places the normalization after the residual addition:
x = RMSNorm(x + Attention(x)). This creates a gradient bottleneck through the norm layer.
Pre-norm eliminates this bottleneck, producing more stable training in deep networks without
requiring careful learning rate tuning.
7. V1 PARAMETER BUDGET ANALYSIS
At 24.7M parameters, every allocation decision is visible. The embedding layer dominates — a structural consequence of pairing a large vocabulary (64K) with a small hidden dimension (256).
| Component | Parameters | % of Total | Notes |
|---|---|---|---|
| Token embedding (tied) | 16,384,000 | 66.3% | 64,000 × 256; shared with output projection |
| Attention (all layers) | 1,966,080 | 8.0% | 163,840 per layer × 12 |
| FFN / SwiGLU (all layers) | 6,340,608 | 25.7% | 528,384 per layer × 12 |
| RMSNorm (all layers + final) | 6,400 | <0.1% | 256 per norm × 25 norms |
| LM head | 0 | 0% | Weight tying: reuses embedding |
| Total | 24,697,088 | 100% | |
Weight initialization
All 2D weight matrices are initialized with Normal(0, 0.02). Output projections
(o_proj, down_proj) are additionally scaled by 1/√(2 × n_layers) =
1/√24 ≈ 0.204 to prevent residual stream explosion through 12 layers of additive
contributions. This follows the GPT-2 initialization scheme.
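A sketch of this initialization scheme, assuming the output projections are reachable by the names o_proj and down_proj used in the tables above (the actual model.py may structure this differently):

```python
import math
import torch
import torch.nn as nn

def init_weights(model: nn.Module, n_layers: int = 12) -> None:
    """GPT-2-style init: N(0, 0.02) everywhere, output projections scaled down."""
    scale = 1.0 / math.sqrt(2 * n_layers)  # 1/sqrt(24) ≈ 0.204 for 12 layers
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            # residual-stream contributions are damped so 12 additive layers
            # do not inflate activation variance
            if name.endswith(("o_proj", "down_proj")):
                module.weight.data.mul_(scale)
```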
8. TRAINING CORPUS CURATION
The full Phase 1 corpus spans 22 GB across 11 domains. For pretraining a 24.7M model, a curated subset of ~440 MB (raw text) was selected. The selection criteria: (1) prioritize raw text over structured data, (2) include reasoning-oriented material, (3) exclude instruction-tuning data (reserved for SFT in Phase 4), (4) cap Wikipedia to prevent encyclopedic dominance.
| File | Raw Size | Tokens | % of Corpus | Selection Rationale |
|---|---|---|---|---|
| wikipedia_tr.txt | 310 MB (capped from 866 MB) | 67,407,762 | 67.7% | General knowledge, encyclopedic Turkish |
| orca_math_tr.txt | 117 MB | 28,648,305 | 28.8% | Mathematical reasoning (north star objective) |
| tdk_full.txt | 9.5 MB | 2,398,611 | 2.4% | Official dictionary; proper word usage |
| turkish_folk_songs.txt | 2.3 MB | 653,972 | 0.7% | Colloquial/emotional Turkish |
| turkish_idioms_proverbs.txt | 1.6 MB | 369,462 | 0.4% | Idiomatic language, cultural knowledge |
| turkish_poems.txt | 0.4 MB | 100,141 | 0.1% | Literary Turkish, complex grammar |
| turkish_mmlu_exams.txt | 0.2 MB | 40,187 | <0.1% | Academic/exam format, broad vocabulary |
| literary_short_stories.txt | 0.1 MB | 18,596 | <0.1% | Narrative structure, dialogue |
| Total | ~441 MB | 99,637,036 | 100% |
Excluded data in Round 1 (later used in Round 2)
| File | Size | R1 Exclusion Rationale | R2 Status |
|---|---|---|---|
| instruc_turca.txt | 3.7 GB | Instruction data; format learning before language learning is counterproductive | Included |
| rag_dataset_tr.txt | 31 MB | ERP-domain-specific; too narrow for general pretraining | Included |
| Wikipedia (remaining) | 556 MB | Capped at 300 MB to prevent encyclopedic bias | Uncapped |
| Academic, legal, medical… | ~16 GB | Not yet collected at time of Round 1 | Included |
9. DATA PIPELINE
The pipeline converts raw text to training batches in two offline steps, followed by real-time random sampling during training.
Step 1: Tokenization (offline, one-time)
Raw .txt files are split into documents by double newline, filtered (minimum 20 characters,
5 tokens), wrapped with BOS/EOS markers, and tokenized with the 64K BPE tokenizer. The resulting integer
sequence is stored as a contiguous uint16 binary file (190 MB). uint16 is the tight fit:
vocab_size = 64,000 ≤ 65,535 (the maximum uint16 value), so each token uses exactly 2 bytes with no waste.
Step 2: Memory-mapped random sampling
The binary file is memory-mapped via numpy.memmap. Each training sample is drawn by selecting
a random starting position and extracting a contiguous window of seq_len + 1 = 513 tokens.
The first 512 tokens serve as input; the last 512 (shifted by one) serve as targets. This approach has
no concept of epochs — with 99.6M tokens, nearly every token offset is a valid window start (~99.6M
possible positions), so samples draw fresh contexts throughout the run rather than cycling a fixed epoch order.
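A minimal sketch of the sampler described above, assuming the uint16 binary layout from Step 1 (function name and signature are illustrative):

```python
import numpy as np

def sample_batch(bin_path: str, batch_size: int, seq_len: int = 512, rng=None):
    """Draw random (input, target) windows from the memory-mapped token file."""
    rng = rng or np.random.default_rng()
    # memmap: the OS page cache serves random reads; nothing is loaded up front
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    # each sample needs seq_len + 1 tokens: input is w[:-1], target is w[1:]
    starts = rng.integers(0, len(data) - seq_len - 1, size=batch_size)
    windows = np.stack([data[s : s + seq_len + 1] for s in starts]).astype(np.int64)
    return windows[:, :-1], windows[:, 1:]
```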
| Pipeline Stage | Input | Output | Time |
|---|---|---|---|
| Tokenize (64K BPE) | 441 MB raw text (8 files) | 190 MB uint16 binary | ~2.5 min |
| Memory map | 190 MB binary | Random access via OS page cache | Instant |
| Sample | Random offset | (512 input, 512 target) tensor pair | <1 ms |
10. V1 TRAINING CONFIGURATION
| Hyperparameter | Value | Justification |
|---|---|---|
| Optimizer | AdamW | Decoupled weight decay; industry standard for Transformers |
| β1, β2 | 0.9, 0.95 | β2=0.95 (not 0.999): faster adaptation to changing gradient landscape |
| Learning rate | 3e-4 | Scaling law: smaller models tolerate higher LR |
| Min learning rate | 3e-5 | 10:1 ratio; prevents complete learning cessation in final steps |
| LR schedule | Cosine decay | More time at peak LR than linear; Chinchilla standard |
| Warmup | 500 steps | ~1% of training; stabilizes optimizer momentum estimates |
| Weight decay | 0.1 | Applied to 2D+ matrices only; norms exempt (their target is 1.0, not 0.0) |
| Gradient clipping | 1.0 (global norm) | Prevents catastrophic updates from gradient spikes; direction preserved |
| Max steps | 50,000 | 3.27B token-reads at batch=128; ~33 passes over corpus |
| Precision | bfloat16 (CUDA) / float32 (MPS) | bfloat16 has float32 range with float16 size; MPS lacks bfloat16 support |
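The warmup plus cosine schedule can be sketched in a few lines (values from the table above; the project's scheduler may differ in edge-case handling):

```python
import math

def lr_at(step: int, max_steps: int = 50_000, warmup: int = 500,
          peak: float = 3e-4, floor: float = 3e-5) -> float:
    """Linear warmup to the peak LR, then cosine decay to the 10:1 floor."""
    if step < warmup:
        return peak * (step + 1) / warmup          # linear ramp
    progress = (step - warmup) / (max_steps - warmup)
    # cosine from peak (progress=0) down to floor (progress=1)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```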
Effective batch size
| Setting | MPS (MacBook M4) | CUDA (H100 NVL) — Round 1 | CUDA (H100 NVL) — Round 2 |
|---|---|---|---|
| Physical batch | 4 | 128 | 128 |
| Gradient accumulation | 8 | 1 | 1 |
| Effective batch | 32 | 128 | 128 |
| Tokens per step | 16,384 | 65,536 | 65,536 |
| Total steps | — | 50,000 | 228,000 |
| Total token-reads | — | 3.27B | 14.94B |
11. VALIDATION SUITE: 60 PARANOID TESTS
Before committing compute to training, every component of the pipeline was validated by a 60-test suite covering 13 categories. The suite was designed to catch the class of bugs that produce models which appear to train normally but fail silently.
| Category | Tests | What It Validates |
|---|---|---|
| 1. Tokenizer | 8 | Loading, vocab size, special tokens, Turkish encoding, roundtrip, uint16 safety |
| 2. Model Architecture | 6 | Parameter count, weight tying on/off, layer count, hidden dimension |
| 3. RMSNorm | 4 | Shape preservation, unit scale, parameter count, zero-input stability |
| 4. ALiBi | 7 | Geometric slopes, exact values, shape, causal mask integrity, diagonal, distance penalty, generalization |
| 5. GQA | 4 | Output shape, 4:1 ratio, projection shapes, causal independence |
| 6. SwiGLU | 3 | Shape, 3 projections, per-layer parameter count |
| 7. Transformer Block | 3 | Shape, residual connections, pre-norm ordering |
| 8. Full Model | 7 | Logits shape, loss shape, initial loss sanity, gradient flow, NaN detection, learning verification |
| 9. Generation | 4 | Token production, max_tokens, determinism, valid ID range |
| 10. Data Pipeline | 5 | File existence, sizes, tokenizer path, directory writability |
| 11. LR Schedule | 3 | Warmup ramp, cosine decay endpoint, monotonic decrease |
| 12. Device | 3 | MPS/CUDA availability, model execution, ALiBi transfer |
| 13. Numerical Stability | 3 | Edge-case IDs, full-length sequences, gradient accumulation equivalence |
| Total | 60 | All passed prior to training |
The bug caught by test #22 deserves detail. build_alibi_bias computed relative distances as positions.unsqueeze(0) - positions.unsqueeze(1), which transposed the query/key axes. Future positions received the ALiBi distance penalty (e.g., -0.5) instead of hard -inf. This meant the model could attend to future tokens with a softened penalty rather than being fully masked. Training would appear normal (loss decreasing, gradients stable) but the model would learn to rely on information that is unavailable during autoregressive generation. The fix: separate the causal mask (hard -inf for future) from the distance penalty (soft negative values for past), using positions.unsqueeze(1) - positions.unsqueeze(0) with explicit clamp(min=0) on distances before applying masked_fill_.
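A corrected construction along the lines described, shown as a sketch (function name follows the text; the project implementation may differ):

```python
import torch

def build_alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head ALiBi bias with a hard causal mask: shape (n_heads, seq, seq)."""
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    # query index minus key index: positive for past keys, negative for future
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)
    # soft linear penalty on the past only (clamp removes negative distances)
    penalty = -slopes.view(n_heads, 1, 1) * dist.clamp(min=0)
    # hard -inf on strictly future positions: the causal mask proper
    future = dist < 0
    return penalty.masked_fill(future.unsqueeze(0), float("-inf"))
```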
12. INFRASTRUCTURE: MPS TO H100
Training was initiated on Apple M4 (MacBook Air, passive cooling) to validate the pipeline, then migrated to NVIDIA H100 NVL on RunPod for production training.
| Metric | M4 MacBook Air (MPS) | H100 NVL (CUDA) | Speedup |
|---|---|---|---|
| GPU | Apple M4 (integrated) | NVIDIA H100 NVL 95 GB | — |
| Precision | float32 | bfloat16 (autocast) | 2× throughput |
| torch.compile | Not supported | Enabled | ~30-50% speedup |
| Batch size | 4 × 8 accum = 32 | 128 × 1 = 128 | 4× tokens/step |
| Step time | ~3,500 ms | ~140 ms | 25× |
| Tokens/sec | ~5,000 | ~400,000 | 80× |
| VRAM used | ~8 GB (shared) | 72 GB / 95 GB | — |
| GPU utilization | ~100% (throttled to 80°C) | 97% at 53°C | — |
| Estimated total time | ~48 hours | ~2 hours | 24× |
| Estimated cost | Electricity only | ~$4.76 (RunPod, $2.38/hr) | — |
Migration changes
Three modifications were required to move from MPS to CUDA:
- Precision: float32 → bfloat16 via torch.amp.autocast. bfloat16 has the dynamic range of float32 (8 exponent bits) with the memory footprint of float16 (16 bits total), eliminating the need for loss scaling.
- Compilation: torch.compile(model) JIT-compiles the model graph into fused CUDA kernels, eliminating Python overhead and enabling kernel fusion across operations.
- Batch scaling: physical batch increased from 4 to 128; gradient accumulation removed. The H100’s 95 GB VRAM accommodates the full effective batch in a single forward pass.
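The resulting training step can be sketched as follows. This is an illustrative pattern, not the project's exact train.py; device_type="cpu" also works for local smoke tests, since bfloat16 autocast needs no GradScaler on either backend.

```python
import torch
import torch.nn.functional as F

def train_step(model, x, y, optimizer, device_type: str = "cuda") -> float:
    """One mixed-precision step: bf16 autocast, global-norm clipping, AdamW update."""
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at global norm 1.0
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()

# On CUDA, the model would additionally be wrapped once at startup:
# model = torch.compile(model)   # kernel fusion, ~30-50% faster
```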
13. V1 ROUND 1 RESULTS (50K STEPS, 440 MB)
COMPLETE — 50,000 steps, 2.0 hours, final loss 2.62. This section documents Round 1 on the ~440 MB curated subset. Round 2 on the full 22 GB corpus is documented in Section 14.
Loss curve
| Step | Loss | Perplexity | LR | Tokens/sec | Phase |
|---|---|---|---|---|---|
| 10 | 11.07 | ~64,000 | 6.00e-06 | 15,138 | Warmup (random guessing, loss ≈ ln(64000)) |
| 100 | 9.61 | ~15,000 | 6.00e-05 | 96K | Warmup (learning token frequencies) |
| 500 | 6.00 | ~403 | 3.00e-04 | 290K | Warmup complete; peak LR reached |
| 1,000 | 4.82 | ~124 | 3.00e-04 | 355K | Word boundaries, basic grammar |
| 2,000 | 3.93 | ~51 | 2.99e-04 | 399K | Common phrases, suffixes |
| 3,000 | 3.60 | ~37 | 2.98e-04 | 415K | Sentence fragments forming |
| 8,000 | 3.15 | ~23 | 2.85e-04 | 438K | Mathematical notation, equations |
| 10,000 | 3.12 | ~23 | 2.76e-04 | 441K | Wikipedia articles, proper nouns |
| 12,000 | 3.05 | ~21 | 2.66e-04 | 443K | Dictionary format, subordinate clauses |
| 19,000 | 2.94 | ~19 | 2.17e-04 | 446K | Encyclopedic titles, date suffixes |
| 30,000 | 2.78 | ~16 | 1.48e-04 | 449K | Cosine decay phase; diminishing returns |
| 50,000 | 2.62 | ~14 | 3.00e-05 | 451K | Final: filmographies, cultural references |
Generated samples across training
Three hardcoded prompts (“Merhaba”, “Türkiye”, “Stok”) were defined; every 1,000 steps one was selected at random and sampled at temperature 0.8 with top-k 40. No rules were programmed; all linguistic structure was learned from next-token prediction alone.
| Step | Prompt | Output | Observation |
|---|---|---|---|
| 1,000 | Stok | Stokon | Single fragment; knows token boundaries |
| 2,000 | Merhaba | Merhaba | Recognizes greeting; stops at EOS |
| 3,000 | Merhaba | Merhaba, 6. sezon. | First multi-token output; correct grammar and punctuation |
| 8,000 | Stok | Stokes) = 5.000, K + J - 3J = 15.000 | Mathematical notation from orca_math_tr.txt (28.8% of corpus) |
| 10,000 | Merhaba | Merhaba Dünya Kızı, Istanbul’a Gidiyor | Folk song register; correct apostrophe + dative suffix |
| 12,000 | Merhaba | Örnek: Gözlerini bu kadar beğenip, iyi bir şey sevdiğine de biraz daha âşık ol | TDK dictionary format; subordinate clauses with -dik participles |
| 19,000 | Türkiye | Türkiye’deki etnik gruplar, Moğolistan’ın Yahudi tarihi, 1901’de Türkiye | Wikipedia article titles; correct locative/possessive suffixes |
| 50,000 | Merhaba | Cary, “Deli Gömülü” (2001), Hymnogy, “İman İçin Bir Şey” (2002) | Filmography/discography entries with quoted titles and years |
14. V1 ROUND 2: FULL CORPUS (228K STEPS, 22 GB)
Round 1 demonstrated the pipeline worked and the model could learn Turkish from a ~440 MB subset. Round 2 scaled to the full 22 GB corpus — the same corpus used to train the 64K tokenizer in Phase 1 — across all 11 domains: general knowledge, academic, legal, medical, financial, education, news, code, literary, reasoning, and instructions.
Why Round 2?
Round 1 trained on 99.6M tokens (441 MB subset, 67.7% Wikipedia). The model learned grammar and basic vocabulary but lacked domain diversity. Round 2 exposed the model to legal Turkish (court decisions), medical terminology, financial reporting, news journalism, academic writing, and instruction-following patterns — preparing a more robust base for SFT specialization.
Training data: full corpus
| Domain | Sources | Size |
|---|---|---|
| General Knowledge | Wikipedia TR (520K articles, uncapped) | 866 MB |
| Academic/Thesis | BellaTurca AkademikDerlem (668K papers) | 3.5 GB |
| Cultural/Literary Web | BellaTurca ÖzenliDerlem (1.4M curated docs) | 4.4 GB |
| News/Journalism | 1.8M news articles + summarization corpus | 4.5 GB |
| Legal/Law | 700K court decisions + Constitutional Court | 3.7 GB |
| Instructions | 2.5M instruction-answer pairs | 3.7 GB |
| Code | Python corpus | 569 MB |
| Financial | KAP announcements, capital markets | 425 MB |
| Reasoning | Math problems, RAG, chain-of-thought | 221 MB |
| Medical | Medical reasoning + hospital articles | 108 MB |
| Education & Vocabulary | QA, MMLU exams, TDK dictionary, literature | ~100 MB |
| Total | 27 files, 11 domains | 22 GB |
Round 2 configuration changes
| Parameter | Round 1 | Round 2 | Rationale |
|---|---|---|---|
| Training data | 441 MB (8 files) | 22 GB (27 files) | Full corpus for maximum domain coverage |
| Max steps | 50,000 | 228,000 | Scaled proportionally to data volume |
| Batch size | 128 | 128 | Unchanged |
| Starting point | Random init | step_050000.pt | Resume from Round 1 checkpoint |
| Learning rate | 3e-4 → 3e-5 | 3e-4 → 3e-5 | Fresh cosine schedule from peak |
Loss curve: Round 2
| Step | Loss | LR | Tok/s | Sample Quality |
|---|---|---|---|---|
| 50,000 (R1 end) | 2.62 | 3.0e-05 | 451K | Filmographies, encyclopedic entries |
| 95,000 | ~3.60 | 2.0e-04 | 399K | Simple sentences, some repetition |
| 145,000 | ~3.50 | 1.1e-04 | 401K | Coherent multi-clause sentences |
| 195,000 | ~3.47 | 4.4e-05 | 402K | Factual: “Kocaeli’ndeyiz” |
| 200,000 | 3.39 | 4.0e-05 | 403K | Best loss — human rights discussion |
| 228,000 | 3.46 | 3.0e-05 | 403K | Final: real-world knowledge, correct grammar |
Sample evolution: Round 2
| Step | Sample Output | Quality |
|---|---|---|
| 95,000 | “Bu e-postayı seviyorum! Bu e-postanın amacı, bu e-postanın ana no…” | Repetitive, generic |
| 145,000 | “Türkiye’de ve dünyada ekonomik gelişmeler açısından büyük önem taşımaktadır” | Coherent, meaningful |
| 195,000 | “Türkiye’nin en büyük ikinci sanayi şehri konumundaki Kocaeli’ndeyiz” | Factual, specific, correct suffixes |
| 200,000 | “Türkiye’de tüm dünyada ‘insan hakları’ndan söz edildiği gibi…” | Complex topic, proper structure |
15. ROUND 2.5: 2048-CONTEXT FOR RAG (228K STEPS)
Round 2 trained with max_seq_len=512, inherited from the initial architecture.
However, the RAG use case requires processing system prompt + retrieved context chunk + user question
+ generated answer in a single sequence. Typical RAG prompts consume 600–1,300 tokens. A 512-token
model cannot serve RAG — so Round 2.5 retrained the same 24.7M model with 2048-token context.
Why Round 2.5?
Round 2’s 512-token context is a hard ceiling for downstream tasks. The SFT phase needs to fit:
- System prompt (~38 tokens): ERP sistemi asistanısın. Verilen bağlam bilgilerini kullanarak soruyu yanıtla... (“You are an ERP system assistant. Answer the question using the provided context...”)
- Retrieved context chunk (200–800 tokens): ERP documentation from the RAG retriever
- User question (20–60 tokens): natural Turkish query
- Generated answer (50–300 tokens): model’s response
Total: 308–1,198 tokens per turn. A 512-token model would truncate most inputs. ALiBi’s extrapolation property helps, but training at the target context length yields far better attention patterns. 2048 tokens provides comfortable headroom for even the longest multi-chunk RAG prompts.
Configuration changes from Round 2
| Parameter | Round 2 | Round 2.5 | Rationale |
|---|---|---|---|
| max_seq_len | 512 | 2048 | 4× context for RAG prompt fitting |
| dropout | 0.0 | 0.02 | Mild regularization; reduces repetitive generation |
| batch_size | 128 | 8 | 4× longer sequences = 4× more memory per sample |
| grad_accum_steps | 1 | 4 | Effective batch = 32 (8 × 4). Preserves batch scale. |
| Starting point | step_050000.pt | step_228000.pt | Resume from Round 2 final checkpoint |
| Learning rate | 3e-4 → 3e-5 | 3e-4 → 3e-5 | Fresh cosine schedule for context adaptation |
| Training data | 22 GB (27 files) | 22 GB (27 files) | Same corpus, now with 4× longer windows |
An initial attempt with batch_size=32 (the same effective batch as R2) triggered CUDA out-of-memory on the H100: the 4× sequence length quadrupled attention memory. Solution: reduce the per-GPU batch to 8 and compensate with 4-step gradient accumulation. Effective batch size stays at 32, but each step takes ~2.5× longer due to the longer sequences and accumulation overhead: 418 ms/step vs ~165 ms/step in Round 2.
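The accumulation loop can be sketched as follows (illustrative; the key detail is dividing each micro-batch loss by accum_steps so the summed gradients equal the full-batch mean gradient, which is exactly what test #60 of the suite validates):

```python
import torch
import torch.nn as nn

def accumulated_step(model, batches, optimizer, accum_steps: int = 4) -> float:
    """One optimizer step built from accum_steps micro-batches (e.g. 4 × 8 = 32)."""
    optimizer.zero_grad(set_to_none=True)
    total = 0.0
    for x, y in batches:
        # dividing by accum_steps makes the accumulated .grad equal the
        # gradient of the mean loss over the full effective batch
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()   # gradients accumulate in .grad across micro-batches
        total += loss.item()
    optimizer.step()
    return total
```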
Loss curve: Round 2.5
| Step | Loss | LR | Tok/s | Notes |
|---|---|---|---|---|
| 0 (R2 checkpoint) | ~3.46 | 3.0e-04 | 158K | Starting from R2 final, fresh LR schedule |
| ~50,000 | ~3.40 | 2.6e-04 | 158K | Model adapting to 4× context windows |
| ~100,000 | ~3.35 | 1.9e-04 | 158K | Steady improvement from longer-range attention |
| ~150,000 | ~3.28 | 1.2e-04 | 158K | Cross-sentence coherence improving |
| ~200,000 | 3.22 | 5.0e-05 | 158K | Best loss — long-range dependencies learned |
| 228,000 | 3.33 | 2.0e-05 | 158K | Final: LR at minimum, slight loss uptick |
The dropout=0.02 setting was motivated by repetitive generation patterns observed during R2 sampling. While the primary goal of this round was RAG context extension, the mild dropout also provides regularization for the subsequent SFT phase, where the small training set (thousands, not billions, of examples) creates overfitting risk. The 0.02 value was chosen conservatively: enough to break repetition loops without degrading pretraining quality.
16. V2 ARCHITECTURE: 67.6M RAG-OPTIMIZED MODEL
The v1 model (24.7M params) proved that the training pipeline, tokenizer, and infrastructure work. But 24.7M parameters is severely capacity-limited for a RAG assistant that must read context, understand questions, and generate coherent answers. The v2 architecture was designed from scratch with a single principle: this model is a context converter, not a knowledge base.
The design thesis: for context comprehension, width (larger d_model) matters more than depth (more layers).
Architecture comparison: v1 vs v2
| Parameter | v1 (24.7M) | v2 (67.6M) | Change |
|---|---|---|---|
| d_model | 256 | 512 | 2× representation width |
| n_layers | 12 | 12 | Same depth |
| n_heads | 8 | 8 | Same query heads |
| n_kv_heads | 2 | 4 | 4:1 → 2:1 GQA, richer attention diversity |
| head_dim | 32 | 64 | 2× per-head capacity (matches GPT-2, LLaMA standard) |
| d_ff | 688 | 1376 | 2× FFN capacity |
| max_seq_len | 512 | 2048 | Native RAG context length |
| dropout | 0.0 | 0.02 | Proven in R2.5 |
| Embedding params | 16.4M (66.3%) | 32.8M (48.5%) | Balanced, not vocabulary-heavy |
| Transformer params | 8.3M (33.7%) | 34.8M (51.5%) | 4.2× more compute capacity |
| Total params | 24.7M | 67.6M | 2.7× total, 4.2× transformer |
Parameter budget: v2
Per-layer breakdown (2,900,992 params)
| Component | Computation | Parameters |
|---|---|---|
| Q projection | 512 × 512 | 262,144 |
| K projection | 512 × 256 (4 KV heads × 64) | 131,072 |
| V projection | 512 × 256 | 131,072 |
| O projection | 512 × 512 | 262,144 |
| Attention subtotal | — | 786,432 |
| Gate (SwiGLU) | 512 × 1376 | 704,512 |
| Up (SwiGLU) | 512 × 1376 | 704,512 |
| Down (SwiGLU) | 1376 × 512 | 704,512 |
| FFN subtotal | — | 2,113,536 |
| Norms (attn + ffn) | 512 + 512 | 1,024 |
| Layer total | — | 2,900,992 |
The doubling of d_model from 256 to 512 was deliberate. For a context-conversion task, each layer needs enough representational capacity to attend over long RAG contexts (2048 tokens) and capture the relationship between question tokens and answer tokens scattered across the context. Wider layers with 64-dim attention heads (matching the LLaMA/GPT-2 standard) provide this. Adding more thin layers would increase depth but not per-layer comprehension — the wrong tradeoff for RAG.
V2 training configuration
| Parameter | v1 (R2/R2.5) | v2 | Rationale |
|---|---|---|---|
| Learning rate | 3e-4 → 3e-5 | 1.5e-4 → 1.5e-5 | Lower peak for larger model stability |
| Warmup steps | 500 | 2,000 | Longer warmup for 2.7× more parameters |
| Max steps | 228,000 | 228,000 | Same token budget (~14.9B tokens) |
| Batch size | 8 × 4 | 8 × 4 | Effective 32, same as R2.5 |
| Training data | 22 GB (27 files) | 22 GB (27 files) | Same corpus |
| Precision | bfloat16 | bfloat16 | H100 native |
| Compile | torch.compile | torch.compile | Kernel fusion for speed |
| Checkpoint dir | checkpoints_2048/ | checkpoints_v2/ | Separate from v1 |
17. REPRODUCIBILITY & EXPERIMENT LOG
Every file, command, and decision is documented below to enable exact reproduction and — equally important — to prevent re-running experiments that were already tried.
17.1 File inventory
| File | Size | Purpose |
|---|---|---|
| tiny_llm/config.py | 127 lines | V1 ModelConfig (24.7M) + TrainConfig (hyperparameters) |
| tiny_llm/config_v2.py | 130 lines | V2 ModelConfig (67.6M) + TrainConfig (RAG-optimized) |
| tiny_llm/model.py | 266 lines | V1 Transformer: ALiBi, GQA, SwiGLU, RMSNorm, weight tying |
| tiny_llm/model_v2.py | ~300 lines | V2 Transformer: same architecture, larger dimensions |
| tiny_llm/train.py | 283 lines | V1 pretraining loop (R1, R2, R2.5) |
| tiny_llm/train_v2.py | ~400 lines | V2 pretraining loop with bfloat16 + torch.compile |
| tiny_llm/train_sft_rag.py | ~500 lines | RAG-grounded SFT training (reads sft_raw_pairs.json) |
| tiny_llm/sft_data.py | ~130 lines | SFT dataset + assistant-only loss masking |
| tiny_llm/data.py | 173 lines | Data pipeline for R1: tokenize 8 curated files → 190 MB |
| tiny_llm/data_full.py | 183 lines | Streaming pipeline for R2+: tokenize 27 files (22 GB) |
| tiny_llm/test_everything.py | 764 lines | 60-test validation suite covering all v1 components |
| tiny_llm/generate.py | 108 lines | Text generation from trained checkpoint |
| erp_rag/generate/sft_generate.py | 466 lines | API-based SFT data generation (Claude/GPT) |
| erp_rag/data/sft_chunk_groups.json | 9.6K lines | Master grouping blueprint (707 groups, 11 rules) |
| tokenizers/turkish_bpe_64k/tokenizer.json | 4.7 MB | 64K BPE tokenizer (Phase 1 output) |
| data/processed/*.txt | 22 GB | 27 raw text files across 11 domains |
17.2 Checkpoint inventory
| Checkpoint | Size | Step | Loss | Round | Location |
|---|---|---|---|---|---|
| step_050000.pt | 283 MB | 50,000 | 2.62 | R1 final | runpod_backup/ |
| step_228000.pt | 283 MB | 228,000 | 3.46 | R2 final | runpod_backup/round2_checkpoints/ |
| step_228000.pt | 283 MB | 228,000 | 3.33 | R2.5 final | checkpoints_2048/ |
Note: all checkpoints were saved from a model wrapped in torch.compile(), which prefixes all state dict keys with _orig_mod.. When loading on a non-compiled model, strip this prefix: cleaned = {k.replace("_orig_mod.", ""): v for k, v in state_dict.items()}. This cost ~1 hour of debugging during SFT; do not repeat this mistake.
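A loading helper along these lines avoids the trap. This is a sketch: the {"model": state_dict} checkpoint layout is an assumption, so adjust the key to the actual saved format.

```python
import torch

def load_checkpoint(model: torch.nn.Module, path: str) -> torch.nn.Module:
    """Load a checkpoint saved from a torch.compile-wrapped model."""
    # Assumed layout: {"model": state_dict, ...}; adjust if the real format differs.
    state = torch.load(path, map_location="cpu")["model"]
    # torch.compile prefixes every key with "_orig_mod."; strip it so a plain,
    # non-compiled model accepts the weights
    cleaned = {k.removeprefix("_orig_mod."): v for k, v in state.items()}
    model.load_state_dict(cleaned)
    return model
```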
17.3 Reproduction commands
Step 1: Prepare Round 1 data (local)
python -m tiny_llm.data # tokenizes 8 files → tiny_llm/data/train.bin (190 MB, ~2.5 min)
Step 2: Run validation suite
python -m tiny_llm.test_everything # 60 tests, all must pass before training
Step 3: Round 1 training (RunPod H100)
# Upload project to RunPod, then:
python -m tiny_llm.train --resume None
# 50K steps, ~2.0 hours, loss → 2.62
# Config: batch=128, lr=3e-4→3e-5, bfloat16, torch.compile
Step 4: Prepare Round 2 data
python -m tiny_llm.data_full # tokenizes all 27 files (22 GB) → train_full.bin (streaming, ~15 min)
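The streaming idea behind this step can be sketched as follows: tokenize line by line and append ids to the .bin as unsigned 16-bit integers, which a 64K vocabulary fits exactly (ids 0–65,535). The toy whitespace tokenizer below is a stand-in for the real 64K BPE tokenizer, and the function name is illustrative, not data_full.py's actual API:

```python
from array import array

def stream_tokenize(files, encode, out_path, flush_every=1_000_000):
    """Append token ids from many text files into one .bin, flushing the
    in-memory buffer every `flush_every` tokens so RAM stays bounded."""
    buf = array("H")  # uint16: a 64K vocab (ids 0..65535) fits exactly
    with open(out_path, "wb") as out:
        for path in files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    buf.extend(encode(line))
                    if len(buf) >= flush_every:
                        buf.tofile(out)
                        del buf[:]
        buf.tofile(out)  # final partial buffer

# Stand-in tokenizer (hypothetical): whitespace split, ids assigned on sight.
_vocab = {}
def toy_encode(text):
    return [_vocab.setdefault(w, len(_vocab)) for w in text.split()]
```

Reading the result back is symmetric: array("H") plus frombytes(), or the usual np.memmap with dtype=np.uint16 that .bin training pipelines tend to use.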
Step 5: Round 2 training (RunPod H100)
nohup python -m tiny_llm.train \
--resume tiny_llm/checkpoints/step_050000.pt \
--max-steps 228000 \
--data tiny_llm/data/train_full.bin \
> training_round2.log 2>&1 & # 228K steps, ~10.5 hours, loss → 3.46
Step 6: Round 2.5 training — 2048 context (RunPod H100)
# Modify config: max_seq_len=2048, dropout=0.02, batch_size=8, grad_accum=4
nohup python -m tiny_llm.train \
    --resume tiny_llm/checkpoints/step_228000.pt \
    --max-steps 228000 \
    --data tiny_llm/data/train_full.bin \
    --checkpoint-dir tiny_llm/checkpoints_2048 \
    > training_r25.log 2>&1 &
# 228K steps, ~26.5 hours, loss → 3.33
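The OOM fix in Step 6 relies on gradient accumulation: 8 sequences per micro-batch × 4 accumulation steps reproduces the effective batch of 32 at a quarter of the activation memory. A framework-free sketch of the pattern, using a toy scalar model (everything here is illustrative, not the trainer's code):

```python
def accum_step(w, micro_batches, lr=0.1):
    """One optimizer update over several micro-batches: per-micro-batch
    gradients are summed, then scaled by 1/grad_accum, so the update equals
    a single step on one large batch of equal-sized micro-batches."""
    grad_accum = len(micro_batches)          # e.g. 4 in Round 2.5
    g = 0.0
    for batch in micro_batches:              # batch: list of (x, y) pairs
        # gradient of mean squared error (w*x - y)^2 over this micro-batch
        g += sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * g / grad_accum           # update once per cycle

# 2 micro-batches of 1 give the same update as 1 batch of 2:
small = accum_step(0.0, [[(1.0, 2.0)], [(2.0, 4.0)]])
big = accum_step(0.0, [[(1.0, 2.0), (2.0, 4.0)]])
print(small, big)
```

The equivalence holds only when micro-batches are equal-sized, which is why the trainer keeps batch_size fixed within an accumulation cycle.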
Step 7: V2 pretraining (RunPod H100)
nohup python -u -m tiny_llm.train_v2 \
> training_v2.log 2>&1 & # 228K steps, uses config_v2.py + model_v2.py
Step 8: Generate SFT data (local, API-based)
python -m erp_rag.generate.sft_generate \
--provider anthropic \
--model claude-sonnet-4-6 # 707 groups → ~8K-10K Q&A pairs, saves to sft_raw_pairs.json
17.4 Experiments tried & decisions locked
| Experiment | What Was Tried | Result | Decision |
|---|---|---|---|
| RoPE vs ALiBi | RoPE implemented and tested before ALiBi | Unstable long-context; requires scaling hacks | ALiBi — locked |
| MHA (8:8) vs GQA (8:2) | Both tested for parameter budget | GQA saves 1.18M params (4.8%) with near-MHA quality | GQA 4:1 — locked |
| Local training (MPS) | M4 MacBook, batch 4×8 accum, float32 | 3,500 ms/step; CPU hit 93°C; passive cooling throttle | RunPod H100 only — locked |
| Round 1: 50K steps, 440 MB | Curated subset (8 files, 67.7% Wikipedia) | Loss 2.62; grammar + basic vocabulary learned | Sufficient for pipeline validation; do not extend |
| Round 2: 228K steps, 22 GB | Full corpus, 27 files, 11 domains | Loss 3.46 (best 3.39 at 200K); loss still decreasing | Capacity-limited at 26M params; more steps = diminishing returns |
| Loss plateau observation | R2 loss ~3.60 at step 95K, ~3.46 at 228K | Only 0.14 improvement over 133K steps; 14.9B tokens ≈ 573 tokens per parameter, far past the Chinchilla-optimal ~20/param | Model is saturated; proceed to SFT, not more pretraining |
| Wikipedia cap (Round 1) | Capped Wikipedia at 300 MB (of 866 MB) | Prevented encyclopedic bias; 67.7% is already dominant | Uncapped in Round 2 (full corpus diversity dilutes bias) |
| Dropout during pretraining | dropout = 0.0 for both rounds | Model is underfitting (capacity-limited), not overfitting | dropout 0.0 for pretraining — locked. Dropout 0.05 for SFT only. |
| LR schedule fresh restart for R2 | Fresh cosine 3e-4 → 3e-5 from step 50K checkpoint | Loss continued decreasing; good decision | Do not resume with decayed LR; always fresh schedule for new data |
| Tokenizer versions | 8 tokenizer variants tested: 16K, 32K (v1/v2), 48K (v1/v2), 64K (v1/v3) | 64K v3 won (~14% fewer tokens than alternatives) | 64K v3 — locked. Do not retrain tokenizer. |
| Round 2.5: 2048 context | Same v1 model, max_seq_len 512→2048, dropout 0.02, batch 8×4 | Loss 3.33 (best 3.22), 0.17 improvement over R2 | Required for RAG. OOM fixed with batch reduction + grad accum. |
| V2 architecture: 67.6M | d_model 256→512, n_kv_heads 2→4, d_ff 688→1376 | 4.2× more transformer params, balanced embed ratio | Width over depth — locked. RAG context converter design. |
| SFT API model comparison | Pilot: Sonnet 4, GPT-5.2, Opus 4.6, Sonnet 4.6 on same 10 groups | Sonnet 4.6 best: diverse questions, deep inference, minimal repetition | Claude Sonnet 4.6 for all SFT data generation — locked. |
| SFT data: intentional wrong answers | Earlier experiment: inject incorrect answers mid-response for self-correction | Terrible results — overall quality degraded significantly | Never inject bad answers into SFT. Use DPO preference pairs if needed. |
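The 1.18M figure in the GQA row can be reproduced with back-of-the-envelope arithmetic over the K/V projection matrices alone. The sketch below assumes biasless projections, d_head = d_model / n_heads, and 12 transformer layers; the layer count is our inference (the value consistent with the table's numbers), not stated in this section:

```python
def kv_proj_params(d_model, n_heads, n_kv_heads):
    """Parameter count of one layer's K and V projection matrices
    (biasless, with d_head = d_model // n_heads)."""
    d_head = d_model // n_heads
    return 2 * d_model * (n_kv_heads * d_head)  # K + V

# v1 dims: d_model=256, 8 query heads; n_layers=12 is an assumption.
d_model, n_heads, n_layers = 256, 8, 12
mha = kv_proj_params(d_model, n_heads, n_heads)  # MHA 8:8
gqa = kv_proj_params(d_model, n_heads, 2)        # GQA 8:2 (4:1 ratio)
saved = (mha - gqa) * n_layers
print(f"{saved:,}")                   # 1,179,648 ~ the table's 1.18M
print(f"{saved / 24_697_088:.1%}")    # 4.8% of the 24.7M total
```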
17.5 Training logs
| Log File | Size | Contents |
|---|---|---|
| runpod_backup/training.log | 454 KB | Round 1: 5,422 lines, steps 0–50,000 |
| runpod_backup/training_round2.log | 2.0 MB | Round 2: steps 50,001–228,000, 10.5 hours |
| training_r25.log | ~4 MB | Round 2.5: steps 0–228,000, 26.5 hours, 2048 context |
| sft_generation.log | ~2 MB | SFT data generation: 707 API calls, pair counts, errors |
18. PROJECT STATUS
| Phase | Status | Key Result |
|---|---|---|
| Phase 1: Tokenizer | COMPLETE | 64K vocab, ~14% fewer tokens than Kumru/TabiBERT, ~2.7× vs GPT-4 |
| Phase 2: Architecture (v1) | COMPLETE | 24.7M params, ALiBi, GQA, SwiGLU, RMSNorm, weight tying; 60/60 tests |
| Phase 3a: Pretrain R1 | COMPLETE | 50K steps, 440 MB subset, loss 2.62, 2.0 hours, $4.76 |
| Phase 3b: Pretrain R2 | COMPLETE | 228K steps, 22 GB full corpus, 14.9B tokens, loss 3.46, 10.5 hours, $24.99 |
| Phase 3c: Pretrain R2.5 | COMPLETE | 228K steps, 2048 context, loss 3.33 (best 3.22), 26.5 hours, $63.08 |
| Phase 2b: Architecture (v2) | COMPLETE | 67.6M params, d_model=512, 2:1 GQA, 4.2× transformer capacity |
| Phase 3d: Pretrain V2 | IN PROGRESS | 228K steps on same 22 GB corpus, ~14.9B tokens |
| Phase 4a: SFT v1 (initial) | COMPLETE | 3,790 QA pairs, val loss 5.12 → 3.10, 100 seconds |
| Phase 4b: SFT data gen (v2) | COMPLETE | 707 groups, 11 rules, Claude Sonnet 4.6 — 7,595 QA pairs |
| Phase 4c: SFT training (v2) | IN PROGRESS | RAG-grounded SFT with 7,595 pairs on v2 model |
| Phase 5: RL (optional) | FUTURE | DPO or RLVR if SFT alone is insufficient |
The v2 model uses d_model=512, 2:1 GQA, and 4.2× more transformer parameters than v1, with width prioritized
over depth for context comprehension. A parallel SFT data generation pipeline using Claude Sonnet 4.6
produced 7,595 high-quality Turkish Q&A pairs from 532 ERP documentation chunks via 11 grouping
strategies. Total v1 pretraining cost: $92.83 (39 hours on H100 at $2.38/hour).
This report documents Phases 2 and 3 of an independent effort to build a Turkish language-native LLM from scratch. Phase 1 (Tokenizer) and Phase 4 (SFT) are documented separately.
© 2026 • Independent Research