SUPERVISED FINE-TUNING (SFT)
Phase 4 — From 3,790 to 10,000 QA Pairs: Two Generations of ERP Assistant Training
1. THE FULL PIPELINE
The model follows a standard modern LLM training pipeline. Each phase builds on the previous one, progressively narrowing the model’s capabilities from general language understanding to domain-specific instruction following.
Two generations of SFT are documented here: v1 (24.7M, proof of concept) and v2 (67.6M, production pipeline).
| Phase | Purpose | Data | Result |
|---|---|---|---|
| Tokenizer | Efficient Turkish text encoding | 22 GB, 11 domains | 64K vocab, 2.7× vs GPT-4 |
| V1 Architecture | Initial model design | — | 24.7M params, ALiBi/GQA/SwiGLU |
| Pretrain R1+R2 | Basic → deep Turkish | 22 GB corpus | 278K steps, loss 3.46 |
| V1 SFT | ERP domain proof-of-concept | 3,790 QA pairs (Opus 4.5) | 639 steps, val loss 5.12 → 3.10 |
| Pretrain R2.5 | 2048-token context for RAG | 22 GB corpus | 228K steps, loss 3.22 (best) |
| V2 Architecture | RAG-optimized model | — | 67.6M params, d_model=512, 2:1 GQA |
| V2 Pretrain | Full 67.6M pretraining | 22 GB corpus | IN PROGRESS |
| V2 SFT data | RAG-grounded ERP assistant | 7,595 pairs (Sonnet 4.6) | COMPLETE |
| RL (optional) | DPO / RLVR if SFT insufficient | TBD | FUTURE |
2. PRETRAINING HISTORY
The v1 model went through three pretraining rounds before SFT. Round 2 (shown below) established deep language understanding. Round 2.5 later extended context to 2048 tokens for RAG. The v2 model (67.6M) is a separate architecture pretrained from scratch. Full pretraining details are in the Architecture & Pretraining report.
Loss curve progression
| Step | Loss | LR | Tok/s | Sample Quality |
|---|---|---|---|---|
| 50,000 (R1) | 2.62 | 3.0e-04 | ~16K (MPS) | Basic Turkish grammar |
| 95,000 | ~3.60 | 2.0e-04 | ~399K (H100) | Simple sentences, some repetition |
| 145,000 | ~3.50 | 1.1e-04 | ~401K | Coherent sentences with context |
| 195,000 | ~3.47 | 4.4e-05 | ~402K | Factual content: “Kocaeli’ndeyiz” |
| 228,000 | 3.46 | 3.0e-05 | ~403K | Real-world knowledge, correct grammar |
Sample evolution during Round 2 is captured in the “Sample Quality” column of the table above.
Round 2.5 then resumed pretraining with max_seq_len=2048 and dropout=0.02 for RAG compatibility, reaching the best loss of 3.22 in 26.5 hours.
The v2 architecture (67.6M, d_model=512, 2:1 GQA) was then designed to break the
capacity ceiling — 4.2× more transformer parameters. See
Section 15 and
Section 16 of the Architecture report.
3. ERP DOCUMENTATION SOURCE
The SFT training data was generated from the Solen Kablo ERP system documentation — the same system the model is being built to assist with. The documentation was already pre-processed into structured chunks with rich metadata as part of a RAG (Retrieval-Augmented Generation) pipeline.
Modules covered
| Module | Description | Chunks (TR) | Chunks (EN) |
|---|---|---|---|
| Admin | User management, roles, authentication, system settings | ~80 | ~80 |
| Hammadde | Raw materials: purchase orders, QR tracking, supplier management | ~90 | ~85 |
| Stok | Inventory: warehouse management, stock levels, movements | ~85 | ~85 |
| Teknik | Cable database: specifications, standards, production recipes | ~95 | ~90 |
| Lab | Quality control: test procedures, measurements, certificates | ~60 | ~55 |
| Üretim | Production: work orders, machine management, scheduling | ~50 | ~50 |
| Satış | Sales: customer orders, quotations, delivery tracking | ~40 | ~45 |
| Finans | Finance: invoicing, payments, cost analysis | ~32 | ~52 |
| Total | | 532 | 542 |
Chunk metadata
Each chunk contains structured metadata used to guide the QA generation:
| Field | Purpose | Example |
|---|---|---|
| chunk_id | Unique identifier for resume capability | erp-mod-hammadde-tr_chunk_042 |
| module | ERP module name | Hammadde (Raw Materials Management) |
| section_heading | Documentation section | Sipariş Yönetimi |
| breadcrumb | Navigation path | Hammadde > Siparişler > Yeni Sipariş |
| language | Source language | tr or en |
| token_count | Chunk size (for filtering) | 186 |
| has_table | Contains tabular data | true |
| has_code | Contains code/API references | false |
| references_modules | Cross-module references | ["Stok", "Üretim"] |
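As an illustration, a chunk carrying this metadata might look like the following Python dict. The field values are taken from the example column above; the exact JSON layout of the chunk files is an assumption, and `is_cross_module` is a hypothetical helper showing one way such metadata can drive grouping decisions:

```python
# Illustrative chunk record built from the metadata fields above.
chunk = {
    "chunk_id": "erp-mod-hammadde-tr_chunk_042",
    "module": "Hammadde",
    "section_heading": "Sipariş Yönetimi",
    "breadcrumb": "Hammadde > Siparişler > Yeni Sipariş",
    "language": "tr",
    "token_count": 186,
    "has_table": True,
    "has_code": False,
    "references_modules": ["Stok", "Üretim"],
}

def is_cross_module(c: dict) -> bool:
    """A chunk that references other modules is a natural candidate
    for cross-module question generation."""
    return len(c.get("references_modules", [])) > 0

print(is_cross_module(chunk))  # True
```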
4. SYNTHETIC DATA GENERATION (V1)
The central challenge of SFT for a domain-specific model is data. Hand-writing thousands of QA pairs is impractical; using generic instruction datasets would not teach the model about Solen Kablo’s ERP system. The solution: use a large cloud LLM as a data conversion tool — not a knowledge source — to transform existing ERP documentation into training-ready QA pairs.
V1 design decisions
| Decision | V1 Choice | V2 Change |
|---|---|---|
| LLM | Claude Opus 4.5 | Claude Sonnet 4.6 (better question diversity) |
| Grouping | Individual chunks only | 11 rules: individual, submodule, module, cross-module, etc. |
| Target model | 24.7M, 512 context | 67.6M, 2048 context |
| Focus | User-centric questions | Same + technical questions (no audience restriction) |
| Format | Single + multi-turn (70/30) | Single-turn only (RAG: one question, one answer) |
| Difficulty | Graded, length-calibrated for 26M | Graded, max 200 words (uncapped format) |
| Volume | 3,790 pairs from 320 chunks | ~8–10K pairs from 707 groups |
5. PROMPT ENGINEERING (V1)
The v1 generation pipeline uses a two-part prompt: a comprehensive system prompt (the “Grand Prompt”) that defines the task, formats, rules, and examples; and a per-chunk user message that provides metadata and the documentation text. The v2 prompt is significantly redesigned — see Section 13.
System prompt structure
| Section | Purpose |
|---|---|
| SİSTEM HAKKINDA | Context about the ERP: 8 modules, cable factory, user types |
| TEK GÖREVİN | Single task definition: convert text to user-focused QA |
| HEDEF MODEL HAKKINDA | 26M parameter constraints: concise answers, no long paragraphs |
| İKİ TİP VERİ | Output format: single-turn and multi-turn JSON schemas |
| ZORLUK SEVİYELERİ | Difficulty definitions: kolay (1-2 sent), orta (2-4), zor (4-6) |
| ODAK NOKTASI | User-centric focus: “How do I use the system?” not code details |
| KRİTİK KURALLAR | 12 rules including anti-hallucination, translation, coverage |
| ÖRNEKLER | Good/bad examples showing desired vs undesired output style |
Answer length calibration for v1 (26M parameters)
| Difficulty | Target Length | Question Type | Example |
|---|---|---|---|
| Kolay (Easy) | 1–2 sentences | Single fact: “X nedir?” | “QR kod ne işe yarar?” |
| Orta (Medium) | 2–4 sentences | Process/steps: “X nasıl yapılır?” | “Sisteme yeni bakır girişi nasıl yapılır?” |
| Zor (Hard) | 4–6 sentences | Multi-fact/scenario | “Sipariş tarihi değişirse ve kısmi teslimat yapılmışsa ne yapmalıyım?” |
Bad vs good question examples (from the prompt)
| Type | Question | Problem / Reason |
|---|---|---|
| BAD | “raw_materials tablosunun sütunları nelerdir?” | Too technical — DB schema, not user question |
| BAD | “POST /api/materials endpoint’i ne döndürür?” | API detail — users don’t know endpoints |
| GOOD | “QR kod ne işe yarar?” | User-centric, natural language |
| GOOD | “Sisteme yeni bakır girişi nasıl yapılır?” | Process-focused, practical |
| GOOD | “Operatör kalay teslim aldığında ne yapmalı?” | Scenario-based, role-aware |
6. V1 QA PAIR STATISTICS
The v1 generation script processed 320 out of 529 Turkish chunks before API credits were exhausted (~$20 on Claude Opus 4.5). The resulting dataset was sufficient for the 26M parameter model. For v2 statistics (7,595 pairs from 707 groups), see Section 13.
Difficulty distribution
Generation cost
| Metric | Value |
|---|---|
| Model used | Claude Opus 4.5 (claude-opus-4-5-20251101) |
| Chunks processed | 320 / 529 Turkish chunks (61%) |
| API cost | ~$20 |
| Generation rate | ~24 items/minute |
| Items per chunk | ~11.7 average |
| Output file | sft_data/erp_qa_pairs.jsonl |
| ChatML file | sft_data/erp_sft_chatml.jsonl (2.64 MB) |
7. CHAT TEMPLATE & TOKENIZATION
The SFT data uses a Llama 3–style chat template, leveraging the special tokens already built into the tokenizer during Phase 1. This was a deliberate design decision: the tokenizer was built with instruction-tuning tokens before the model existed, anticipating this exact use case.
Special tokens used
| Token | ID | Role in Chat Template |
|---|---|---|
| `<|begin_of_text|>` | 0 | Start of conversation |
| `<|start_header_id|>` | 4 | Opens role header (system/user/assistant) |
| `<|end_header_id|>` | 5 | Closes role header |
| `<|eot_id|>` | 6 | End of turn marker |
| `<|pad|>` | 2 | Padding for batched training |
Template structure
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Sen Solen Kablo ERP sisteminin yapay zeka asistanısın...
<|eot_id|><|start_header_id|>user<|end_header_id|>

QR kod ne işe yarar?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

QR kod her hammaddeye atanan benzersiz bir takip kodudur...
<|eot_id|>
```
Multi-turn extension
For multi-turn conversations (2–3 turns), the template simply repeats the user/assistant blocks.
Each turn ends with <|eot_id|>, and the model learns to generate until it produces this token.
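As a sketch, the template above can be rendered with a small helper. `render_chatml` is a hypothetical function for illustration (the real implementation lives in `tiny_llm/sft_data.py` and may differ); the `\n\n` after each role header follows the masked prefix described in Section 8:

```python
SPECIAL = {
    "bos": "<|begin_of_text|>",
    "soh": "<|start_header_id|>",
    "eoh": "<|end_header_id|>",
    "eot": "<|eot_id|>",
}

def render_chatml(messages):
    """Render a conversation in the Llama 3-style template above.
    `messages` is a list of {"role": ..., "content": ...} dicts."""
    parts = [SPECIAL["bos"]]
    for m in messages:
        parts.append(f'{SPECIAL["soh"]}{m["role"]}{SPECIAL["eoh"]}\n\n')
        parts.append(m["content"])
        parts.append(SPECIAL["eot"])
    return "".join(parts)

text = render_chatml([
    {"role": "system", "content": "Sen Solen Kablo ERP sisteminin yapay zeka asistanısın."},
    {"role": "user", "content": "QR kod ne işe yarar?"},
])
print(text.count("<|eot_id|>"))  # 2
```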
8. LOSS MASKING STRATEGY
A critical detail of SFT: loss is computed only on assistant tokens. The model must learn to predict assistant responses, but it should not be penalized for failing to predict the system prompt or user questions (which are given as input, not generated).
Token-level loss mask
```
Token sequence:  [BOS] [HEADER:system] system content...   [EOT]
                 [HEADER:user]         user question...    [EOT]
                 [HEADER:asst]         assistant answer... [EOT]

Loss mask:       -1 -1 -1 -1 -1 ... -1        ← system (ignored)
                 -1 -1 -1 -1 -1 ... -1        ← user (ignored)
                 -1 -1 YES YES YES ... YES    ← assistant content + EOT (trained)
```
The ignore_index=-1 parameter in PyTorch’s cross_entropy function handles this natively.
Positions marked -1 contribute zero to the loss. Only the assistant’s content tokens and the
trailing <|eot_id|> are trained on.
Note that even the assistant’s `<|start_header_id|>assistant<|end_header_id|>\n\n`
prefix is masked. The model does not need to learn to predict the role header, since it is always
provided as part of the prompt template. Training only on content maximizes the signal-to-noise ratio
for any model’s parameter budget. This strategy is used for both v1 (26M) and v2 (67.6M).
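A minimal sketch of the masking, assuming a simplified span representation (the real `sft_data.py` derives assistant spans from the chat template rather than taking them as input):

```python
import torch
import torch.nn.functional as F

IGNORE = -1  # matches the ignore_index=-1 used by the training loop

def build_labels(input_ids, assistant_spans):
    """Mask everything except assistant content (+ trailing EOT).
    `assistant_spans` is a list of (start, end) index pairs, a
    simplified stand-in for real span detection."""
    labels = torch.full_like(input_ids, IGNORE)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

input_ids = torch.arange(10)                 # toy token ids
labels = build_labels(input_ids, [(6, 10)])  # last 4 tokens are assistant
logits = torch.randn(10, 64000)              # per-position vocab logits
loss = F.cross_entropy(logits, labels, ignore_index=IGNORE)
print(int((labels != IGNORE).sum()))  # 4 positions contribute to the loss
```

Positions labeled `-1` contribute exactly zero gradient, so only the four assistant tokens are trained on.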
9. V1 SFT TRAINING CONFIGURATION
| Parameter | Value | Rationale |
|---|---|---|
| Base checkpoint | step_228000.pt | Best available pretrained model |
| Epochs | 3 | Small dataset benefits from multiple passes |
| Batch size | 8 × 2 accum = 16 | Effective batch of 16 conversations |
| Learning rate | 2e-5 → 2e-6 | ~15× lower than pretraining (3e-4) |
| LR schedule | Cosine with warmup | 63 warmup steps (10% of total) |
| Dropout | 0.05 | Regularization for small dataset (was 0.0 in pretraining) |
| Weight decay | 0.01 | Lower than pretraining (0.1) |
| Gradient clip | 1.0 | Stability |
| Optimizer | AdamW (fresh) | New optimizer state, not resumed from pretraining |
| Validation split | 10% (379 conversations) | Held-out evaluation after each epoch |
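The schedule in the table can be sketched as follows. This is an illustrative reconstruction from the stated numbers (peak 2e-5, floor 2e-6, 63 warmup steps of 639 total), not the verbatim training-loop code:

```python
import math

PEAK_LR, MIN_LR = 2e-5, 2e-6
WARMUP, TOTAL = 63, 639

def lr_at(step: int) -> float:
    """Cosine schedule with linear warmup, as described above."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP      # linear warmup
    progress = (step - WARMUP) / (TOTAL - WARMUP)  # 0 → 1 over decay phase
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

print(lr_at(62))   # peak reached at the end of warmup: 2e-05
print(lr_at(638))  # near the 2e-06 floor at the final step
```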
10. V1 RESULTS
Per-epoch breakdown
| Epoch | Train Loss | Val Loss | Best? | Time |
|---|---|---|---|---|
| Before SFT | — | 5.12 | — | 0s |
| Epoch 1 | 8.53 | 3.34 | ★ | 28s |
| Epoch 2 | 6.55 | 3.14 | ★ | 60s |
| Epoch 3 | 6.07 | 3.10 | ★ Best | 100s |
Train loss is higher than val loss because the training loss is computed per-batch during gradient updates (with dropout active), while validation loss is computed over the full validation set with dropout disabled.
11. V1 MODEL OUTPUTS
Before SFT (pretrained only)
The pretrained model has no concept of the chat template, question-answering, or ERP knowledge. Given the structured prompt, it falls into degenerate repetition.
After SFT (epoch 3)
12. ANALYSIS & LESSONS LEARNED
What v1 SFT proved
- Turkish grammar and morphology are solid — learned during pretraining, preserved through SFT
- Domain vocabulary (hammadde, sipariş, tedarikçi, malzeme) is correctly used
- Chat template pattern (system → user → assistant) is reliably followed
- The model stops generating at the right point (produces <|eot_id|>)
- SFT on a tiny model is fast (<2 minutes) and cheap (<$0.10), enabling rapid iteration
V1 limitations that drove the v2 redesign
| V1 Limitation | Root Cause | V2 Solution |
|---|---|---|
| Repetitive outputs | 26M model capacity + no dropout + 512-token context | 67.6M model + dropout 0.02 + 2048 context |
| Generic, ungrounded answers | No RAG context in training — model guesses from memorized patterns | Every training example includes the source chunk as context |
| Only 60% chunk coverage | API credits exhausted at 320/529 chunks | All 532 chunks processed via 707 groups across 11 rules |
| Single-chunk questions only | No grouping strategy — each chunk processed individually | 11 grouping rules (submodule, cross-module, data flow, etc.) |
| No audience diversity | V1 prompt targeted “office worker” questions only | V2 generates both user-level and technical questions |
| Sonnet 4 / Opus 4.5 API | Good but not optimal for Turkish Q&A diversity | Sonnet 4.6 selected via 4-way comparative pilot |
Key insight: SFT is the critical phase at this scale
13. V2 SFT PIPELINE REDESIGN
The v1 SFT (3,790 pairs from Claude Opus 4.5) proved the concept but exposed limitations: only 320 of 529 chunks were processed before API credits ran out, the grouping strategy was basic (individual chunks only), and the 512-token context couldn’t fit real RAG prompts. The v2 pipeline was redesigned from scratch for the 67.6M RAG-optimized model.
Why redesign?
- Coverage gap: v1 only processed 60% of chunks. v2 processes all 532 chunks in multiple configurations.
- Single-chunk limitation: v1 generated Q&A from individual chunks only. v2 uses 11 grouping rules that combine chunks by submodule, module, cross-module, data flow, and more — producing questions that require synthesizing information across chunks.
- Model upgrade: v2 targets a 67.6M model with 2048-token context, enabling much richer RAG prompts with longer context chunks.
- API model quality: v1 used Claude Opus 4.5. A 4-way comparison showed Claude Sonnet 4.6 produces superior question diversity and inference depth.
Master grouping strategy: 11 rules
Each rule produces a different perspective on the same documentation, forcing diverse question types:
| # | Rule | Groups | Tokens | Description |
|---|---|---|---|---|
| 1 | Individual | 532 | 107K | Each chunk alone — factual, definition, basic procedure questions |
| 2 | Submodule | 86 | 107K | Chunks grouped by submodule — cross-chunk synthesis within a feature |
| 3 | Module | 8 | 107K | All chunks per module — high-level architectural questions |
| 4 | Vertical stack | 10 | 13K | Same feature at different depths (overview → detail → API) |
| 5 | Horizontal siblings | 16 | 18K | Parallel features at same depth — comparison questions |
| 6 | Cross-module | 6 | 7K | Related features across different modules |
| 7 | Overview + detail | 18 | 18K | Module overview paired with specific submodule details |
| 8 | Data flow chain | 4 | 7K | Sequential process chains (e.g., order → production → delivery) |
| 9 | Foundation + consumer | 10 | 6K | Base definitions paired with features that use them |
| 10 | Shared DB tables | 13 | 16K | Features sharing database tables — data integration questions |
| 11 | Module map | 4 | 3K | Full module structure maps for navigation questions |
| | Total | 707 | 408K | |
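Rule 2 (submodule grouping) can be sketched from the chunk metadata. This is a hypothetical reconstruction: it assumes the submodule is the second breadcrumb element, while the actual blueprint lives in `sft_chunk_groups.json` and may be derived differently:

```python
from collections import defaultdict

def group_by_submodule(chunks):
    """Bucket chunks by (module, submodule) for cross-chunk synthesis.
    Assumes the submodule is the second breadcrumb element."""
    groups = defaultdict(list)
    for c in chunks:
        parts = [p.strip() for p in c["breadcrumb"].split(">")]
        submodule = parts[1] if len(parts) > 1 else parts[0]
        groups[(c["module"], submodule)].append(c["chunk_id"])
    return dict(groups)

chunks = [
    {"chunk_id": "hammadde_chunk_001", "module": "Hammadde",
     "breadcrumb": "Hammadde > Siparişler > Yeni Sipariş"},
    {"chunk_id": "hammadde_chunk_002", "module": "Hammadde",
     "breadcrumb": "Hammadde > Siparişler > Sipariş Takibi"},
    {"chunk_id": "stok_chunk_001", "module": "Stok",
     "breadcrumb": "Stok > Depolar"},
]
print(len(group_by_submodule(chunks)))  # 2 groups
```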
API model comparison: 4-way pilot
Before committing to a single API model for all 707 groups, a controlled pilot was run on the same 10 chunk groups with 4 different models:
| Model | Pairs | Question Diversity | Answer Depth | Inference Quality | Repetition |
|---|---|---|---|---|---|
| Claude Sonnet 4 | 99 | Low — mostly factual | Adequate | Shallow | Some |
| GPT-5.2 | 101 | Moderate | Good | Good | Minimal |
| Claude Opus 4.6 | 100 | Good | Excellent | Very good | Minimal |
| Claude Sonnet 4.6 | 100 | Excellent | Excellent | Best — deep inference | Minimal |
V2 prompt engineering
The data generation uses a two-level prompt architecture:
1. SFT system prompt (embedded in every training example, 38 tokens):
ERP sistemi asistanısın. Verilen bağlam bilgilerini kullanarak soruyu yanıtla. Bağlamda cevap yoksa "Bu konuda bilgim yok" de.
(English: “You are the ERP system assistant. Answer the question using the provided context. If the answer is not in the context, say ‘Bu konuda bilgim yok’ [‘I have no information on this’].”)
- Proper Turkish characters (ı, ş, ü, ö, ç, ğ) — v1’s SYSTEM_TR used ASCII approximations, fixed here
- No audience restriction — model answers both office workers and technical users
- No format instructions — the model learns formatting from examples, not from the system prompt
- Built-in grounding: “if not in context, say I don’t know”
- 38 tokens is ultra-short — critical for a 2048-token context budget
2. API system prompt (sent to Claude Sonnet 4.6, not seen by the model):
A detailed Turkish-language instruction set covering:
- 6 core rules: context-only answers, natural questions, complete/correct, use technical terms as-is, independent pairs, no repetition
- 5 question types: olgusal (factual), prosedürel (procedural), karşılaştırmalı (comparative), çıkarımsal (inferential), liste (list)
- 4 answer formats: numbered steps for procedures, short paragraphs for definitions, bullet lists for enumerations, concise explanations for comparisons
- 3 difficulty levels: kolay ~40% (single-fact lookup), orta ~40% (multi-fact synthesis), zor ~20% (inference/comparison)
- Dynamic pair count: 5–28 pairs per group based on input token count
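The dynamic pair count can be sketched as a linear scaling with input size. Note the thresholds below are placeholders, not the values used by `sft_generate.py` (which the source does not document); only the 5–28 range is from the text:

```python
def pair_count(input_tokens: int, lo=5, hi=28, t_min=200, t_max=8000):
    """Scale the number of requested QA pairs from `lo` to `hi`
    with the group's input token count. t_min/t_max are assumed
    thresholds for this sketch."""
    if input_tokens <= t_min:
        return lo
    if input_tokens >= t_max:
        return hi
    frac = (input_tokens - t_min) / (t_max - t_min)
    return round(lo + frac * (hi - lo))

print(pair_count(186))   # small single chunk -> 5
print(pair_count(9000))  # large module group -> 28
```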
V2 generation: final results
| Metric | Value |
|---|---|
| Groups processed | 707 / 707 (100%) |
| QA pairs generated | 7,595 |
| Invalid pairs | 0 |
| Parse errors (retried) | 2 (both resolved on retry) |
| Avg / Min / Max pairs per group | 10.7 / 5 / 29 |
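The retry behavior noted in the table (2 parse errors, both resolved) can be sketched as a re-request loop around JSON parsing. `call_api` is a placeholder for the real Anthropic client call, not the pipeline's actual function name:

```python
import json

def parse_with_retry(call_api, max_retries=2):
    """Re-request the generation when the response is not valid JSON.
    `call_api` returns the raw text of one API response."""
    last_err = None
    for _ in range(max_retries + 1):
        raw = call_api()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_err = e  # malformed output; try again
    raise last_err

# Simulated API: first response truncated, second valid.
responses = iter(['{"pairs": [', '{"pairs": []}'])
print(parse_with_retry(lambda: next(responses)))  # {'pairs': []}
```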
Question type distribution
| Type | Count | Percentage |
|---|---|---|
| Olgusal (factual) | 3,850 | 50.7% |
| Çıkarımsal (inferential) | 1,164 | 15.3% |
| Liste (list) | 999 | 13.2% |
| Prosedürel (procedural) | 896 | 11.8% |
| Karşılaştırmalı (comparative) | 686 | 9.0% |
Difficulty distribution
Pairs by grouping rule
| # | Rule | Groups | Pairs | Avg/Group |
|---|---|---|---|---|
| 1 | Individual | 532 | 4,349 | 8.2 |
| 2 | Submodule | 86 | 1,445 | 16.8 |
| 3 | Module | 8 | 228 | 28.5 |
| 4 | Vertical stack | 10 | 203 | 20.3 |
| 5 | Horizontal siblings | 16 | 309 | 19.3 |
| 6 | Cross-module | 6 | 120 | 20.0 |
| 7 | Overview + detail | 18 | 351 | 19.5 |
| 8 | Data flow chain | 4 | 99 | 24.8 |
| 9 | Foundation + consumer | 10 | 150 | 15.0 |
| 10 | Shared DB tables | 13 | 271 | 20.8 |
| 11 | Module map | 4 | 70 | 17.5 |
| | Total | 707 | 7,595 | 10.7 |
V1 vs V2 comparison
| Metric | V1 | V2 |
|---|---|---|
| API model | Claude Opus 4.5 | Claude Sonnet 4.6 |
| Chunks processed | 320 / 529 (60%) | 532 / 532 (100%) |
| Grouping rules | 1 (individual only) | 11 rules |
| Total groups | 320 | 707 |
| QA pairs | 3,790 | 7,595 |
| Question types | Mixed, unstructured | 5 types, difficulty-graded |
| RAG context in training | No | Yes — every example |
| Output file | sft_data/erp_qa_pairs.jsonl | erp_rag/data/sft_raw_pairs.json (44.9 MB) |
| ChatML file | sft_data/erp_sft_chatml.jsonl (2.64 MB) | erp_rag/data/sft_train.jsonl (45.8 MB) |
14. REPRODUCIBILITY
Complete file inventory
V1 SFT files:
| File | Purpose | Size |
|---|---|---|
| scripts/generate_erp_qa.py | V1 QA generation (Claude Opus 4.5) | ~12 KB |
| sft_data/erp_qa_pairs.jsonl | V1 raw QA pairs | ~3 MB |
| sft_data/erp_sft_chatml.jsonl | V1 ChatML training data | 2.64 MB |
| tiny_llm/train_sft.py | V1 SFT training loop | ~14 KB |
| tiny_llm/checkpoints/sft/sft_best.pt | V1 best SFT checkpoint (epoch 3) | 94 MB |
V2 SFT files:
| File | Purpose | Size |
|---|---|---|
| erp_rag/generate/sft_generate.py | V2 data generation pipeline (Claude Sonnet 4.6) | ~18 KB |
| erp_rag/data/sft_chunk_groups.json | Master grouping blueprint (707 groups, 11 rules) | ~400 KB |
| erp_rag/data/sft_raw_pairs.json | V2 raw QA pairs with context | 44.9 MB |
| erp_rag/data/sft_train.jsonl | V2 ChatML training data (7,595 examples) | 45.8 MB |
| tiny_llm/train_sft_rag.py | V2 RAG-grounded SFT training loop | ~20 KB |
Shared files:
| File | Purpose | Size |
|---|---|---|
| tiny_llm/sft_data.py | Tokenization and assistant-only loss masking | ~4 KB |
| tiny_llm/chat.py | Interactive chat with SFT model | ~5 KB |
| erp_rag/data/chunks/all_chunks.json | Pre-processed ERP documentation (1,074 chunks) | 1.9 MB |
To reproduce
```
# V2 pipeline (recommended)

# 1. Generate QA pairs (requires Anthropic API key, ~$20-25)
pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
python -m erp_rag.generate.sft_generate \
    --provider anthropic --model claude-sonnet-4-6
# 707 groups → 7,595 pairs, auto-resume on interruption

# 2. Upload data to RunPod and run SFT training
python -m tiny_llm.train_sft_rag \
    --model v2 \
    --data erp_rag/data/sft_raw_pairs.json
# reads raw pairs, builds ChatML, trains with loss masking

# 3. Chat with the model
python -m tiny_llm.chat
```
Experiments tried & decisions locked
| Experiment | What Was Tried | Result | Decision |
|---|---|---|---|
| SFT on local Mac (MPS) | Ran train_sft.py on M4 MacBook | CPU hit 93°C; killed immediately | RunPod H100 only — locked |
| torch.compile checkpoint loading | Loaded step_228000.pt directly into non-compiled model | All keys mismatched (_orig_mod. prefix); loss ~20.0 | Always strip _orig_mod. prefix — locked |
| V1: Claude Opus 4.5 | 320/529 chunks processed before API credits exhausted ($20) | 3,790 QA pairs (3,170 single + 620 multi-turn) | Superseded by V2 pipeline |
| V2: Claude Sonnet 4.6 | 707/707 groups, 11 rules, 532 chunks, ~$20–25 | 7,595 QA pairs, 0 invalid, 5 types, 3 difficulty levels | V2 pipeline — locked. Production-ready SFT data. |
| SFT dropout 0.05 | Increased from pretraining’s 0.0 | Val loss improved steadily across 3 epochs; no overfitting | dropout 0.05 for SFT — locked |
| LR 2e-5 for SFT | 15× lower than pretraining peak (3e-4) | Stable training; val loss 5.12 → 3.10 across 3 epochs | LR 2e-5 for SFT — locked |
| 3 SFT epochs | Trained for 3 full epochs over 3,790 conversations | Best val loss at epoch 3 (3.10); still improving | Could train more epochs, but model already follows instructions |
| Fresh optimizer for SFT | New AdamW (did not carry pretraining optimizer state) | Correct approach: SFT loss surface differs from pretraining | Fresh optimizer for SFT — locked |
| Loss masking (assistant-only) | IGNORE=-1 for system/user tokens; only assistant+EOT contribute to loss | Model learns to generate answers, not memorize questions | Assistant-only loss — locked |
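The `_orig_mod.` checkpoint fix recorded in the table above can be sketched with a small helper (tensor values elided since only key handling matters; `str.removeprefix` requires Python 3.9+):

```python
def strip_compile_prefix(state_dict):
    """torch.compile wraps the model, so every saved parameter key gains
    a '_orig_mod.' prefix; strip it before loading into a plain model."""
    return {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}

# Values elided for the sketch; a real state_dict maps names to tensors.
sd = {"_orig_mod.tok_emb.weight": None, "_orig_mod.lm_head.weight": None}
print(sorted(strip_compile_prefix(sd)))  # ['lm_head.weight', 'tok_emb.weight']
```

Keys without the prefix pass through unchanged, so the same helper is safe on checkpoints saved from a non-compiled model.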
© 2026 • Independent Research