
SUPERVISED FINE-TUNING (SFT)

Phase 4 — From 3,790 to 7,595 QA Pairs: Two Generations of ERP Assistant Training

February 2026 • Independent Research • IN PROGRESS

3,790
SFT TRAINING EXAMPLES
39.5%
VAL LOSS IMPROVEMENT
100s
TOTAL TRAINING TIME
639
TRAINING STEPS
Abstract. This report documents two generations of supervised fine-tuning (SFT) for Turkish ERP assistants built from scratch. V1 (24.7M params, 512-token context) used Claude Opus 4.5 to generate 3,790 QA pairs from 320 ERP documentation chunks, proving that SFT works: validation loss dropped from 5.12 to 3.10 in 100 seconds on an H100 GPU. V2 (67.6M params, 2048-token context) redesigns the entire pipeline: 11 grouping strategies produce 707 chunk configurations from 532 source chunks, and Claude Sonnet 4.6 (selected via a 4-way API comparison) generated 7,595 diverse QA pairs with factual, procedural, comparative, inferential, and list-type questions at three difficulty levels. Every training example includes the source context chunk, training the model as a RAG context converter rather than a knowledge memorizer. Sections 1–12 document the completed v1 experiment; Section 13 documents the v2 redesign, whose data generation is complete while v2 pretraining remains in progress.

TABLE OF CONTENTS

1. The Full Pipeline 2. Pretraining History 3. ERP Documentation Source 4. Synthetic Data Generation (V1) 5. Prompt Engineering (V1) 6. V1 QA Pair Statistics 7. Chat Template & Tokenization 8. Loss Masking Strategy 9. V1 SFT Training Configuration 10. V1 Results 11. V1 Model Outputs 12. Analysis & Lessons Learned 13. V2 SFT Pipeline Redesign 14. Reproducibility

1. THE FULL PIPELINE

The model follows a standard modern LLM training pipeline. Each phase builds on the previous one, progressively narrowing the model’s capabilities from general language understanding to domain-specific instruction following.

Two generations of SFT are documented here: v1 (24.7M, proof of concept) and v2 (67.6M, production pipeline).

TOKENIZER (64K BPE) → V1: 24.7M (R1+R2) → V1 SFT (3,790 pairs) → V2: 67.6M (2048 ctx) → V2 SFT (7,595 pairs)
| Phase | Purpose | Data | Result |
|---|---|---|---|
| Tokenizer | Efficient Turkish text encoding | 22 GB, 11 domains | 64K vocab, 2.7× vs GPT-4 |
| V1 Architecture | Initial model design | — | 24.7M params, ALiBi/GQA/SwiGLU |
| Pretrain R1+R2 | Basic → deep Turkish | 22 GB corpus | 278K steps, loss 3.46 |
| V1 SFT | ERP domain proof-of-concept | 3,790 QA pairs (Opus 4.5) | 639 steps, val loss 5.12 → 3.10 |
| Pretrain R2.5 | 2048-token context for RAG | 22 GB corpus | 228K steps, loss 3.22 (best) |
| V2 Architecture | RAG-optimized model | — | 67.6M params, d_model=512, 2:1 GQA |
| V2 Pretrain | Full 67.6M pretraining | 22 GB corpus | IN PROGRESS |
| V2 SFT data | RAG-grounded ERP assistant | 7,595 pairs (Sonnet 4.6) | COMPLETE |
| RL (optional) | DPO / RLVR if SFT insufficient | TBD | FUTURE |

2. PRETRAINING HISTORY

The v1 model went through three pretraining rounds before SFT. Round 2 (shown below) established deep language understanding. Round 2.5 later extended context to 2048 tokens for RAG. The v2 model (67.6M) is a separate architecture pretrained from scratch. Full pretraining details are in the Architecture & Pretraining report.

228K
TOTAL STEPS
14.9B
TOKENS PROCESSED
10.5h
TRAINING TIME
3.46
FINAL LOSS
3.39
BEST LOSS

Loss curve progression

| Step | Loss | LR | Tok/s | Sample Quality |
|---|---|---|---|---|
| 50,000 (R1) | 2.62 | 3.0e-04 | ~16K (MPS) | Basic Turkish grammar |
| 95,000 | ~3.60 | 2.0e-04 | ~399K (H100) | Simple sentences, some repetition |
| 145,000 | ~3.50 | 1.1e-04 | ~401K | Coherent sentences with context |
| 195,000 | ~3.47 | 4.4e-05 | ~402K | Factual content: “Kocaeli’ndeyiz” |
| 228,000 | 3.46 | 3.0e-05 | ~403K | Real-world knowledge, correct grammar |
Note on loss difference: Round 1 loss (2.62) and Round 2 loss (3.46) are not directly comparable. Round 1 trained on a ~500MB curated subset; Round 2 trained on 22GB of diverse text spanning 11 domains. The higher absolute loss reflects the much harder prediction task across legal, medical, financial, news, and literary text — not a regression in model quality. Sample outputs confirm dramatically improved language understanding.

Sample evolution during Round 2

Step 95,000
“Merhaba deyin! Bu e-postayı seviyorum! Bu e-postanın amacı, bu e-postanın ana no…”
Step 145,000
“Türkiye Türkiye’de ve dünyada ekonomik gelişmeler açısından büyük önem taşımaktadır”
Step 195,000
“Türkiye’nin en büyük ikinci sanayi şehri konumundaki Kocaeli’ndeyiz. En önemli t…”
Step 200,000 — Best loss 3.39
“Türkiye’de tüm dünyada ‘insan hakları’ndan söz edildiği gibi, Türkiye’de de, ulu…”
Scaling law observation: At 26M parameters, the model is capacity-limited, not data-limited. Chinchilla-optimal training for 26M params is ~520M tokens (20× params), but the model processed 14.9B tokens (~573× params). Loss was still decreasing at step 228K but with diminishing returns — the model had exhausted most of its representational capacity.
What happened next: Round 2.5 retrained with max_seq_len=2048 and dropout=0.02 for RAG compatibility, achieving best loss 3.22 in 26.5 hours. The v2 architecture (67.6M, d_model=512, 2:1 GQA) was then designed to break the capacity ceiling — 4.2× more transformer parameters. See Section 15 and Section 16 of the Architecture report.

3. ERP DOCUMENTATION SOURCE

The SFT training data was generated from the Solen Kablo ERP system documentation — the same system the model is being built to assist with. The documentation was already pre-processed into structured chunks with rich metadata as part of a RAG (Retrieval-Augmented Generation) pipeline.

1,074
TOTAL CHUNKS
8
ERP MODULES
215K
TOTAL TOKENS
202
AVG TOKENS/CHUNK

Modules covered

| Module | Description | Chunks (TR) | Chunks (EN) |
|---|---|---|---|
| Admin | User management, roles, authentication, system settings | ~80 | ~80 |
| Hammadde | Raw materials: purchase orders, QR tracking, supplier management | ~90 | ~85 |
| Stok | Inventory: warehouse management, stock levels, movements | ~85 | ~85 |
| Teknik | Cable database: specifications, standards, production recipes | ~95 | ~90 |
| Lab | Quality control: test procedures, measurements, certificates | ~60 | ~55 |
| Üretim | Production: work orders, machine management, scheduling | ~50 | ~50 |
| Satış | Sales: customer orders, quotations, delivery tracking | ~40 | ~45 |
| Finans | Finance: invoicing, payments, cost analysis | ~32 | ~52 |
| Total | | 532 | 542 |

Chunk metadata

Each chunk contains structured metadata used to guide the QA generation:

| Field | Purpose | Example |
|---|---|---|
| chunk_id | Unique identifier for resume capability | erp-mod-hammadde-tr_chunk_042 |
| module | ERP module name | Hammadde (Raw Materials Management) |
| section_heading | Documentation section | Sipariş Yönetimi |
| breadcrumb | Navigation path | Hammadde > Siparişler > Yeni Sipariş |
| language | Source language | tr or en |
| token_count | Chunk size (for filtering) | 186 |
| has_table | Contains tabular data | true |
| has_code | Contains code/API references | false |
| references_modules | Cross-module references | ["Stok", "Üretim"] |
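A chunk record with this metadata might look like the following. This is a hypothetical example assembled from the fields in the table above, not a verbatim record from the actual chunk file, and the `is_eligible` helper is an illustrative assumption:

```python
# Hypothetical chunk record illustrating the metadata schema above;
# values follow the table's examples, not a verbatim export.
chunk = {
    "chunk_id": "erp-mod-hammadde-tr_chunk_042",
    "module": "Hammadde (Raw Materials Management)",
    "section_heading": "Sipariş Yönetimi",
    "breadcrumb": "Hammadde > Siparişler > Yeni Sipariş",
    "language": "tr",
    "token_count": 186,
    "has_table": True,
    "has_code": False,
    "references_modules": ["Stok", "Üretim"],
    "text": "...",  # the chunk's documentation text
}

def is_eligible(chunk: dict, min_tokens: int = 50) -> bool:
    """token_count exists for exactly this kind of filtering:
    skip fragments too small to yield meaningful QA pairs."""
    return chunk["token_count"] >= min_tokens
```

The `chunk_id` doubles as a resume key: a generator can skip IDs already present in its output file after an interruption.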

4. SYNTHETIC DATA GENERATION (V1)

The central challenge of SFT for a domain-specific model is data. Hand-writing thousands of QA pairs is impractical; using generic instruction datasets would not teach the model about Solen Kablo’s ERP system. The solution: use a large cloud LLM as a data conversion tool — not a knowledge source — to transform existing ERP documentation into training-ready QA pairs.

V1 vs V2 pipeline. This section describes the v1 approach (Claude Opus 4.5, individual chunks, 3,790 pairs). The v2 pipeline (Claude Sonnet 4.6, 11 grouping rules, 7,595 pairs) is documented in Section 13.
ERP Docs (HTML) → Chunk + Metadata → Claude Opus 4.5 → QA Pairs (JSONL) → ChatML Format

V1 design decisions

| Decision | V1 Choice | V2 Change |
|---|---|---|
| LLM | Claude Opus 4.5 | Claude Sonnet 4.6 (better question diversity) |
| Grouping | Individual chunks only | 11 rules: individual, submodule, module, cross-module, etc. |
| Target model | 24.7M, 512 context | 67.6M, 2048 context |
| Focus | User-centric questions | Same + technical questions (no audience restriction) |
| Format | Single + multi-turn (70/30) | Single-turn only (RAG: one question, one answer) |
| Difficulty | Graded, length-calibrated for 26M | Graded, max 200 words (uncapped format) |
| Volume | 3,790 pairs from 320 chunks | 7,595 pairs from 707 groups |
Anti-hallucination rule. The LLM was explicitly instructed: “HER cevap DOĞRUDAN verilen metin parçasından gelmeli. Metinde OLMAYAN bilgi EKLEME — hiçbir şey uydurma.” (Every answer must come DIRECTLY from the given text chunk. Do NOT add information that is NOT in the text — make nothing up.) This ensures the training data contains only verified information from the actual ERP documentation.

5. PROMPT ENGINEERING (V1)

The v1 generation pipeline uses a two-part prompt: a comprehensive system prompt (the “Grand Prompt”) that defines the task, formats, rules, and examples; and a per-chunk user message that provides metadata and the documentation text. The v2 prompt is significantly redesigned — see Section 13.

System prompt structure

| Section | Purpose |
|---|---|
| SİSTEM HAKKINDA (About the System) | Context about the ERP: 8 modules, cable factory, user types |
| TEK GÖREVİN (Your Single Task) | Single task definition: convert text to user-focused QA |
| HEDEF MODEL HAKKINDA (About the Target Model) | 26M parameter constraints: concise answers, no long paragraphs |
| İKİ TİP VERİ (Two Data Types) | Output format: single-turn and multi-turn JSON schemas |
| ZORLUK SEVİYELERİ (Difficulty Levels) | Difficulty definitions: kolay (1–2 sentences), orta (2–4), zor (4–6) |
| ODAK NOKTASI (Focus) | User-centric focus: “How do I use the system?” not code details |
| KRİTİK KURALLAR (Critical Rules) | 12 rules including anti-hallucination, translation, coverage |
| ÖRNEKLER (Examples) | Good/bad examples showing desired vs undesired output style |

Answer length calibration for v1 (26M parameters)

| Difficulty | Target Length | Question Type | Example |
|---|---|---|---|
| Kolay (Easy) | 1–2 sentences | Single fact: “X nedir?” | “QR kod ne işe yarar?” |
| Orta (Medium) | 2–4 sentences | Process/steps: “X nasıl yapılır?” | “Sisteme yeni bakır girişi nasıl yapılır?” |
| Zor (Hard) | 4–6 sentences | Multi-fact/scenario | “Sipariş tarihi değişirse ve kısmi teslimat yapılmışsa ne yapmalıyım?” |
Why calibrate length? (V1) A 26M parameter model cannot reliably generate long, complex answers. By constraining answer lengths during data generation, the model learns patterns it can actually reproduce. The v2 pipeline relaxes this for the 67.6M model: answers can be up to 200 words with flexible formatting (numbered steps, paragraphs, or lists), learned from examples rather than rigid templates.
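Length calibration of this kind is easy to enforce mechanically at generation time. A minimal sketch of such a validator, assuming the sentence bands from the table above (the crude regex sentence splitter is an illustrative simplification, not the project's actual code):

```python
import re

# Target sentence counts per difficulty, from the v1 calibration table.
LENGTH_BANDS = {"kolay": (1, 2), "orta": (2, 4), "zor": (4, 6)}

def within_band(answer: str, difficulty: str) -> bool:
    """Check a generated answer against its difficulty's sentence band.
    Sentence splitting here is a crude regex; a real validator would
    handle abbreviations and decimal points."""
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    lo, hi = LENGTH_BANDS[difficulty]
    return lo <= len(sentences) <= hi
```

Pairs failing the band can be dropped or regenerated before they reach the training set.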

Bad vs good question examples (from the prompt)

| Type | Question | Problem / Reason |
|---|---|---|
| BAD | “raw_materials tablosunun sütunları nelerdir?” | Too technical — DB schema, not user question |
| BAD | “POST /api/materials endpoint’i ne döndürür?” | API detail — users don’t know endpoints |
| GOOD | “QR kod ne işe yarar?” | User-centric, natural language |
| GOOD | “Sisteme yeni bakır girişi nasıl yapılır?” | Process-focused, practical |
| GOOD | “Operatör kalay teslim aldığında ne yapmalı?” | Scenario-based, role-aware |

6. V1 QA PAIR STATISTICS

The v1 generation script processed 320 out of 529 Turkish chunks before API credits were exhausted (~$20 on Claude Opus 4.5). The resulting dataset was sufficient for the 26M parameter model. For v2 statistics (7,595 pairs from 707 groups), see Section 13.

3,790
TOTAL QA ITEMS
3,170
SINGLE-TURN (84%)
620
MULTI-TURN (16%)
11.7
AVG ITEMS PER CHUNK

Difficulty distribution

1,451
KOLAY / EASY (38%)
1,593
ORTA / MEDIUM (42%)
746
ZOR / HARD (20%)

Generation cost

| Metric | Value |
|---|---|
| Model used | Claude Opus 4.5 (claude-opus-4-5-20251101) |
| Chunks processed | 320 / 529 Turkish chunks (61%) |
| API cost | ~$20 |
| Generation rate | ~24 items/minute |
| Items per chunk | ~11.7 average |
| Output file | sft_data/erp_qa_pairs.jsonl |
| ChatML file | sft_data/erp_sft_chatml.jsonl (2.64 MB) |

7. CHAT TEMPLATE & TOKENIZATION

The SFT data uses a Llama 3–style chat template, leveraging the special tokens already built into the tokenizer during Phase 1. This was a deliberate design decision: the tokenizer was built with instruction-tuning tokens before the model existed, anticipating this exact use case.

Special tokens used

| Token | ID | Role in Chat Template |
|---|---|---|
| <\|begin_of_text\|> | 0 | Start of conversation |
| <\|start_header_id\|> | 4 | Opens role header (system/user/assistant) |
| <\|end_header_id\|> | 5 | Closes role header |
| <\|eot_id\|> | 6 | End of turn marker |
| <\|pad\|> | 2 | Padding for batched training |

Template structure

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Sen Solen Kablo ERP sisteminin yapay zeka asistanısın...
<|eot_id|><|start_header_id|>user<|end_header_id|>

QR kod ne işe yarar?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

QR kod her hammaddeye atanan benzersiz bir takip kodudur...
<|eot_id|>

Multi-turn extension

For multi-turn conversations (2–3 turns), the template simply repeats the user/assistant blocks. Each turn ends with <|eot_id|>, and the model learns to generate until it produces this token.
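The template above can be rendered with a few lines of string assembly. This is a minimal sketch using the special-token strings from the table, not the project's actual formatting code, and it works for any number of turns:

```python
# Special-token strings from the tokenizer's chat-template vocabulary.
BOT, SH, EH, EOT = ("<|begin_of_text|>", "<|start_header_id|>",
                    "<|end_header_id|>", "<|eot_id|>")

def render(messages: list[dict]) -> str:
    """messages: [{"role": "system"|"user"|"assistant", "content": str}, ...]
    Each turn is a role header, a blank line, the content, then EOT."""
    out = BOT
    for m in messages:
        out += f"{SH}{m['role']}{EH}\n\n{m['content']}{EOT}"
    return out

convo = [
    {"role": "system", "content": "Sen Solen Kablo ERP sisteminin yapay zeka asistanısın."},
    {"role": "user", "content": "QR kod ne işe yarar?"},
    {"role": "assistant", "content": "QR kod her hammaddeye atanan benzersiz bir takip kodudur."},
]
print(render(convo))
```

Multi-turn conversations fall out for free: appending more user/assistant dicts simply repeats the blocks, each terminated by `<|eot_id|>`.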

8. LOSS MASKING STRATEGY

A critical detail of SFT: loss is computed only on assistant tokens. The model must learn to predict assistant responses, but it should not be penalized for failing to predict the system prompt or user questions (which are given as input, not generated).

Token-level loss mask

Token sequence:
[BOS] [HEADER:system] system content... [EOT]
      [HEADER:user]   user question... [EOT]
      [HEADER:asst]   assistant answer... [EOT]

Loss mask:
  -1    -1  -1  -1  -1  ...  -1     ← system (ignored)
  -1    -1  -1  -1  -1  ...  -1     ← user (ignored)
  -1    -1  YES YES YES ... YES    ← assistant content + EOT (trained)

The ignore_index=-1 parameter in PyTorch’s cross_entropy function handles this natively. Positions marked -1 contribute zero to the loss. Only the assistant’s content tokens and the trailing <|eot_id|> are trained on.

Why mask the header too? Even the <|start_header_id|>assistant<|end_header_id|>\n\n prefix is masked. The model doesn’t need to learn to predict the role header — it will always be provided as part of the prompt template. Training only on content maximizes the signal-to-noise ratio for any model’s parameter budget. This strategy is used for both v1 (26M) and v2 (67.6M).
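The masking rule can be sketched without any framework: given per-token role tags, emit -1 everywhere except assistant content and its closing `<|eot_id|>`. The role-tag representation below is an illustrative assumption (the real implementation lives in tiny_llm/sft_data.py):

```python
IGNORE = -1  # matches the ignore_index=-1 passed to cross_entropy

def build_labels(token_ids: list[int], roles: list[str]) -> list[int]:
    """roles[i] tags token i: 'system', 'user', 'assistant_header',
    or 'assistant'. Only plain 'assistant' tokens (content + EOT)
    keep their id as a label; every other position is ignored."""
    return [tid if role == "assistant" else IGNORE
            for tid, role in zip(token_ids, roles)]

# Toy sequence: 3 system tokens, 2 user tokens, 2 assistant-header
# tokens, then 2 assistant content tokens and the EOT (id 6).
ids = [10, 11, 12, 20, 21, 4, 5, 30, 31, 6]
roles = (["system"] * 3 + ["user"] * 2
         + ["assistant_header"] * 2 + ["assistant"] * 3)
labels = build_labels(ids, roles)
```

With these labels, `F.cross_entropy(logits.view(-1, V), labels_tensor, ignore_index=-1)` trains only on the assistant span, exactly as the mask diagram above shows.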

9. V1 SFT TRAINING CONFIGURATION

| Parameter | Value | Rationale |
|---|---|---|
| Base checkpoint | step_228000.pt | Best available pretrained model |
| Epochs | 3 | Small dataset benefits from multiple passes |
| Batch size | 8 × 2 accum = 16 | Effective batch of 16 conversations |
| Learning rate | 2e-5 → 2e-6 | ~15× lower than pretraining (3e-4) |
| LR schedule | Cosine with warmup | 63 warmup steps (10% of total) |
| Dropout | 0.05 | Regularization for small dataset (was 0.0 in pretraining) |
| Weight decay | 0.01 | Lower than pretraining (0.1) |
| Gradient clip | 1.0 | Stability |
| Optimizer | AdamW (fresh) | New optimizer state, not resumed from pretraining |
| Validation split | 10% (379 conversations) | Held-out evaluation after each epoch |
3,411
TRAIN CONVERSATIONS
379
VALIDATION CONVERSATIONS
141K
TRAINABLE TOKENS (ASSISTANT ONLY)
Why a fresh optimizer? The pretrained model’s AdamW momentum was tuned for next-token prediction on raw Turkish text. SFT is a fundamentally different task (instruction following) with a much smaller learning rate. Reusing the old optimizer state would inject stale momentum from pretraining, potentially causing instability. A fresh optimizer allows clean adaptation.
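The schedule in the table (63 warmup steps, then cosine decay from 2e-5 to 2e-6 over 639 steps) can be written in a few lines. This is a sketch of the standard warmup-plus-cosine form, not the exact code from train_sft.py:

```python
import math

MAX_LR, MIN_LR = 2e-5, 2e-6
WARMUP, TOTAL = 63, 639

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR over WARMUP steps,
    then cosine decay down to MIN_LR at step TOTAL."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

At step 62 the rate peaks at 2e-5, and at step 639 it bottoms out at exactly 2e-6.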

10. V1 RESULTS

5.12
VAL LOSS BEFORE SFT
3.10
VAL LOSS AFTER SFT
2.01
ABSOLUTE IMPROVEMENT
39.5%
RELATIVE IMPROVEMENT

Per-epoch breakdown

| Epoch | Train Loss | Val Loss | Best? | Time |
|---|---|---|---|---|
| Before SFT | — | 5.12 | | 0s |
| Epoch 1 | 8.53 | 3.34 | | 28s |
| Epoch 2 | 6.55 | 3.14 | | 60s |
| Epoch 3 | 6.07 | 3.10 | ★ Best | 100s |

Train loss is higher than val loss because the reported train loss averages all batches across the epoch — including the early, still-high-loss steps — and is computed with dropout active, while validation loss is measured once at the end of the epoch over the full validation set with dropout disabled.

Key result: In 100 seconds of SFT training on an H100 GPU, the model went from producing incoherent repetitions (“Digital program Digital program…”) to generating domain-relevant Turkish answers about the ERP system. The validation loss was still decreasing at epoch 3, suggesting more epochs could yield further improvement.

11. V1 MODEL OUTPUTS

Before SFT (pretrained only)

BEFORE SFT — Question: “QR kod ne işe yarar?”
“Digital program Digital program Digital program Digital program Digital program…”

The pretrained model has no concept of the chat template, question-answering, or ERP knowledge. Given the structured prompt, it falls into degenerate repetition.

After SFT (epoch 3)

AFTER SFT — Question: “QR kod ne işe yarar?”
“QR kod, bir malzemenin hangi malzeme türüne ait olduğunu takip eder.”
AFTER SFT — Question: “Sisteme yeni bakır girişi nasıl yapılır?”
“Sistem bakır için yeni bakır girişini otomatik olarak hesaplar. Bu sayede bakırın tüm üretim süreci takip edilir.”
AFTER SFT — Question: “Hammadde modülünde neler yapabilirim?”
“Hammadde modülünde farklı hammaddeler için malzemeler listesi oluşturulur.”
AFTER SFT — Question: “Satış siparişi nasıl oluşturulur?”
“Satış siparişi otomatik olarak açılır menüde gösterilir. Bu menüde sipariş detayları, tarih ve saat alanları, teslim edilen miktar, teslimat tarihi alanlarına ağırlık ayarlarıyla birlikte gösterilir.”
Observations: The model (1) speaks coherent Turkish with correct grammar and suffixes, (2) stays on-topic about ERP operations, (3) uses domain vocabulary correctly (malzeme, sipariş, teslimat, modül), and (4) follows the question-answer pattern learned during SFT. Answers are concise, matching the difficulty-calibrated length targets. Some answers remain somewhat generic — expected after only 3 epochs of SFT on a 26M model.

12. ANALYSIS & LESSONS LEARNED

What v1 SFT proved

In 100 seconds on an H100, 3,790 synthetic QA pairs turned the pretrained 24.7M model from degenerate repetition into a coherent Turkish ERP assistant: the chat template and assistant-only loss masking worked as designed, validation loss fell 39.5%, and answers stayed on-topic with correct domain vocabulary. The approach is sound; the remaining problems were model capacity and data coverage.

V1 limitations that drove the v2 redesign

| V1 Limitation | Root Cause | V2 Solution |
|---|---|---|
| Repetitive outputs | 26M model capacity + no dropout + 512-token context | 67.6M model + dropout 0.02 + 2048 context |
| Generic, ungrounded answers | No RAG context in training — model guesses from memorized patterns | Every training example includes the source chunk as context |
| Only 60% chunk coverage | API credits exhausted at 320/529 chunks | All 532 chunks processed via 707 groups across 11 rules |
| Single-chunk questions only | No grouping strategy — each chunk processed individually | 11 grouping rules (submodule, cross-module, data flow, etc.) |
| No audience diversity | V1 prompt targeted “office worker” questions only | V2 generates both user-level and technical questions |
| Sonnet 4 / Opus 4.5 API | Good but not optimal for Turkish Q&A diversity | Sonnet 4.6 selected via 4-way comparative pilot |

Key insight: SFT is the critical phase at this scale

For models under ~1B parameters, SFT training data quality is the single most important factor. RL techniques (DPO, RLHF, RLVR) provide diminishing returns at this scale because the model lacks the capacity to benefit from fine-grained preference signals. Instead of chasing RL, the strategy is: (1) maximize SFT data quality via better prompts, diverse grouping, and superior API models; (2) improve the retrieval pipeline so the model sees better context at inference time; (3) iterate on the RAG prompt engineering for the deployed system. RL remains an optional future phase if SFT alone proves insufficient.
Experiment: intentional wrong answers in SFT. An earlier experiment injected incorrect answers that self-corrected mid-response (e.g., “wait, this is wrong…”). Result: overall model quality degraded significantly. Never inject bad answers into SFT data. If preference learning is needed, use DPO with separate (chosen, rejected) pairs instead.

13. V2 SFT PIPELINE REDESIGN

The v1 SFT (3,790 pairs from Claude Opus 4.5) proved the concept but exposed limitations: only 320 of 529 chunks were processed before API credits ran out, the grouping strategy was basic (individual chunks only), and the 512-token context couldn’t fit real RAG prompts. The v2 pipeline was redesigned from scratch for the 67.6M RAG-optimized model.

707
CHUNK GROUPS
11
GROUPING RULES
532
SOURCE CHUNKS
408K
INPUT TOKENS
7,595
QA PAIRS GENERATED

Why redesign?

Each v1 limitation in Section 12 maps to a v2 design change: full coverage of all 532 chunks replaces the 60% cutoff, 11 grouping rules replace single-chunk processing, the source chunk is embedded in every training example, and the 2048-token context window fits real retrieval prompts.

Master grouping strategy: 11 rules

Each rule produces a different perspective on the same documentation, forcing diverse question types:

| # | Rule | Groups | Tokens | Description |
|---|---|---|---|---|
| 1 | Individual | 532 | 107K | Each chunk alone — factual, definition, basic procedure questions |
| 2 | Submodule | 86 | 107K | Chunks grouped by submodule — cross-chunk synthesis within a feature |
| 3 | Module | 8 | 107K | All chunks per module — high-level architectural questions |
| 4 | Vertical stack | 10 | 13K | Same feature at different depths (overview → detail → API) |
| 5 | Horizontal siblings | 16 | 18K | Parallel features at same depth — comparison questions |
| 6 | Cross-module | 6 | 7K | Related features across different modules |
| 7 | Overview + detail | 18 | 18K | Module overview paired with specific submodule details |
| 8 | Data flow chain | 4 | 7K | Sequential process chains (e.g., order → production → delivery) |
| 9 | Foundation + consumer | 10 | 6K | Base definitions paired with features that use them |
| 10 | Shared DB tables | 13 | 16K | Features sharing database tables — data integration questions |
| 11 | Module map | 4 | 3K | Full module structure maps for navigation questions |
| Total | | 707 | 408K | |
Distribution note. Rule 1 (individual chunks) produces 75% of all groups (532/707) and, in the final dataset, 57% of pairs (4,349 of 7,595). This is intentional: the majority of real user queries will be answerable from a single retrieved chunk. The remaining 10 rules (175 groups, 3,246 pairs) train the model on harder multi-chunk reasoning that occurs when the retriever returns related but separate passages.
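Rule 2 (submodule grouping) is essentially a group-by over chunk metadata. A sketch under the assumption that the breadcrumb's first two segments identify a submodule (the real blueprint lives in sft_chunk_groups.json):

```python
from collections import defaultdict

def group_by_submodule(chunks: list[dict]) -> dict[str, list[dict]]:
    """Bucket chunks by the first two breadcrumb segments, e.g.
    'Hammadde > Siparişler > Yeni Sipariş' -> 'Hammadde > Siparişler'."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for c in chunks:
        key = " > ".join(c["breadcrumb"].split(" > ")[:2])
        groups[key].append(c)
    return dict(groups)

# Toy chunks with hypothetical breadcrumbs.
chunks = [
    {"chunk_id": "a", "breadcrumb": "Hammadde > Siparişler > Yeni Sipariş"},
    {"chunk_id": "b", "breadcrumb": "Hammadde > Siparişler > Sipariş Listesi"},
    {"chunk_id": "c", "breadcrumb": "Stok > Depolar > Depo Ekle"},
]
groups = group_by_submodule(chunks)
```

The other rules follow the same pattern with different keys (module name, shared DB table, process chain position).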

API model comparison: 4-way pilot

Before committing to a single API model for all 707 groups, a controlled pilot was run on the same 10 chunk groups with 4 different models:

| Model | Pairs | Question Diversity | Answer Depth | Inference Quality | Repetition |
|---|---|---|---|---|---|
| Claude Sonnet 4 | 99 | Low — mostly factual | Adequate | Shallow | Some |
| GPT-5.2 | 101 | Moderate | Good | Good | Minimal |
| Claude Opus 4.6 | 100 | Good | Excellent | Very good | Minimal |
| Claude Sonnet 4.6 | 100 | Excellent | Excellent | Best — deep inference | Minimal |
Winner: Claude Sonnet 4.6. Sonnet 4.6 produced the most diverse question types (factual, procedural, comparative, inferential, list) with the deepest inference-based questions. It consistently generated questions that require combining information from multiple parts of the context, rather than simple lookups. Sonnet 4 was notably weaker — its questions were almost entirely factual with shallow answers. Opus 4.6 was close but slightly less diverse.

V2 prompt engineering

The data generation uses a two-level prompt architecture:

1. SFT system prompt (embedded in every training example, 38 tokens):

ERP sistemi asistanısın. Verilen bağlam bilgilerini kullanarak
soruyu yanıtla. Bağlamda cevap yoksa "Bu konuda bilgim yok" de.
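Combining this embedded system prompt with a retrieved chunk yields one RAG-grounded training example. A minimal sketch — the exact layout of the user turn (the "Bağlam:"/"Soru:" labels) is an illustrative assumption, not the verbatim format of sft_train.jsonl:

```python
SYSTEM = ('ERP sistemi asistanısın. Verilen bağlam bilgilerini kullanarak '
          'soruyu yanıtla. Bağlamda cevap yoksa "Bu konuda bilgim yok" de.')

def to_chatml(context: str, question: str, answer: str) -> list[dict]:
    """One RAG-grounded training example: the source chunk travels
    inside the user turn, so the model learns context -> answer
    rather than memorizing facts."""
    user = f"Bağlam:\n{context}\n\nSoru: {question}"
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
        {"role": "assistant", "content": answer},
    ]
```

At inference time the retriever fills the same slot the source chunk occupied during training, which is the whole point of the context-converter design.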
System prompt design decisions: the prompt is deliberately short (38 tokens) so retrieved context dominates the window, it is written in Turkish to match the deployment language, and it hard-codes a single refusal phrase (“Bu konuda bilgim yok”) so that out-of-context behavior is trainable and easy to detect.

2. API system prompt (sent to Claude Sonnet 4.6, not seen by the model):

A detailed Turkish-language instruction set covering the five question types (factual, procedural, comparative, inferential, list), the three difficulty levels, the 200-word answer cap, and the anti-hallucination rule that every answer must be grounded in the provided chunks.

V2 generation: final results

7,595
TOTAL QA PAIRS
707/707
GROUPS COMPLETE
10.7
AVG PAIRS/GROUP
0
INVALID PAIRS
44.9 MB
RAW DATA SIZE
| Metric | Value |
|---|---|
| Groups processed | 707 / 707 (100%) |
| QA pairs generated | 7,595 |
| Invalid pairs | 0 |
| Parse errors (retried) | 2 (both resolved on retry) |
| Avg / Min / Max pairs per group | 10.7 / 5 / 29 |
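The "2 parse errors, both resolved on retry" row reflects a simple retry loop around JSON parsing. A generic sketch — the `generate` callable here is a stand-in for the real API call, not the project's actual wrapper:

```python
import json
from typing import Callable

def generate_with_retry(generate: Callable[[], str], max_tries: int = 3) -> list:
    """Call the (stand-in) generator, parse its JSON output, and retry
    on malformed responses. Raises after max_tries failed attempts."""
    last_err = None
    for _ in range(max_tries):
        try:
            pairs = json.loads(generate())
            if isinstance(pairs, list):
                return pairs  # well-formed list of QA dicts
        except json.JSONDecodeError as e:
            last_err = e  # malformed JSON: try again
    raise ValueError(f"unparseable after {max_tries} tries: {last_err}")
```

Combined with per-group output files keyed by group ID, this makes the pipeline resumable after interruptions as well as robust to occasional malformed completions.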

Question type distribution

| Type | Count | Percentage |
|---|---|---|
| Olgusal (factual) | 3,850 | 50.7% |
| Çıkarımsal (inferential) | 1,164 | 15.3% |
| Liste (list) | 999 | 13.2% |
| Prosedürel (procedural) | 896 | 11.8% |
| Karşılaştırmalı (comparative) | 686 | 9.0% |

Difficulty distribution

3,791
KOLAY / EASY (49.9%)
2,629
ORTA / MEDIUM (34.6%)
1,175
ZOR / HARD (15.5%)

Pairs by grouping rule

| # | Rule | Groups | Pairs | Avg/Group |
|---|---|---|---|---|
| 1 | Individual | 532 | 4,349 | 8.2 |
| 2 | Submodule | 86 | 1,445 | 16.8 |
| 3 | Module | 8 | 228 | 28.5 |
| 4 | Vertical stack | 10 | 203 | 20.3 |
| 5 | Horizontal siblings | 16 | 309 | 19.3 |
| 6 | Cross-module | 6 | 120 | 20.0 |
| 7 | Overview + detail | 18 | 351 | 19.5 |
| 8 | Data flow chain | 4 | 99 | 24.8 |
| 9 | Foundation + consumer | 10 | 150 | 15.0 |
| 10 | Shared DB tables | 13 | 271 | 20.8 |
| 11 | Module map | 4 | 70 | 17.5 |
| Total | | 707 | 7,595 | 10.7 |
V1 → V2 final comparison. V1 produced 3,790 pairs from 320 individual chunks using Claude Opus 4.5. V2 produced 7,595 pairs from 707 multi-configuration groups using Claude Sonnet 4.6 — a 2.0× increase in data volume with substantially higher question diversity (11 grouping rules vs 1, 5 question types, 3 difficulty levels).
| Metric | V1 | V2 |
|---|---|---|
| API model | Claude Opus 4.5 | Claude Sonnet 4.6 |
| Chunks processed | 320 / 529 (60%) | 532 / 532 (100%) |
| Grouping rules | 1 (individual only) | 11 rules |
| Total groups | 320 | 707 |
| QA pairs | 3,790 | 7,595 |
| Question types | Mixed, unstructured | 5 types, difficulty-graded |
| RAG context in training | No | Yes — every example |
| Output file | sft_data/erp_qa_pairs.jsonl | erp_rag/data/sft_raw_pairs.json (44.9 MB) |
| ChatML file | sft_data/erp_sft_chatml.jsonl (2.64 MB) | erp_rag/data/sft_train.jsonl (45.8 MB) |

14. REPRODUCIBILITY

Complete file inventory

V1 SFT files:

| File | Purpose | Size |
|---|---|---|
| scripts/generate_erp_qa.py | V1 QA generation (Claude Opus 4.5) | ~12 KB |
| sft_data/erp_qa_pairs.jsonl | V1 raw QA pairs | ~3 MB |
| sft_data/erp_sft_chatml.jsonl | V1 ChatML training data | 2.64 MB |
| tiny_llm/train_sft.py | V1 SFT training loop | ~14 KB |
| tiny_llm/checkpoints/sft/sft_best.pt | V1 best SFT checkpoint (epoch 3) | 94 MB |

V2 SFT files:

| File | Purpose | Size |
|---|---|---|
| erp_rag/generate/sft_generate.py | V2 data generation pipeline (Claude Sonnet 4.6) | ~18 KB |
| erp_rag/data/sft_chunk_groups.json | Master grouping blueprint (707 groups, 11 rules) | ~400 KB |
| erp_rag/data/sft_raw_pairs.json | V2 raw QA pairs with context | 44.9 MB |
| erp_rag/data/sft_train.jsonl | V2 ChatML training data (7,595 examples) | 45.8 MB |
| tiny_llm/train_sft_rag.py | V2 RAG-grounded SFT training loop | ~20 KB |

Shared files:

| File | Purpose | Size |
|---|---|---|
| tiny_llm/sft_data.py | Tokenization and assistant-only loss masking | ~4 KB |
| tiny_llm/chat.py | Interactive chat with SFT model | ~5 KB |
| erp_rag/data/chunks/all_chunks.json | Pre-processed ERP documentation (1,074 chunks) | 1.9 MB |

To reproduce

# V2 pipeline (recommended)

# 1. Generate QA pairs (requires Anthropic API key, ~$20-25)
pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
python -m erp_rag.generate.sft_generate \
    --provider anthropic --model claude-sonnet-4-6   # 707 groups → 7,595 pairs, auto-resume on interruption

# 2. Upload data to RunPod and run SFT training
python -m tiny_llm.train_sft_rag \
    --model v2 \
    --data erp_rag/data/sft_raw_pairs.json           # reads raw pairs, builds ChatML, trains with loss masking

# 3. Chat with the model
python -m tiny_llm.chat

Experiments tried & decisions locked

DO NOT RE-TRY: These SFT experiments and configurations were already evaluated.
| Experiment | What Was Tried | Result | Decision |
|---|---|---|---|
| SFT on local Mac (MPS) | Ran train_sft.py on M4 MacBook | CPU hit 93°C; killed immediately | RunPod H100 only — locked |
| torch.compile checkpoint loading | Loaded step_228000.pt directly into non-compiled model | All keys mismatched (_orig_mod. prefix); loss ~20.0 | Always strip _orig_mod. prefix — locked |
| V1: Claude Opus 4.5 | 320/529 chunks processed before API credits exhausted ($20) | 3,790 QA pairs (3,170 single + 620 multi-turn) | Superseded by V2 pipeline |
| V2: Claude Sonnet 4.6 | 707/707 groups, 11 rules, 532 chunks, ~$20–25 | 7,595 QA pairs, 0 invalid, 5 types, 3 difficulty levels | V2 pipeline — locked; production-ready SFT data |
| SFT dropout 0.05 | Increased from pretraining’s 0.0 | Val loss improved steadily across 3 epochs; no overfitting | Dropout 0.05 for SFT — locked |
| LR 2e-5 for SFT | 15× lower than pretraining peak (3e-4) | Stable training; val loss 5.12 → 3.10 across 3 epochs | LR 2e-5 for SFT — locked |
| 3 SFT epochs | Trained 3 full epochs over 3,411 train conversations | Best val loss at epoch 3 (3.10); still improving | Could train more epochs, but model already follows instructions |
| Fresh optimizer for SFT | New AdamW (did not carry pretraining optimizer state) | Correct approach: SFT loss surface differs from pretraining | Fresh optimizer for SFT — locked |
| Loss masking (assistant-only) | IGNORE=-1 for system/user tokens; only assistant+EOT contribute to loss | Model learns to generate answers, not memorize questions | Assistant-only loss — locked |
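The torch.compile fix noted in the experiments table is a pure key rename on the checkpoint's state dict. Sketched here without torch (the same dict transform applies to the real checkpoint before `load_state_dict`):

```python
def strip_compile_prefix(state_dict: dict) -> dict:
    """torch.compile wraps the model in an OptimizedModule and saves
    parameters under '_orig_mod.'; strip that prefix so the keys
    match a non-compiled model's state dict."""
    prefix = "_orig_mod."
    return {(k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in state_dict.items()}
```

Loading without this rename is exactly the failure mode recorded above: every key mismatches and the model behaves as if randomly initialized (loss ~20).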
Cost breakdown for the entire project (Phases 1–4): Tokenizer training: free (CPU). Pretraining v1 (R1+R2+R2.5): ~$92.83 (39h × $2.38/hr). V1 SFT data generation: ~$20 (Claude Opus 4.5 API). V1 SFT training: <$0.10 (100 seconds on H100). V2 SFT data generation: ~$22 (Claude Sonnet 4.6 API, 707 calls). Total so far: ~$135 (excluding v2 pretraining).
Conclusion. Two generations of SFT have been built for this project. V1 proved the concept: 3,790 QA pairs from Claude Opus 4.5 transformed a 24.7M pretrained model into a functional ERP assistant in 100 seconds. V2 redesigns everything at scale: 11 grouping strategies produce 707 chunk groups from 532 ERP documentation chunks, and Claude Sonnet 4.6 (selected via a 4-way comparative pilot) generated 7,595 diverse QA pairs spanning factual, procedural, comparative, inferential, and list-type questions at three difficulty levels. The target model (67.6M v2 with 2048-token context) was purpose-built as a RAG context converter. The entire pipeline — tokenizer, architecture, pretraining, data generation, SFT training — remains built from scratch with no pretrained weights, no HuggingFace trainers, and no off-the-shelf datasets. V2 SFT data generation is complete: 7,595 pairs, ready for training. Total project cost to date: ~$135.

© 2026 • Independent Research