
SUPERVISED FINE-TUNING (SFT)

Phase 4 — From 3,790 to 7,595 QA Pairs: Two Generations of ERP Assistant Training

February 2026 • Independent Research • IN PROGRESS

3,790
SFT TRAINING EXAMPLES
39.5%
VAL LOSS IMPROVEMENT
100s
TOTAL TRAINING TIME
639
TRAINING STEPS
Abstract. This report documents two generations of supervised fine-tuning (SFT) for Turkish ERP assistants built from scratch. V1 (24.7M params, 512-token context) used Claude Opus 4.5 to generate 3,790 QA pairs from 320 ERP documentation chunks, proving that SFT works: validation loss dropped from 5.12 to 3.10 in 100 seconds on an H100 GPU. V2 (67.6M params, 2048-token context) redesigns the entire pipeline: 11 grouping strategies produce 707 chunk configurations from 532 source chunks, and Claude Sonnet 4.6 (selected via a 4-way API comparison) generated 7,595 diverse QA pairs with factual, procedural, comparative, inferential, and list-type questions at three difficulty levels. Every training example includes the source context chunk, training the model as a RAG context converter rather than a knowledge memorizer. Sections 1–12 document the completed v1 experiment; Section 13 documents the v2 redesign, whose data generation is complete while v2 pretraining remains in progress.

TABLE OF CONTENTS

1. The Full Pipeline 2. Pretraining History 3. ERP Documentation Source 4. Synthetic Data Generation (V1) 5. Prompt Engineering (V1) 6. V1 QA Pair Statistics 7. Chat Template & Tokenization 8. Loss Masking Strategy 9. V1 SFT Training Configuration 10. V1 Results 11. V1 Model Outputs 12. Analysis & Lessons Learned 13. V2 SFT Pipeline Redesign 14. Reproducibility

1. THE FULL PIPELINE

The model follows a standard modern LLM training pipeline. Each phase builds on the previous one, progressively narrowing the model’s capabilities from general language understanding to domain-specific instruction following.

Two generations of SFT are documented here: v1 (24.7M, proof of concept) and v2 (67.6M, production pipeline).

TOKENIZER (64K BPE) → V1: 24.7M (R1+R2) → V1 SFT (3,790 pairs) → V2: 67.6M (2048 ctx) → V2 SFT (7,595 pairs)
| Phase | Purpose | Data | Result |
|---|---|---|---|
| Tokenizer | Efficient Turkish text encoding | 22 GB, 11 domains | 64K vocab, 2.7× vs GPT-4 |
| V1 Architecture | Initial model design | — | 24.7M params, ALiBi/GQA/SwiGLU |
| Pretrain R1+R2 | Basic → deep Turkish | 22 GB corpus | 278K steps, loss 3.46 |
| V1 SFT | ERP domain proof-of-concept | 3,790 QA pairs (Opus 4.5) | 639 steps, val loss 5.12 → 3.10 |
| Pretrain R2.5 | 2048-token context for RAG | 22 GB corpus | 228K steps, loss 3.22 (best) |
| V2 Architecture | RAG-optimized model | — | 67.6M params, d_model=512, 2:1 GQA |
| V2 Pretrain | Full 67.6M pretraining | 22 GB corpus | IN PROGRESS |
| V2 SFT data | RAG-grounded ERP assistant | 7,595 pairs (Sonnet 4.6) | COMPLETE |
| RL (optional) | DPO / RLVR if SFT insufficient | TBD | FUTURE |

2. PRETRAINING HISTORY

The v1 model went through three pretraining rounds before SFT. Round 2 (shown below) established deep language understanding. Round 2.5 later extended context to 2048 tokens for RAG. The v2 model (67.6M) is a separate architecture pretrained from scratch. Full pretraining details are in the Architecture & Pretraining report.

228K
TOTAL STEPS
14.9B
TOKENS PROCESSED
10.5h
TRAINING TIME
3.46
FINAL LOSS
3.39
BEST LOSS

Loss curve progression

| Step | Loss | LR | Tok/s | Sample Quality |
|---|---|---|---|---|
| 50,000 (R1) | 2.62 | 3.0e-04 | ~16K (MPS) | Basic Turkish grammar |
| 95,000 | ~3.60 | 2.0e-04 | ~399K (H100) | Simple sentences, some repetition |
| 145,000 | ~3.50 | 1.1e-04 | ~401K | Coherent sentences with context |
| 195,000 | ~3.47 | 4.4e-05 | ~402K | Factual content: “Kocaeli’ndeyiz” |
| 228,000 | 3.46 | 3.0e-05 | ~403K | Real-world knowledge, correct grammar |
Note on loss difference: Round 1 loss (2.62) and Round 2 loss (3.46) are not directly comparable. Round 1 trained on a ~500MB curated subset; Round 2 trained on 22GB of diverse text spanning 11 domains. The higher absolute loss reflects the much harder prediction task across legal, medical, financial, news, and literary text — not a regression in model quality. Sample outputs confirm dramatically improved language understanding.

Sample evolution during Round 2

Step 95,000
“Merhaba deyin! Bu e-postayı seviyorum! Bu e-postanın amacı, bu e-postanın ana no…”
Step 145,000
“Türkiye Türkiye’de ve dünyada ekonomik gelişmeler açısından büyük önem taşımaktadır”
Step 195,000
“Türkiye’nin en büyük ikinci sanayi şehri konumundaki Kocaeli’ndeyiz. En önemli t…”
Step 200,000 — Best loss 3.39
“Türkiye’de tüm dünyada ‘insan hakları’ndan söz edildiği gibi, Türkiye’de de, ulu…”
Scaling law observation: At 26M parameters, the model is capacity-limited, not data-limited. Chinchilla-optimal training for 26M params is ~520M tokens (20× params), but the model processed 14.9B tokens (~573× params). Loss was still decreasing at step 228K but with diminishing returns — the model had exhausted most of its representational capacity.
What happened next: Round 2.5 retrained with max_seq_len=2048 and dropout=0.02 for RAG compatibility, achieving best loss 3.22 in 26.5 hours. The v2 architecture (67.6M, d_model=512, 2:1 GQA) was then designed to break the capacity ceiling — 4.2× more transformer parameters. See Section 15 and Section 16 of the Architecture report.

3. ERP DOCUMENTATION SOURCE

The SFT training data was generated from the Solen Kablo ERP system documentation — the same system the model is being built to assist with. The documentation was already pre-processed into structured chunks with rich metadata as part of a RAG (Retrieval-Augmented Generation) pipeline.

1,074
TOTAL CHUNKS
8
ERP MODULES
215K
TOTAL TOKENS
202
AVG TOKENS/CHUNK

Modules covered

| Module | Description | Chunks (TR) | Chunks (EN) |
|---|---|---|---|
| Admin | User management, roles, authentication, system settings | ~80 | ~80 |
| Hammadde | Raw materials: purchase orders, QR tracking, supplier management | ~90 | ~85 |
| Stok | Inventory: warehouse management, stock levels, movements | ~85 | ~85 |
| Teknik | Cable database: specifications, standards, production recipes | ~95 | ~90 |
| Lab | Quality control: test procedures, measurements, certificates | ~60 | ~55 |
| Üretim | Production: work orders, machine management, scheduling | ~50 | ~50 |
| Satış | Sales: customer orders, quotations, delivery tracking | ~40 | ~45 |
| Finans | Finance: invoicing, payments, cost analysis | ~32 | ~52 |
| Total | | 532 | 542 |

Chunk metadata

Each chunk contains structured metadata used to guide the QA generation:

| Field | Purpose | Example |
|---|---|---|
| chunk_id | Unique identifier for resume capability | erp-mod-hammadde-tr_chunk_042 |
| module | ERP module name | Hammadde (Raw Materials Management) |
| section_heading | Documentation section | Sipariş Yönetimi |
| breadcrumb | Navigation path | Hammadde > Siparişler > Yeni Sipariş |
| language | Source language | tr or en |
| token_count | Chunk size (for filtering) | 186 |
| has_table | Contains tabular data | true |
| has_code | Contains code/API references | false |
| references_modules | Cross-module references | ["Stok", "Üretim"] |
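A chunk record with this metadata might look like the following. This is a hypothetical example assembled from the fields in the table above, not a verbatim record from the actual chunk file, and the `is_eligible` helper is an illustrative assumption:

```python
# Hypothetical chunk record illustrating the metadata schema above;
# values follow the table's examples, not a verbatim export.
chunk = {
    "chunk_id": "erp-mod-hammadde-tr_chunk_042",
    "module": "Hammadde (Raw Materials Management)",
    "section_heading": "Sipariş Yönetimi",
    "breadcrumb": "Hammadde > Siparişler > Yeni Sipariş",
    "language": "tr",
    "token_count": 186,
    "has_table": True,
    "has_code": False,
    "references_modules": ["Stok", "Üretim"],
    "text": "...",  # the chunk's documentation text
}

def is_eligible(chunk: dict, min_tokens: int = 50) -> bool:
    """token_count exists for exactly this kind of filtering:
    skip fragments too small to yield meaningful QA pairs."""
    return chunk["token_count"] >= min_tokens
```

The `chunk_id` doubles as a resume key: a generator can skip IDs already present in its output file after an interruption.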

4. SYNTHETIC DATA GENERATION (V1)

The central challenge of SFT for a domain-specific model is data. Hand-writing thousands of QA pairs is impractical; using generic instruction datasets would not teach the model about Solen Kablo’s ERP system. The solution: use a large cloud LLM as a data conversion tool — not a knowledge source — to transform existing ERP documentation into training-ready QA pairs.

V1 vs V2 pipeline. This section describes the v1 approach (Claude Opus 4.5, individual chunks, 3,790 pairs). The v2 pipeline (Claude Sonnet 4.6, 11 grouping rules, 7,595 pairs) is documented in Section 13.
ERP Docs (HTML) → Chunk + Metadata → Claude Opus 4.5 → QA Pairs (JSONL) → ChatML Format

V1 design decisions

| Decision | V1 Choice | V2 Change |
|---|---|---|
| LLM | Claude Opus 4.5 | Claude Sonnet 4.6 (better question diversity) |
| Grouping | Individual chunks only | 11 rules: individual, submodule, module, cross-module, etc. |
| Target model | 24.7M, 512 context | 67.6M, 2048 context |
| Focus | User-centric questions | Same + technical questions (no audience restriction) |
| Format | Single + multi-turn (70/30) | Single-turn only (RAG: one question, one answer) |
| Difficulty | Graded, length-calibrated for 26M | Graded, max 200 words (uncapped format) |
| Volume | 3,790 pairs from 320 chunks | 7,595 pairs from 707 groups |
Anti-hallucination rule. The LLM was explicitly instructed: “HER cevap DOĞRUDAN verilen metin parçasından gelmeli. Metinde OLMAYAN bilgi EKLEME — hiçbir şey uydurma.” (Every answer must come DIRECTLY from the given text chunk. Do NOT add information that is NOT in the text — make nothing up.) This ensures the training data contains only verified information from the actual ERP documentation.

5. PROMPT ENGINEERING (V1)

The v1 generation pipeline uses a two-part prompt: a comprehensive system prompt (the “Grand Prompt”) that defines the task, formats, rules, and examples; and a per-chunk user message that provides metadata and the documentation text. The v2 prompt is significantly redesigned — see Section 13.

System prompt structure

| Section | Purpose |
|---|---|
| SİSTEM HAKKINDA (About the System) | Context about the ERP: 8 modules, cable factory, user types |
| TEK GÖREVİN (Your Single Task) | Single task definition: convert text to user-focused QA |
| HEDEF MODEL HAKKINDA (About the Target Model) | 26M parameter constraints: concise answers, no long paragraphs |
| İKİ TİP VERİ (Two Data Types) | Output format: single-turn and multi-turn JSON schemas |
| ZORLUK SEVİYELERİ (Difficulty Levels) | Difficulty definitions: kolay (1–2 sentences), orta (2–4), zor (4–6) |
| ODAK NOKTASI (Focus) | User-centric focus: “How do I use the system?” not code details |
| KRİTİK KURALLAR (Critical Rules) | 12 rules including anti-hallucination, translation, coverage |
| ÖRNEKLER (Examples) | Good/bad examples showing desired vs undesired output style |

Answer length calibration for v1 (26M parameters)

| Difficulty | Target Length | Question Type | Example |
|---|---|---|---|
| Kolay (Easy) | 1–2 sentences | Single fact: “X nedir?” | “QR kod ne işe yarar?” |
| Orta (Medium) | 2–4 sentences | Process/steps: “X nasıl yapılır?” | “Sisteme yeni bakır girişi nasıl yapılır?” |
| Zor (Hard) | 4–6 sentences | Multi-fact/scenario | “Sipariş tarihi değişirse ve kısmi teslimat yapılmışsa ne yapmalıyım?” |
Why calibrate length? (V1) A 26M parameter model cannot reliably generate long, complex answers. By constraining answer lengths during data generation, the model learns patterns it can actually reproduce. The v2 pipeline relaxes this for the 67.6M model: answers can be up to 200 words with flexible formatting (numbered steps, paragraphs, or lists), learned from examples rather than rigid templates.
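Length calibration of this kind is easy to enforce mechanically at generation time. A minimal sketch of such a validator, assuming the sentence bands from the table above (the crude regex sentence splitter is an illustrative simplification, not the project's actual code):

```python
import re

# Target sentence counts per difficulty, from the v1 calibration table.
LENGTH_BANDS = {"kolay": (1, 2), "orta": (2, 4), "zor": (4, 6)}

def within_band(answer: str, difficulty: str) -> bool:
    """Check a generated answer against its difficulty's sentence band.
    Sentence splitting here is a crude regex; a real validator would
    handle abbreviations and decimal points."""
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    lo, hi = LENGTH_BANDS[difficulty]
    return lo <= len(sentences) <= hi
```

Pairs failing the band can be dropped or regenerated before they reach the training set.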

Bad vs good question examples (from the prompt)

| Type | Question | Problem / Reason |
|---|---|---|
| BAD | “raw_materials tablosunun sütunları nelerdir?” | Too technical — DB schema, not user question |
| BAD | “POST /api/materials endpoint’i ne döndürür?” | API detail — users don’t know endpoints |
| GOOD | “QR kod ne işe yarar?” | User-centric, natural language |
| GOOD | “Sisteme yeni bakır girişi nasıl yapılır?” | Process-focused, practical |
| GOOD | “Operatör kalay teslim aldığında ne yapmalı?” | Scenario-based, role-aware |

6. V1 QA PAIR STATISTICS

The v1 generation script processed 320 out of 529 Turkish chunks before API credits were exhausted (~$20 on Claude Opus 4.5). The resulting dataset was sufficient for the 26M parameter model. For v2 statistics (7,595 pairs from 707 groups), see Section 13.

3,790
TOTAL QA ITEMS
3,170
SINGLE-TURN (84%)
620
MULTI-TURN (16%)
11.7
AVG ITEMS PER CHUNK

Difficulty distribution

1,451
KOLAY / EASY (38%)
1,593
ORTA / MEDIUM (42%)
746
ZOR / HARD (20%)

Generation cost

| Metric | Value |
|---|---|
| Model used | Claude Opus 4.5 (claude-opus-4-5-20251101) |
| Chunks processed | 320 / 529 Turkish chunks (61%) |
| API cost | ~$20 |
| Generation rate | ~24 items/minute |
| Items per chunk | ~11.7 average |
| Output file | sft_data/erp_qa_pairs.jsonl |
| ChatML file | sft_data/erp_sft_chatml.jsonl (2.64 MB) |

7. CHAT TEMPLATE & TOKENIZATION

The SFT data uses a Llama 3–style chat template, leveraging the special tokens already built into the tokenizer during Phase 1. This was a deliberate design decision: the tokenizer was built with instruction-tuning tokens before the model existed, anticipating this exact use case.

Special tokens used

| Token | ID | Role in Chat Template |
|---|---|---|
| <\|begin_of_text\|> | 0 | Start of conversation |
| <\|start_header_id\|> | 4 | Opens role header (system/user/assistant) |
| <\|end_header_id\|> | 5 | Closes role header |
| <\|eot_id\|> | 6 | End of turn marker |
| <\|pad\|> | 2 | Padding for batched training |

Template structure

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Sen Solen Kablo ERP sisteminin yapay zeka asistanısın...
<|eot_id|><|start_header_id|>user<|end_header_id|>

QR kod ne işe yarar?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

QR kod her hammaddeye atanan benzersiz bir takip kodudur...
<|eot_id|>

Multi-turn extension

For multi-turn conversations (2–3 turns), the template simply repeats the user/assistant blocks. Each turn ends with <|eot_id|>, and the model learns to generate until it produces this token.
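The template above can be rendered with a few lines of string assembly. This is a minimal sketch using the special-token strings from the table, not the project's actual formatting code, and it works for any number of turns:

```python
# Special-token strings from the tokenizer's chat-template vocabulary.
BOT, SH, EH, EOT = ("<|begin_of_text|>", "<|start_header_id|>",
                    "<|end_header_id|>", "<|eot_id|>")

def render(messages: list[dict]) -> str:
    """messages: [{"role": "system"|"user"|"assistant", "content": str}, ...]
    Each turn is a role header, a blank line, the content, then EOT."""
    out = BOT
    for m in messages:
        out += f"{SH}{m['role']}{EH}\n\n{m['content']}{EOT}"
    return out

convo = [
    {"role": "system", "content": "Sen Solen Kablo ERP sisteminin yapay zeka asistanısın."},
    {"role": "user", "content": "QR kod ne işe yarar?"},
    {"role": "assistant", "content": "QR kod her hammaddeye atanan benzersiz bir takip kodudur."},
]
print(render(convo))
```

Multi-turn conversations fall out for free: appending more user/assistant dicts simply repeats the blocks, each terminated by `<|eot_id|>`.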

8. LOSS MASKING STRATEGY

A critical detail of SFT: loss is computed only on assistant tokens. The model must learn to predict assistant responses, but it should not be penalized for failing to predict the system prompt or user questions (which are given as input, not generated).

Token-level loss mask

Token sequence:
[BOS] [HEADER:system] system content... [EOT]
      [HEADER:user]   user question... [EOT]
      [HEADER:asst]   assistant answer... [EOT]

Loss mask:
  -1    -1  -1  -1  -1  ...  -1     ← system (ignored)
  -1    -1  -1  -1  -1  ...  -1     ← user (ignored)
  -1    -1  YES YES YES ... YES    ← assistant content + EOT (trained)

The ignore_index=-1 parameter in PyTorch’s cross_entropy function handles this natively. Positions marked -1 contribute zero to the loss. Only the assistant’s content tokens and the trailing <|eot_id|> are trained on.

Why mask the header too? Even the <|start_header_id|>assistant<|end_header_id|>\n\n prefix is masked. The model doesn’t need to learn to predict the role header — it will always be provided as part of the prompt template. Training only on content maximizes the signal-to-noise ratio for any model’s parameter budget. This strategy is used for both v1 (26M) and v2 (67.6M).
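The masking rule can be sketched without any framework: given per-token role tags, emit -1 everywhere except assistant content and its closing `<|eot_id|>`. The role-tag representation below is an illustrative assumption (the real implementation lives in tiny_llm/sft_data.py):

```python
IGNORE = -1  # matches the ignore_index=-1 passed to cross_entropy

def build_labels(token_ids: list[int], roles: list[str]) -> list[int]:
    """roles[i] tags token i: 'system', 'user', 'assistant_header',
    or 'assistant'. Only plain 'assistant' tokens (content + EOT)
    keep their id as a label; every other position is ignored."""
    return [tid if role == "assistant" else IGNORE
            for tid, role in zip(token_ids, roles)]

# Toy sequence: 3 system tokens, 2 user tokens, 2 assistant-header
# tokens, then 2 assistant content tokens and the EOT (id 6).
ids = [10, 11, 12, 20, 21, 4, 5, 30, 31, 6]
roles = (["system"] * 3 + ["user"] * 2
         + ["assistant_header"] * 2 + ["assistant"] * 3)
labels = build_labels(ids, roles)
```

With these labels, `F.cross_entropy(logits.view(-1, V), labels_tensor, ignore_index=-1)` trains only on the assistant span, exactly as the mask diagram above shows.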

9. V1 SFT TRAINING CONFIGURATION

| Parameter | Value | Rationale |
|---|---|---|
| Base checkpoint | step_228000.pt | Best available pretrained model |
| Epochs | 3 | Small dataset benefits from multiple passes |
| Batch size | 8 × 2 accum = 16 | Effective batch of 16 conversations |
| Learning rate | 2e-5 → 2e-6 | ~15× lower than pretraining (3e-4) |
| LR schedule | Cosine with warmup | 63 warmup steps (10% of total) |
| Dropout | 0.05 | Regularization for small dataset (was 0.0 in pretraining) |
| Weight decay | 0.01 | Lower than pretraining (0.1) |
| Gradient clip | 1.0 | Stability |
| Optimizer | AdamW (fresh) | New optimizer state, not resumed from pretraining |
| Validation split | 10% (379 conversations) | Held-out evaluation after each epoch |
3,411
TRAIN CONVERSATIONS
379
VALIDATION CONVERSATIONS
141K
TRAINABLE TOKENS (ASSISTANT ONLY)
Why a fresh optimizer? The pretrained model’s AdamW momentum was tuned for next-token prediction on raw Turkish text. SFT is a fundamentally different task (instruction following) with a much smaller learning rate. Reusing the old optimizer state would inject stale momentum from pretraining, potentially causing instability. A fresh optimizer allows clean adaptation.
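The schedule in the table (63 warmup steps, then cosine decay from 2e-5 to 2e-6 over 639 steps) can be written in a few lines. This is a sketch of the standard warmup-plus-cosine form, not the exact code from train_sft.py:

```python
import math

MAX_LR, MIN_LR = 2e-5, 2e-6
WARMUP, TOTAL = 63, 639

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR over WARMUP steps,
    then cosine decay down to MIN_LR at step TOTAL."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

At step 62 the rate peaks at 2e-5, and at step 639 it bottoms out at exactly 2e-6.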

10. V1 RESULTS

5.12
VAL LOSS BEFORE SFT
3.10
VAL LOSS AFTER SFT
2.01
ABSOLUTE IMPROVEMENT
39.5%
RELATIVE IMPROVEMENT

Per-epoch breakdown

| Epoch | Train Loss | Val Loss | Best? | Time |
|---|---|---|---|---|
| Before SFT | — | 5.12 | | 0s |
| Epoch 1 | 8.53 | 3.34 | | 28s |
| Epoch 2 | 6.55 | 3.14 | | 60s |
| Epoch 3 | 6.07 | 3.10 | ★ Best | 100s |

Train loss is higher than val loss because the reported train loss averages all batches across the epoch — including the early, still-high-loss steps — and is computed with dropout active, while validation loss is measured once at the end of the epoch over the full validation set with dropout disabled.

Key result: In 100 seconds of SFT training on an H100 GPU, the model went from producing incoherent repetitions (“Digital program Digital program…”) to generating domain-relevant Turkish answers about the ERP system. The validation loss was still decreasing at epoch 3, suggesting more epochs could yield further improvement.

11. V1 MODEL OUTPUTS

Before SFT (pretrained only)

BEFORE SFT — Question: “QR kod ne işe yarar?”
“Digital program Digital program Digital program Digital program Digital program…”

The pretrained model has no concept of the chat template, question-answering, or ERP knowledge. Given the structured prompt, it falls into degenerate repetition.

After SFT (epoch 3)

AFTER SFT — Question: “QR kod ne işe yarar?”
“QR kod, bir malzemenin hangi malzeme türüne ait olduğunu takip eder.”
AFTER SFT — Question: “Sisteme yeni bakır girişi nasıl yapılır?”
“Sistem bakır için yeni bakır girişini otomatik olarak hesaplar. Bu sayede bakırın tüm üretim süreci takip edilir.”
AFTER SFT — Question: “Hammadde modülünde neler yapabilirim?”
“Hammadde modülünde farklı hammaddeler için malzemeler listesi oluşturulur.”
AFTER SFT — Question: “Satış siparişi nasıl oluşturulur?”
“Satış siparişi otomatik olarak açılır menüde gösterilir. Bu menüde sipariş detayları, tarih ve saat alanları, teslim edilen miktar, teslimat tarihi alanlarına ağırlık ayarlarıyla birlikte gösterilir.”
Observations: The model (1) speaks coherent Turkish with correct grammar and suffixes, (2) stays on-topic about ERP operations, (3) uses domain vocabulary correctly (malzeme, sipariş, teslimat, modül), and (4) follows the question-answer pattern learned during SFT. Answers are concise, matching the difficulty-calibrated length targets. Some answers remain somewhat generic — expected after only 3 epochs of SFT on a 26M model.

12. ANALYSIS & LESSONS LEARNED

What v1 SFT proved

In 100 seconds on an H100, 3,790 synthetic QA pairs turned the pretrained 24.7M model from degenerate repetition into a coherent Turkish ERP assistant: the chat template and assistant-only loss masking worked as designed, validation loss fell 39.5%, and answers stayed on-topic with correct domain vocabulary. The approach is sound; the remaining problems were model capacity and data coverage.

V1 limitations that drove the v2 redesign

| V1 Limitation | Root Cause | V2 Solution |
|---|---|---|
| Repetitive outputs | 26M model capacity + no dropout + 512-token context | 67.6M model + dropout 0.02 + 2048 context |
| Generic, ungrounded answers | No RAG context in training — model guesses from memorized patterns | Every training example includes the source chunk as context |
| Only 60% chunk coverage | API credits exhausted at 320/529 chunks | All 532 chunks processed via 707 groups across 11 rules |
| Single-chunk questions only | No grouping strategy — each chunk processed individually | 11 grouping rules (submodule, cross-module, data flow, etc.) |
| No audience diversity | V1 prompt targeted “office worker” questions only | V2 generates both user-level and technical questions |
| Sonnet 4 / Opus 4.5 API | Good but not optimal for Turkish Q&A diversity | Sonnet 4.6 selected via 4-way comparative pilot |

Key insight: SFT is the critical phase at this scale

For models under ~1B parameters, SFT training data quality is the single most important factor. RL techniques (DPO, RLHF, RLVR) provide diminishing returns at this scale because the model lacks the capacity to benefit from fine-grained preference signals. Instead of chasing RL, the strategy is: (1) maximize SFT data quality via better prompts, diverse grouping, and superior API models; (2) improve the retrieval pipeline so the model sees better context at inference time; (3) iterate on the RAG prompt engineering for the deployed system. RL remains an optional future phase if SFT alone proves insufficient.
Experiment: intentional wrong answers in SFT. An earlier experiment injected incorrect answers that self-corrected mid-response (e.g., “wait, this is wrong…”). Result: overall model quality degraded significantly. Never inject bad answers into SFT data. If preference learning is needed, use DPO with separate (chosen, rejected) pairs instead.

13. V2 SFT PIPELINE REDESIGN

The v1 SFT (3,790 pairs from Claude Opus 4.5) proved the concept but exposed limitations: only 320 of 529 chunks were processed before API credits ran out, the grouping strategy was basic (individual chunks only), and the 512-token context couldn’t fit real RAG prompts. The v2 pipeline was redesigned from scratch for the 67.6M RAG-optimized model.

707
CHUNK GROUPS
11
GROUPING RULES
532
SOURCE CHUNKS
408K
INPUT TOKENS
7,595
QA PAIRS GENERATED

Why redesign?

Each v1 limitation in Section 12 maps to a v2 design change: full coverage of all 532 chunks replaces the 60% cutoff, 11 grouping rules replace single-chunk processing, the source chunk is embedded in every training example, and the 2048-token context window fits real retrieval prompts.

Master grouping strategy: 11 rules

Each rule produces a different perspective on the same documentation, forcing diverse question types:

| # | Rule | Groups | Tokens | Description |
|---|---|---|---|---|
| 1 | Individual | 532 | 107K | Each chunk alone — factual, definition, basic procedure questions |
| 2 | Submodule | 86 | 107K | Chunks grouped by submodule — cross-chunk synthesis within a feature |
| 3 | Module | 8 | 107K | All chunks per module — high-level architectural questions |
| 4 | Vertical stack | 10 | 13K | Same feature at different depths (overview → detail → API) |
| 5 | Horizontal siblings | 16 | 18K | Parallel features at same depth — comparison questions |
| 6 | Cross-module | 6 | 7K | Related features across different modules |
| 7 | Overview + detail | 18 | 18K | Module overview paired with specific submodule details |
| 8 | Data flow chain | 4 | 7K | Sequential process chains (e.g., order → production → delivery) |
| 9 | Foundation + consumer | 10 | 6K | Base definitions paired with features that use them |
| 10 | Shared DB tables | 13 | 16K | Features sharing database tables — data integration questions |
| 11 | Module map | 4 | 3K | Full module structure maps for navigation questions |
| Total | | 707 | 408K | |
Distribution note. Rule 1 (individual chunks) produces 75% of all groups (532/707) and, in the final dataset, 57% of pairs (4,349 of 7,595). This is intentional: the majority of real user queries will be answerable from a single retrieved chunk. The remaining 10 rules (175 groups, 3,246 pairs) train the model on harder multi-chunk reasoning that occurs when the retriever returns related but separate passages.
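Rule 2 (submodule grouping) is essentially a group-by over chunk metadata. A sketch under the assumption that the breadcrumb's first two segments identify a submodule (the real blueprint lives in sft_chunk_groups.json):

```python
from collections import defaultdict

def group_by_submodule(chunks: list[dict]) -> dict[str, list[dict]]:
    """Bucket chunks by the first two breadcrumb segments, e.g.
    'Hammadde > Siparişler > Yeni Sipariş' -> 'Hammadde > Siparişler'."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for c in chunks:
        key = " > ".join(c["breadcrumb"].split(" > ")[:2])
        groups[key].append(c)
    return dict(groups)

# Toy chunks with hypothetical breadcrumbs.
chunks = [
    {"chunk_id": "a", "breadcrumb": "Hammadde > Siparişler > Yeni Sipariş"},
    {"chunk_id": "b", "breadcrumb": "Hammadde > Siparişler > Sipariş Listesi"},
    {"chunk_id": "c", "breadcrumb": "Stok > Depolar > Depo Ekle"},
]
groups = group_by_submodule(chunks)
```

The other rules follow the same pattern with different keys (module name, shared DB table, process chain position).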

API model comparison: 4-way pilot

Before committing to a single API model for all 707 groups, a controlled pilot was run on the same 10 chunk groups with 4 different models:

| Model | Pairs | Question Diversity | Answer Depth | Inference Quality | Repetition |
|---|---|---|---|---|---|
| Claude Sonnet 4 | 99 | Low — mostly factual | Adequate | Shallow | Some |
| GPT-5.2 | 101 | Moderate | Good | Good | Minimal |
| Claude Opus 4.6 | 100 | Good | Excellent | Very good | Minimal |
| Claude Sonnet 4.6 | 100 | Excellent | Excellent | Best — deep inference | Minimal |
Winner: Claude Sonnet 4.6. Sonnet 4.6 produced the most diverse question types (factual, procedural, comparative, inferential, list) with the deepest inference-based questions. It consistently generated questions that require combining information from multiple parts of the context, rather than simple lookups. Sonnet 4 was notably weaker — its questions were almost entirely factual with shallow answers. Opus 4.6 was close but slightly less diverse.

V2 prompt engineering

The data generation uses a two-level prompt architecture:

1. SFT system prompt (embedded in every training example, 38 tokens):

ERP sistemi asistanısın. Verilen bağlam bilgilerini kullanarak
soruyu yanıtla. Bağlamda cevap yoksa "Bu konuda bilgim yok" de.
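Combining this embedded system prompt with a retrieved chunk yields one RAG-grounded training example. A minimal sketch — the exact layout of the user turn (the "Bağlam:"/"Soru:" labels) is an illustrative assumption, not the verbatim format of sft_train.jsonl:

```python
SYSTEM = ('ERP sistemi asistanısın. Verilen bağlam bilgilerini kullanarak '
          'soruyu yanıtla. Bağlamda cevap yoksa "Bu konuda bilgim yok" de.')

def to_chatml(context: str, question: str, answer: str) -> list[dict]:
    """One RAG-grounded training example: the source chunk travels
    inside the user turn, so the model learns context -> answer
    rather than memorizing facts."""
    user = f"Bağlam:\n{context}\n\nSoru: {question}"
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
        {"role": "assistant", "content": answer},
    ]
```

At inference time the retriever fills the same slot the source chunk occupied during training, which is the whole point of the context-converter design.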
System prompt design decisions: the prompt is deliberately short (38 tokens) so retrieved context dominates the window, it is written in Turkish to match the deployment language, and it hard-codes a single refusal phrase (“Bu konuda bilgim yok”) so that out-of-context behavior is trainable and easy to detect.

2. API system prompt (sent to Claude Sonnet 4.6, not seen by the model):

A detailed Turkish-language instruction set covering the five question types (factual, procedural, comparative, inferential, list), the three difficulty levels, the 200-word answer cap, and the anti-hallucination rule that every answer must be grounded in the provided chunks.

V2 generation: final results

7,595
TOTAL QA PAIRS
707/707
GROUPS COMPLETE
10.7
AVG PAIRS/GROUP
0
INVALID PAIRS
44.9 MB
RAW DATA SIZE
| Metric | Value |
|---|---|
| Groups processed | 707 / 707 (100%) |
| QA pairs generated | 7,595 |
| Invalid pairs | 0 |
| Parse errors (retried) | 2 (both resolved on retry) |
| Avg / Min / Max pairs per group | 10.7 / 5 / 29 |
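The "2 parse errors, both resolved on retry" row reflects a simple retry loop around JSON parsing. A generic sketch — the `generate` callable here is a stand-in for the real API call, not the project's actual wrapper:

```python
import json
from typing import Callable

def generate_with_retry(generate: Callable[[], str], max_tries: int = 3) -> list:
    """Call the (stand-in) generator, parse its JSON output, and retry
    on malformed responses. Raises after max_tries failed attempts."""
    last_err = None
    for _ in range(max_tries):
        try:
            pairs = json.loads(generate())
            if isinstance(pairs, list):
                return pairs  # well-formed list of QA dicts
        except json.JSONDecodeError as e:
            last_err = e  # malformed JSON: try again
    raise ValueError(f"unparseable after {max_tries} tries: {last_err}")
```

Combined with per-group output files keyed by group ID, this makes the pipeline resumable after interruptions as well as robust to occasional malformed completions.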

Question type distribution

| Type | Count | Percentage |
|---|---|---|
| Olgusal (factual) | 3,850 | 50.7% |
| Çıkarımsal (inferential) | 1,164 | 15.3% |
| Liste (list) | 999 | 13.2% |
| Prosedürel (procedural) | 896 | 11.8% |
| Karşılaştırmalı (comparative) | 686 | 9.0% |

Difficulty distribution

3,791
KOLAY / EASY (49.9%)
2,629
ORTA / MEDIUM (34.6%)
1,175
ZOR / HARD (15.5%)

Pairs by grouping rule

| # | Rule | Groups | Pairs | Avg/Group |
|---|---|---|---|---|
| 1 | Individual | 532 | 4,349 | 8.2 |
| 2 | Submodule | 86 | 1,445 | 16.8 |
| 3 | Module | 8 | 228 | 28.5 |
| 4 | Vertical stack | 10 | 203 | 20.3 |
| 5 | Horizontal siblings | 16 | 309 | 19.3 |
| 6 | Cross-module | 6 | 120 | 20.0 |
| 7 | Overview + detail | 18 | 351 | 19.5 |
| 8 | Data flow chain | 4 | 99 | 24.8 |
| 9 | Foundation + consumer | 10 | 150 | 15.0 |
| 10 | Shared DB tables | 13 | 271 | 20.8 |
| 11 | Module map | 4 | 70 | 17.5 |
| Total | | 707 | 7,595 | 10.7 |
V1 → V2 final comparison. V1 produced 3,790 pairs from 320 individual chunks using Claude Opus 4.5. V2 produced 7,595 pairs from 707 multi-configuration groups using Claude Sonnet 4.6 — a 2.0× increase in data volume with substantially higher question diversity (11 grouping rules vs 1, 5 question types, 3 difficulty levels).
| Metric | V1 | V2 |
|---|---|---|
| API model | Claude Opus 4.5 | Claude Sonnet 4.6 |
| Chunks processed | 320 / 529 (60%) | 532 / 532 (100%) |
| Grouping rules | 1 (individual only) | 11 rules |
| Total groups | 320 | 707 |
| QA pairs | 3,790 | 7,595 |
| Question types | Mixed, unstructured | 5 types, difficulty-graded |
| RAG context in training | No | Yes — every example |
| Output file | sft_data/erp_qa_pairs.jsonl | erp_rag/data/sft_raw_pairs.json (44.9 MB) |
| ChatML file | sft_data/erp_sft_chatml.jsonl (2.64 MB) | erp_rag/data/sft_train.jsonl (45.8 MB) |

14. REPRODUCIBILITY

Complete file inventory

V1 SFT files:

| File | Purpose | Size |
|---|---|---|
| scripts/generate_erp_qa.py | V1 QA generation (Claude Opus 4.5) | ~12 KB |
| sft_data/erp_qa_pairs.jsonl | V1 raw QA pairs | ~3 MB |
| sft_data/erp_sft_chatml.jsonl | V1 ChatML training data | 2.64 MB |
| tiny_llm/train_sft.py | V1 SFT training loop | ~14 KB |
| tiny_llm/checkpoints/sft/sft_best.pt | V1 best SFT checkpoint (epoch 3) | 94 MB |

V2 SFT files:

| File | Purpose | Size |
|---|---|---|
| erp_rag/generate/sft_generate.py | V2 data generation pipeline (Claude Sonnet 4.6) | ~18 KB |
| erp_rag/data/sft_chunk_groups.json | Master grouping blueprint (707 groups, 11 rules) | ~400 KB |
| erp_rag/data/sft_raw_pairs.json | V2 raw QA pairs with context | 44.9 MB |
| erp_rag/data/sft_train.jsonl | V2 ChatML training data (7,595 examples) | 45.8 MB |
| tiny_llm/train_sft_rag.py | V2 RAG-grounded SFT training loop | ~20 KB |

Shared files:

| File | Purpose | Size |
|---|---|---|
| tiny_llm/sft_data.py | Tokenization and assistant-only loss masking | ~4 KB |
| tiny_llm/chat.py | Interactive chat with SFT model | ~5 KB |
| erp_rag/data/chunks/all_chunks.json | Pre-processed ERP documentation (1,074 chunks) | 1.9 MB |

To reproduce

# V2 pipeline (recommended)

# 1. Generate QA pairs (requires Anthropic API key, ~$20-25)
pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
python -m erp_rag.generate.sft_generate \
    --provider anthropic --model claude-sonnet-4-6   # 707 groups → 7,595 pairs, auto-resume on interruption

# 2. Upload data to RunPod and run SFT training
python -m tiny_llm.train_sft_rag \
    --model v2 \
    --data erp_rag/data/sft_raw_pairs.json           # reads raw pairs, builds ChatML, trains with loss masking

# 3. Chat with the model
python -m tiny_llm.chat

Experiments tried & decisions locked

DO NOT RE-TRY: These SFT experiments and configurations were already evaluated.
| Experiment | What Was Tried | Result | Decision |
|---|---|---|---|
| SFT on local Mac (MPS) | Ran train_sft.py on M4 MacBook | CPU hit 93°C; killed immediately | RunPod H100 only — locked |
| torch.compile checkpoint loading | Loaded step_228000.pt directly into non-compiled model | All keys mismatched (_orig_mod. prefix); loss ~20.0 | Always strip _orig_mod. prefix — locked |
| V1: Claude Opus 4.5 | 320/529 chunks processed before API credits exhausted ($20) | 3,790 QA pairs (3,170 single + 620 multi-turn) | Superseded by V2 pipeline |
| V2: Claude Sonnet 4.6 | 707/707 groups, 11 rules, 532 chunks, ~$20–25 | 7,595 QA pairs, 0 invalid, 5 types, 3 difficulty levels | V2 pipeline — locked; production-ready SFT data |
| SFT dropout 0.05 | Increased from pretraining’s 0.0 | Val loss improved steadily across 3 epochs; no overfitting | Dropout 0.05 for SFT — locked |
| LR 2e-5 for SFT | 15× lower than pretraining peak (3e-4) | Stable training; val loss 5.12 → 3.10 across 3 epochs | LR 2e-5 for SFT — locked |
| 3 SFT epochs | Trained 3 full epochs over 3,411 train conversations | Best val loss at epoch 3 (3.10); still improving | Could train more epochs, but model already follows instructions |
| Fresh optimizer for SFT | New AdamW (did not carry pretraining optimizer state) | Correct approach: SFT loss surface differs from pretraining | Fresh optimizer for SFT — locked |
| Loss masking (assistant-only) | IGNORE=-1 for system/user tokens; only assistant+EOT contribute to loss | Model learns to generate answers, not memorize questions | Assistant-only loss — locked |
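The torch.compile fix noted in the experiments table is a pure key rename on the checkpoint's state dict. Sketched here without torch (the same dict transform applies to the real checkpoint before `load_state_dict`):

```python
def strip_compile_prefix(state_dict: dict) -> dict:
    """torch.compile wraps the model in an OptimizedModule and saves
    parameters under '_orig_mod.'; strip that prefix so the keys
    match a non-compiled model's state dict."""
    prefix = "_orig_mod."
    return {(k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in state_dict.items()}
```

Loading without this rename is exactly the failure mode recorded above: every key mismatches and the model behaves as if randomly initialized (loss ~20).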
Cost breakdown for the entire project (Phases 1–4): Tokenizer training: free (CPU). Pretraining v1 (R1+R2+R2.5): ~$92.83 (39h × $2.38/hr). V1 SFT data generation: ~$20 (Claude Opus 4.5 API). V1 SFT training: <$0.10 (100 seconds on H100). V2 SFT data generation: ~$22 (Claude Sonnet 4.6 API, 707 calls). Total so far: ~$135 (excluding v2 pretraining).
Conclusion. Two generations of SFT have been built for this project. V1 proved the concept: 3,790 QA pairs from Claude Opus 4.5 transformed a 24.7M pretrained model into a functional ERP assistant in 100 seconds. V2 redesigns everything at scale: 11 grouping strategies produce 707 chunk groups from 532 ERP documentation chunks, and Claude Sonnet 4.6 (selected via a 4-way comparative pilot) generated 7,595 diverse QA pairs spanning factual, procedural, comparative, inferential, and list-type questions at three difficulty levels. The target model (67.6M v2 with 2048-token context) was purpose-built as a RAG context converter. The entire pipeline — tokenizer, architecture, pretraining, data generation, SFT training — remains built from scratch with no pretrained weights, no HuggingFace trainers, and no off-the-shelf datasets. V2 SFT data generation is complete: 7,595 pairs, ready for training. Total project cost to date: ~$135.

© 2026 • Independent Research