CREATING AI FROM SCRATCH

Started with a Turkish tokenizer; discovered that everything is a language. Building domain-specific AI from byte-pair encoding up.

PHASE 0 • PUBLISHED

PROJECT CONTEXT & JOURNEY

Why we started, what we discovered, and how a simple tokenizer question cracked open the entire AI landscape. The full story.

CONTEXT 12 SECTIONS
PHASE 1 • PUBLISHED

TOKENIZER RESEARCH

Custom BPE tokenizer for Turkish. 64K vocabulary, 2.7x token efficiency over GPT-4's tokenizer on Turkish text, GPT-2 regex bug discovery, vocabulary saturation analysis.

64K VOCAB 2.7x EFFICIENCY
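The core BPE training loop behind a tokenizer like this is small: repeatedly find the most frequent adjacent byte pair and merge it into a new token id. A minimal sketch (not the project's actual implementation; training over raw UTF-8 bytes, which keeps multi-byte Turkish characters like ç and ı intact):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merge rules starting from the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pair = most_frequent_pair(ids)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return ids, merges
```

A real 64K-vocabulary tokenizer adds pre-tokenization (the regex split where the GPT-2 bug lives) and trains on a large corpus, but the merge loop is the same idea.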
PHASE 2 & 3 • PUBLISHED

ARCHITECTURE & PRETRAINING

Two models: 24.7M params (v1) & 67.6M params (v2). ALiBi, GQA, SwiGLU, RMSNorm. Three pretraining rounds plus the v2 pretrain. 22 GB corpus, 2048-token context. $92.83 total v1 training cost.

~44.7B TOKENS 506K+ STEPS
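Two of the named components are simple enough to show directly. A minimal sketch (illustrative only, plain Python rather than the project's tensor code): RMSNorm rescales activations by their root-mean-square with no mean subtraction or bias, and ALiBi replaces positional embeddings with a per-head linear distance penalty added to attention scores.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the reciprocal root-mean-square (no mean, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def alibi_bias(seq_len, slope):
    """ALiBi: linear distance penalty on attention scores for one head;
    future (non-causal) positions are masked with -inf."""
    return [[-slope * (q - k) if k <= q else float("-inf")
             for k in range(seq_len)]
            for q in range(seq_len)]
```

Because ALiBi encodes position as a bias rather than a learned embedding, it extrapolates to sequences longer than the 2048-token training context.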
PHASE 4 • PUBLISHED

FINE-TUNING (SFT)

V1: 3,790 pairs generated with Claude Opus 4.5. V2: 7,595 pairs from 707 source groups under 11 generation rules, generated with Claude Sonnet 4.6. RAG-grounded generation pipeline complete.

7,595 EXAMPLES 707 GROUPS
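An SFT dataset of this shape is ultimately a list of rendered (instruction, response) strings. A minimal sketch, with a hypothetical prompt template and special tokens (the project's actual format is not shown here), plus the deduplication any generated dataset needs:

```python
def format_sft_example(instruction, response, bos="<s>", eos="</s>"):
    """Render one pair into a training string.
    The "### Soru/Cevap" template and <s>/</s> tokens are illustrative."""
    return f"{bos}### Soru:\n{instruction}\n\n### Cevap:\n{response}{eos}"

def build_dataset(pairs):
    """Format pairs, dropping near-duplicate instructions
    (guards against repeated generations from the same source group)."""
    seen, out = set(), []
    for instruction, response in pairs:
        key = instruction.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        out.append(format_sft_example(instruction, response))
    return out
```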
PHASE 5 • FUTURE

REINFORCEMENT LEARNING

DPO on preference pairs, or RLVR (reinforcement learning with verifiable rewards). Optional: at this model scale, SFT is the critical phase.

DPO / RLVR OPTIONAL
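The DPO objective itself is one line: given summed log-probabilities of a chosen and a rejected response under the policy and a frozen reference model, it is the negative log-sigmoid of the scaled log-ratio margin. A minimal sketch for a single preference pair (assuming log-probs are already computed):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where margin compares policy-vs-reference log-ratios of chosen vs rejected."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; it falls as the policy favors the chosen response more than the reference does.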