CREATING AI FROM SCRATCH
Started with a Turkish tokenizer, discovered that everything is a language. Building domain-specific AI from byte-pair encoding up.
PROJECT CONTEXT & JOURNEY
Why we started, what we discovered, and how a simple tokenizer question cracked open the entire AI landscape. The full story.
PHASE 1 • PUBLISHED • TOKENIZER RESEARCH
Custom BPE tokenizer for Turkish: 64K vocabulary, 2.7× token efficiency over GPT-4's tokenizer, discovery of a GPT-2 pre-tokenization regex bug, and vocabulary saturation analysis.
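At the heart of this phase is byte-pair encoding. A minimal sketch of one BPE training step (illustrative only, not the project's actual 64K tokenizer): repeatedly find the most frequent adjacent pair and merge it into a new token.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Five merge rounds over an agglutinative Turkish word.
tokens = list("kitaplarımızdan")  # "from our books"
for _ in range(5):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

A production tokenizer adds byte-level fallback and a pre-tokenization regex on top of this loop; the GPT-2 regex bug mentioned above lives in that pre-tokenization stage.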
PHASE 2 & 3 • PUBLISHED • ARCHITECTURE & PRETRAINING
Two models: 24.7M parameters (v1) and 67.6M (v2). ALiBi positional biases, grouped-query attention (GQA), SwiGLU, RMSNorm. Three pretraining rounds for v1 plus the v2 pretrain, on a 22 GB corpus with 2048-token context. v1 cost: $92.83.
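Two of the components named above have compact standard definitions. A sketch (textbook formulations, not the project's training code; shapes are made up for illustration):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by root-mean-square only; no mean subtraction,
    no bias, unlike classic LayerNorm."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: a SiLU-gated linear unit,
    as popularized by LLaMA-style architectures."""
    silu = lambda z: z / (1.0 + np.exp(-z))  # SiLU / swish activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d, h = 8, 16                      # hypothetical model/hidden dims
x = np.random.randn(4, d)
y = rms_norm(x, np.ones(d))
out = swiglu(y, np.random.randn(d, h),
                np.random.randn(d, h),
                np.random.randn(h, d))
```

ALiBi, the third component, replaces learned positional embeddings with a fixed linear distance penalty added to attention scores, which is what lets a model trained at 2048 context degrade gracefully beyond it.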
PHASE 4 • PUBLISHED • FINE-TUNING (SFT)
v1: 3,790 instruction pairs (generated with Opus 4.5). v2: 7,595 pairs from 707 source groups under 11 generation rules, using Claude Sonnet 4.6. RAG-grounded pipeline complete.
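The mechanical core of SFT on such pairs is loss masking: compute cross-entropy only on the response tokens, not the prompt. A sketch of the standard convention (the `-100` ignore index follows common framework practice; this is not the project's exact pipeline):

```python
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt + response into one sequence; mask prompt
    positions in the labels so the loss covers only the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token ids for one instruction pair.
inp, lab = build_labels([101, 7, 42], [9, 13, 102])
```

This way the model learns to produce answers conditioned on instructions rather than to regenerate the instructions themselves.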
REINFORCEMENT LEARNING
Direct Preference Optimization (DPO) on preference pairs, or RLVR with verifiable rewards. Optional: SFT is the critical phase at this model scale.
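For reference, the DPO objective itself is small enough to sketch per example (standard formulation from the DPO paper, with scalar sequence log-probabilities; not project code):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: push the policy's log-prob margin
    (chosen - rejected) above the frozen reference model's margin.
    Equals -log(sigmoid(beta * margin))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference (margin 0) the loss is log 2; it shrinks as the policy prefers the chosen response more strongly than the reference does, which is why DPO needs only preference pairs and no separate reward model.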