CREATING AI FROM SCRATCH

Started with a Turkish tokenizer; discovered that everything is a language. Building domain-specific AI from byte-pair encoding up.

PHASE 0 • PUBLISHED

PROJECT CONTEXT & JOURNEY

Why we started, what we discovered, and how a simple tokenizer question cracked open the entire AI landscape. The full story.

CONTEXT 12 SECTIONS
PHASE 1 • PUBLISHED

TOKENIZER RESEARCH

Custom BPE tokenizer for Turkish. 64K vocabulary, 2.7x token efficiency over GPT-4's tokenizer on Turkish text, GPT-2 regex bug discovery, vocabulary saturation analysis.

64K VOCAB 2.7x EFFICIENCY
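The core BPE training loop behind a tokenizer like this is small: repeatedly find the most frequent adjacent byte pair and merge it into a new token id. A minimal sketch (not the project's actual implementation; training over raw UTF-8 bytes, which keeps multi-byte Turkish characters like ç and ı intact):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn `num_merges` merge rules starting from the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pair = most_frequent_pair(ids)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return ids, merges
```

A real 64K-vocabulary tokenizer adds pre-tokenization (the regex split where the GPT-2 bug lives) and trains on a large corpus, but the merge loop is the same idea.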
PHASE 2 & 3 • PUBLISHED

ARCHITECTURE & PRETRAINING

Two models: 24.7M params (v1) & 67.6M params (v2). ALiBi, GQA, SwiGLU, RMSNorm. Three pretraining rounds plus the v2 pretrain. 22 GB corpus, 2048-token context. $92.83 total v1 training cost.

~44.7B TOKENS 506K+ STEPS
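Two of the named components are simple enough to show directly. A minimal sketch (illustrative only, plain Python rather than the project's tensor code): RMSNorm rescales activations by their root-mean-square with no mean subtraction or bias, and ALiBi replaces positional embeddings with a per-head linear distance penalty added to attention scores.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the reciprocal root-mean-square (no mean, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def alibi_bias(seq_len, slope):
    """ALiBi: linear distance penalty on attention scores for one head;
    future (non-causal) positions are masked with -inf."""
    return [[-slope * (q - k) if k <= q else float("-inf")
             for k in range(seq_len)]
            for q in range(seq_len)]
```

Because ALiBi encodes position as a bias rather than a learned embedding, it extrapolates to sequences longer than the 2048-token training context.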
PHASE 4 • PUBLISHED

FINE-TUNING (SFT)

V1: 3,790 pairs generated with Claude Opus 4.5. V2: 7,595 pairs from 707 source groups under 11 generation rules, generated with Claude Sonnet 4.6. RAG-grounded generation pipeline complete.

7,595 EXAMPLES 707 GROUPS
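An SFT dataset of this shape is ultimately a list of rendered (instruction, response) strings. A minimal sketch, with a hypothetical prompt template and special tokens (the project's actual format is not shown here), plus the deduplication any generated dataset needs:

```python
def format_sft_example(instruction, response, bos="<s>", eos="</s>"):
    """Render one pair into a training string.
    The "### Soru/Cevap" template and <s>/</s> tokens are illustrative."""
    return f"{bos}### Soru:\n{instruction}\n\n### Cevap:\n{response}{eos}"

def build_dataset(pairs):
    """Format pairs, dropping near-duplicate instructions
    (guards against repeated generations from the same source group)."""
    seen, out = set(), []
    for instruction, response in pairs:
        key = instruction.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        out.append(format_sft_example(instruction, response))
    return out
```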
PHASE 5 • FUTURE

REINFORCEMENT LEARNING

DPO on preference pairs, or RLVR (reinforcement learning with verifiable rewards). Optional: at this model scale, SFT is the critical phase.

DPO / RLVR OPTIONAL
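The DPO objective itself is one line: given summed log-probabilities of a chosen and a rejected response under the policy and a frozen reference model, it is the negative log-sigmoid of the scaled log-ratio margin. A minimal sketch for a single preference pair (assuming log-probs are already computed):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where margin compares policy-vs-reference log-ratios of chosen vs rejected."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; it falls as the policy favors the chosen response more than the reference does.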