LOCAL RAG SYSTEM FOR ERP
Building a fully local, private RAG (Retrieval-Augmented Generation) system that answers questions about the Solen ERP using a 522MB language model. Zero cloud dependency. Zero cost. Runs on a MacBook Air.
1. SYSTEM OVERVIEW
The ERP system has 8 modules, 200+ API endpoints, 50+ database tables, and bilingual documentation (EN/TR). We built a RAG system that lets any user — technical or not — ask questions in natural language and get accurate, sourced answers.
2. ARCHITECTURE & PIPELINE
The pipeline, stage by stage (each stage is detailed in its own section below):
- Query input — any language, any charset ("projeksiyon nasil calisiyor", Turkish for "how does projection work")
- BM25 keyword index — rank_bm25 · custom bilingual tokenizer · 309 documents
- ChromaDB vector store — BAAI/bge-m3 embeddings · 1024-dim · cosine similarity
- Reciprocal Rank Fusion — k=60 · merges keyword + semantic rankings
- Cross-encoder reranker — KILLED: 7s latency for marginal quality gain
- Context builder — lean context format · [Module > Section] tags · best-first ordering
- LLM generation — chain-of-thought enabled · num_ctx=2048 · num_predict=800
- Fallback retry — automatic · only triggers when the model says "I don't know"
- Answer — user-friendly language · no jargon · breadcrumb citations
3. DATA INGESTION
The ERP documentation exists as HTML files — one per module, bilingual (EN/TR). The ingestion pipeline parses these into structured chunks with rich metadata.
Ingestion Pipeline
- HTML Parsing — html_parser.py extracts sections from the documentation HTML files, preserving heading hierarchy, code blocks, tables, and API endpoint patterns.
- Chunking — chunker.py splits sections into chunks (target ~500 tokens each). Each chunk carries: module, language, breadcrumb, section_id, has_api_endpoints, has_table, has_code, db_tables, api_endpoints.
- Output — 309 chunks saved as all_chunks.json with a manifest.json summarizing the ingestion.
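The chunk record can be sketched as a small dataclass. The field names come straight from the pipeline description above; the exact types and the `text` field are assumptions about the real schema.

```python
# Minimal sketch of a chunk record; field names from the ingestion pipeline,
# types assumed.
from dataclasses import dataclass, field, asdict

@dataclass
class Chunk:
    module: str
    language: str                 # "en" or "tr"
    breadcrumb: str               # e.g. "Stock & Inventory > Projeksiyon Data Flow"
    section_id: str
    text: str                     # assumed: the ~500-token chunk body
    has_api_endpoints: bool = False
    has_table: bool = False
    has_code: bool = False
    db_tables: list[str] = field(default_factory=list)
    api_endpoints: list[str] = field(default_factory=list)

chunk = Chunk(
    module="Stock & Inventory",
    language="en",
    breadcrumb="Stock & Inventory > Projeksiyon Data Flow",
    section_id="projeksiyon-data-flow",
    text="The core of Projeksiyon lives in ProjeksiyonService...",
    has_code=True,
    db_tables=["half_product_stock"],
)
print(asdict(chunk)["module"])  # → Stock & Inventory
```

Serializing a list of these with `asdict` gives exactly the kind of records that land in all_chunks.json.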
Quality Assurance: 3 Test Levels
| TEST | FILE | CHECKS | RESULT |
|---|---|---|---|
| L1: Data Quality | test_data_quality.py | Schema, types, ranges, duplicates | ALL PASS |
| L2: Deep Quality | test_data_quality_l2.py | Metadata coherence, cross-references | ALL PASS |
| Final Boss | test_final_boss.py | 66 ground-truth assertions | 66/66 PASS |
4. BM25 KEYWORD INDEX
Okapi BM25 for keyword-level matching. Critical for exact terms like table names (half_product_stock), API paths, and Turkish technical terms that embedding models may miss.
Custom Tokenizer
Built a bilingual tokenizer that:
- Preserves underscores (critical for table_names and api_endpoints)
- Handles Turkish characters (ç, ğ, ı, ö, ş, ü)
- Removes EN + TR stop words (70+ words combined)
- Minimum token length: 2 characters
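The tokenizer's rules above fit in a few lines. A minimal sketch, with an abbreviated stop-word set standing in for the real 70+ word EN + TR list:

```python
import re

# Abbreviated stand-in for the combined EN + TR stop-word list (70+ words).
STOP_WORDS = {"the", "is", "a", "of", "and", "ve", "bir", "bu", "için", "ile"}

def tokenize(text: str) -> list[str]:
    # \w+ matches Unicode word characters (ç, ğ, ı, ö, ş, ü included) plus
    # underscores, so identifiers like half_product_stock stay single tokens.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]

print(tokenize("The half_product_stock tablosu ve projeksiyon"))
# → ['half_product_stock', 'tablosu', 'projeksiyon']
```

Feeding these token lists into rank_bm25's BM25Okapi gives the keyword index; because underscores survive tokenization, an exact query for a table name ranks its chunk first.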
5. CHROMADB VECTOR STORE
Semantic search using BAAI/bge-m3 (multilingual, 1024-dimensional embeddings) stored in a persistent ChromaDB collection with cosine similarity.
The $contains Bug
ChromaDB 1.5.x's $contains operator does NOT perform substring matching despite its name. Discovered this through test failures. Fixed by implementing Python-side substring resolution in _resolve_modules() that converts to $in with exact module names.
6. HYBRID SEARCH & RECIPROCAL RANK FUSION
BM25 catches exact keywords. ChromaDB catches semantic meaning. Neither alone is sufficient. RRF merges their ranked lists into a single ranking without needing to normalize their incompatible score scales.
RRF Formula
score(d) = Σ 1 / (k + rank_i(d)) where k = 60
Documents that appear in both rankings get boosted. Documents that appear in only one still contribute. The k=60 constant (from Cormack et al., 2009) prevents top-1 results from dominating.
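The formula is a few lines of code. A minimal sketch, assuming 1-based ranks (the k=60 constant matches the text):

```python
# Reciprocal Rank Fusion: score(d) = Σ 1 / (k + rank_i(d)), k = 60.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["c12", "c07", "c33"]    # keyword ranking
vector_top = ["c07", "c45", "c12"]  # semantic ranking
print(rrf_fuse([bm25_top, vector_top]))  # → ['c07', 'c12', 'c45', 'c33']
```

Note how c07 (ranks 2 and 1) edges out c12 (ranks 1 and 3): appearing high in both lists beats topping just one, which is exactly the boosting behavior described above.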
Retrieval Final Boss Test
A 6-phase test suite validating the full BM25 + ChromaDB pipeline:
| PHASE | DESCRIPTION | RESULT |
|---|---|---|
| 1. BM25 Ground Truth | Known queries must find known chunks | PASS |
| 2. ChromaDB Semantic | Meaning-based queries find relevant docs | PASS |
| 3. Cross-Engine Consistency | Both engines agree on top results | PASS |
| 4. Filter Integrity | Language/module filters work correctly | PASS |
| 5. Edge Cases | Empty queries, special chars, long queries | PASS |
| 6. Latency | Search completes under 500ms | PASS |
7. CROSS-ENCODER RERANKER KILLED
We built a cross-encoder reranking stage using BAAI/bge-reranker-v2-m3 (~1.1GB). It sees query+document together for more accurate scoring than bi-encoder similarity alone.
A/B Test Results
| METRIC | WITHOUT RERANKER | WITH RERANKER |
|---|---|---|
| Retrieval latency | ~150ms | ~7,000ms |
| Quality (manual eval) | Good | Slightly better |
| RAM usage | ~2.3GB | ~3.4GB (+1.1GB) |
The reranker code remains in retrieve/reranker.py for future use on a server with proper cooling. For interactive local use: killed.
8. THE GRAND MODEL TOURNAMENT
Before choosing the generation model, we ran a rigorous tournament: 9 models, 29 test cases, covering hallucination resistance, Turkish language, precision, instruction following, multi-hop reasoning, and adversarial prompts.
Contestants
| MODEL | SIZE | PASSED | AVG LATENCY | TOK/S |
|---|---|---|---|---|
| gemma3:4b | 3.3 GB | 27/29 | 4,831ms | 62 |
| qwen3:4b | 2.6 GB | 26/29 | 3,773ms | 82 |
| phi4-mini | 2.5 GB | 25/29 | 4,226ms | 74 |
| qwen3:1.7b | 1.1 GB | 25/29 | 2,175ms | 110 |
| qwen3:0.6b ← CHOSEN | 522 MB | 24/29 | 1,488ms | 125 |
| qwen2.5:3b | 1.9 GB | 23/29 | 3,432ms | 82 |
| llama3.2:3b | 2.0 GB | 22/29 | 2,861ms | 94 |
| gemma3:1b | 815 MB | 21/29 | 1,220ms | 143 |
| llama3.2:1b | 1.3 GB | 19/29 | 1,306ms | 133 |
Why qwen3:0.6b?
- 24/29 tests passed — only 3 fewer than the 4B champion, at 1/6th the size
- 522MB — fits in memory alongside the ERP system and embedding model
- 125 tok/s — fastest quality model in the tournament
- Chain-of-thought — uses <think> tokens internally, punching above its weight
Test Categories (29 tests)
| CATEGORY | COUNT | WHAT IT TESTS |
|---|---|---|
| HALL (Hallucination) | 3 | Refuses to answer when context lacks info |
| TR (Turkish) | 3 | Turkish input/output without special chars |
| ADV (Adversarial) | 3 | Prompt injection, override attempts |
| NUM (Precision) | 3 | Exact numbers, counts, specific values |
| LONG (Long Context) | 3 | Multi-paragraph reasoning |
| INST (Instructions) | 3 | Format compliance (list, single-line, etc.) |
| HOP (Multi-hop) | 3 | Chaining facts across context sections |
| AMB (Ambiguity) | 3 | Contradictory info, missing data handling |
| CODE (Technical) | 3 | SQL, API path extraction, code understanding |
| DEG (Degenerate) | 2 | Empty context, gibberish input |
9. GENERATION LAYER
The generation layer wraps Ollama's API with thermal-aware settings optimized for the M4 MacBook Air.
LLM Configuration
| PARAMETER | VALUE | REASON |
|---|---|---|
| model | qwen3:0.6b | Best quality-to-size ratio from tournament |
| temperature | 0.1 | Low creativity, high accuracy for RAG |
| num_predict | 800 | Room for thinking tokens + answer |
| num_ctx | 2048 | We only use ~1000 tokens input; reduces CPU load |
The Thinking Token Discovery
qwen3 models use internal <think>...</think> chains. Ollama strips these from visible output but still counts them in eval_count. With num_predict=400, some answers were empty because all tokens went to thinking.
Solution: increase to 800. The model self-regulates — simple questions use ~150 tokens total, complex ones use ~400. The higher ceiling only activates when needed.
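Putting the configuration together, the generation call can be sketched against Ollama's /api/generate HTTP endpoint. The option names and endpoint shape follow Ollama's API; the helper functions and prompt layout are illustrative assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(question: str, context: str, system: str) -> dict:
    return {
        "model": "qwen3:0.6b",
        "system": system,
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "stream": False,
        "options": {
            "temperature": 0.1,  # low creativity, high accuracy for RAG
            "num_ctx": 2048,     # small KV cache: less heat on the fanless M4
            "num_predict": 800,  # headroom for <think> tokens plus the answer
        },
    }

def ask(question: str, context: str, system: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(question, context, system)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:       # requires a running Ollama
        return json.loads(resp.read())["response"]  # <think> already stripped

print(build_payload("q", "ctx", "sys")["options"]["num_predict"])  # → 800
```

Because Ollama strips the thinking tokens from `response` but counts them in `eval_count`, the 800-token ceiling in `num_predict` is what keeps complex answers from coming back empty.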
10. PROMPT ENGINEERING
Three iterations of system prompt design, each driven by test failures.
Iteration 1: Developer Mode (Abandoned)
"Answer from context only. If not found, say 'Not in context.' Cite sources as [Module > Section]. Be concise."
Problem: Answers were too technical. Users aren't developers.
Iteration 2: Heavy Rules (Failed)
"You are a friendly ERP helper for office workers.
Rules:
1) Explain step-by-step: which screen, which button, what to type.
2) NEVER say: API, endpoint, database, JWT, token, backend...
3) If context mentions '/api/...' translate it..."
Problem: Catastrophic. The 0.6B model couldn't handle the complex system prompt. Three different questions all returned "Go to Suppliers page and click Delete" — the model latched onto a template and repeated it regardless of the question.
Iteration 3: Slim Prompt (Current)
"You help office workers use the ERP system. Use simple language. No technical jargon. If not found, say 'I don't have information about that.'"
Key insight: For a 0.6B model, less instruction = better output. A heavy prompt chokes the tiny brain. A light one lets it use its capacity for actual reasoning.
Context Format
--- [Stock & Inventory > Projeksiyon Data Flow] (en) ---
The core of Projeksiyon lives in ProjeksiyonService.get_projeksiyon().
It reads every non-cancelled Work Card and Material Order...
--- [Hammadde > Supplier Management] (tr) ---
DELETE /api/suppliers/{id}/hard-delete endpoint...
Each chunk gets a lean [Module > Section] (lang) tag. Chunks ordered by relevance (best first) to combat the model's last-chunk-blindness.
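The context builder itself is trivial. A minimal sketch, with chunk dicts mirroring the metadata fields named in the ingestion section:

```python
# Lean [Module > Section] (lang) tags, best chunk first.
def build_context(chunks: list[dict]) -> str:
    parts = []
    for c in chunks:  # chunks arrive already sorted by fused relevance
        parts.append(f"--- [{c['breadcrumb']}] ({c['language']}) ---\n{c['text']}")
    return "\n".join(parts)

chunks = [
    {"breadcrumb": "Stock & Inventory > Projeksiyon Data Flow", "language": "en",
     "text": "The core of Projeksiyon lives in ProjeksiyonService..."},
    {"breadcrumb": "Hammadde > Supplier Management", "language": "tr",
     "text": "DELETE /api/suppliers/{id}/hard-delete endpoint..."},
]
print(build_context(chunks).splitlines()[0])
# → --- [Stock & Inventory > Projeksiyon Data Flow] (en) ---
```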
11. FALLBACK RETRY MECHANISM
When the model says "I don't have information about that," the system automatically retries with 5 chunks instead of 3. This catches cases where the answer lives in chunk #4 or #5.
Detection Patterns
NO_CONTEXT_PATTERNS = [
"not in context",
"don't have information",
"bilgim yok",
"bağlamda .* yok",
"bulunamadı",
...
]
Why Not Always Use 5 Chunks?
A/B testing showed that 5 chunks reduces quality for the 0.6B model:
| METRIC | 3 CHUNKS | 5 CHUNKS |
|---|---|---|
| Correct answers | 9/10 | 8/10 |
| Hallucinations | 0 | 2 |
| Avg speed | 3.7s | 4.3s |
More context = more confusion for a tiny model. The irrelevant chunks become noise. With 5 chunks, the model hallucinated on two questions that it answered correctly (or honestly refused) with 3 chunks.
The retry mechanism gives us the best of both worlds: precision-first with 3 chunks, recall as fallback with 5.
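The whole mechanism can be sketched in a dozen lines. `retrieve` and `generate` below are hypothetical stand-ins for the real pipeline functions, and the pattern list is a subset of the NO_CONTEXT_PATTERNS shown above:

```python
import re

# Subset of the real refusal-detection patterns (EN + TR).
NO_CONTEXT_PATTERNS = ["not in context", "don't have information", "bilgim yok"]

def looks_like_refusal(answer: str) -> bool:
    low = answer.lower()
    return any(re.search(p, low) for p in NO_CONTEXT_PATTERNS)

def answer_with_retry(question, retrieve, generate):
    # Precision-first pass with 3 chunks.
    answer = generate(question, retrieve(question, k=3))
    if looks_like_refusal(answer):
        # Recall fallback: the answer may live in chunk #4 or #5.
        answer = generate(question, retrieve(question, k=5))
    return answer

# Tiny demo with stubbed pipeline functions:
fake_retrieve = lambda q, k: [f"chunk{i}" for i in range(k)]
fake_generate = lambda q, ctx: ("I don't have information about that."
                                if len(ctx) < 5 else "Found it in chunk #5.")
print(answer_with_retry("q", fake_retrieve, fake_generate))
# → Found it in chunk #5.
```

The retry only fires on a detected refusal, which is why it adds latency to roughly 10% of queries and leaves the fast 3-chunk path untouched for the rest.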
12. BENCHMARKS VS OPENAI
Same 10 questions, same 3 chunks of context, same system prompt. Local qwen3:0.6b vs OpenAI cloud models.
| MODEL | CORRECT | HALLUCINATIONS | AVG LATENCY | COST/QUERY | OFFLINE |
|---|---|---|---|---|---|
| qwen3:0.6b (local) | 8/10 | 1 | 3,491ms | $0 | YES |
| gpt-4.1-nano | 7/10 | 0 | 2,087ms | ~$0.0002 | NO |
| gpt-4o-mini | 8/10 | 0 | 2,564ms | ~$0.0002 | NO |
| gpt-4.1-mini | 8/10 | 0 | 3,039ms | ~$0.0003 | NO |
Key Findings
- Quality is tied. A 522MB local model matches GPT-4o-mini at 8/10 correct answers.
- Different failure modes: Cloud models are more conservative (refuse when uncertain). Qwen is more aggressive (infers, but occasionally hallucinates).
- Latency is comparable. Local 3.5s vs cloud 2-3s. Network overhead vs faster compute — essentially the same user experience.
- The killer difference: Local is free, offline, private. ERP data never leaves the machine.
GPT-5 series models (reasoning models with internal chain-of-thought) were tested but required 2000+ completion tokens and 5-10s per query. Not practical for RAG.
13. THERMAL MANAGEMENT
The M4 MacBook Air has passive cooling (no fan). Running the embedding model + LLM + ERP system pushes CPU to 99°C. Three mitigations:
| FIX | IMPACT |
|---|---|
| num_ctx: 2048 (was 32K default) | Massive reduction in KV cache memory and compute |
| num_predict: 800 (was 2000) | Model generates less = less sustained GPU load |
| 1s sleep between batch queries | Lets passive cooling catch up between requests |
For production use, this should run on the ERP server (with proper cooling), not the development laptop.
14. COMPLETE CODEBASE
15. FINAL RESULTS & FINDINGS
What We Learned
- Less is more for small models. A complex system prompt destroys a 0.6B model. A slim one lets it think. 3 chunks beat 5 chunks. Shorter context = fewer hallucinations.
- Chain-of-thought is the secret weapon. qwen3:0.6b's internal thinking is why it competes with 3B+ models. Killing it (via /nothink) would cripple quality.
- Hybrid search is non-negotiable. BM25 alone misses semantic matches. Vector search alone misses exact keywords. RRF fusion combines them cleanly.
- The reranker is a trap. Sounds great in theory. In practice: 90% of pipeline latency for marginal quality gain. Killed.
- Retry-on-refusal is cheap and effective. Only triggers ~10% of queries. Adds 3.5s only when needed. Recovers answers that would otherwise be "I don't know."
- Local matches cloud. A 522MB model on a laptop ties with GPT-4o-mini on answer quality. The engineering matters more than the model size.
- Thermal awareness is real engineering. On fanless hardware, num_ctx and num_predict settings directly affect whether the machine throttles.
What's Next
- FastAPI service for the ERP frontend to call
- WebSocket streaming for real-time answer display
- User feedback loop to improve retrieval
- Deploy to the ERP server (proper cooling, dedicated resources)