LOCAL RAG SYSTEM FOR ERP
Building a fully local, private RAG (Retrieval-Augmented Generation) system that answers questions about the Solen ERP using a 522MB language model. Zero cloud dependency. Zero cost. Runs on a MacBook Air.
1. SYSTEM OVERVIEW
The ERP system has 8 modules, 200+ API endpoints, 50+ database tables, and bilingual documentation (EN/TR). We built a RAG system that lets any user — technical or not — ask questions in natural language and get accurate, sourced answers.
2. ARCHITECTURE & PIPELINE
The pipeline, stage by stage (each stage is detailed in its own section below):
- Query input — any language, any charset ("projeksiyon nasil calisiyor", Turkish for "how does projection work")
- BM25 keyword index — rank_bm25 · custom bilingual tokenizer · 309 documents
- ChromaDB vector store — BAAI/bge-m3 embeddings · 1024-dim · cosine similarity
- Reciprocal Rank Fusion — k=60 · merges keyword + semantic rankings
- Cross-encoder reranker — KILLED: 7s latency for marginal quality gain
- Context builder — lean context format · [Module > Section] tags · best-first ordering
- LLM generation — chain-of-thought enabled · num_ctx=2048 · num_predict=800
- Fallback retry — automatic · only triggers when the model says "I don't know"
- Answer — user-friendly language · no jargon · breadcrumb citations
3. DATA INGESTION
The ERP documentation exists as HTML files — one per module, bilingual (EN/TR). The ingestion pipeline parses these into structured chunks with rich metadata.
Ingestion Pipeline
- HTML Parsing — html_parser.py extracts sections from the documentation HTML files, preserving heading hierarchy, code blocks, tables, and API endpoint patterns.
- Chunking — chunker.py splits sections into chunks (target ~500 tokens each). Each chunk carries: module, language, breadcrumb, section_id, has_api_endpoints, has_table, has_code, db_tables, api_endpoints.
- Output — 309 chunks saved as all_chunks.json with a manifest.json summarizing the ingestion.
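The chunk record can be sketched as a small dataclass. The field names come straight from the pipeline description above; the exact types and the `text` field are assumptions about the real schema.

```python
# Minimal sketch of a chunk record; field names from the ingestion pipeline,
# types assumed.
from dataclasses import dataclass, field, asdict

@dataclass
class Chunk:
    module: str
    language: str                 # "en" or "tr"
    breadcrumb: str               # e.g. "Stock & Inventory > Projeksiyon Data Flow"
    section_id: str
    text: str                     # assumed: the ~500-token chunk body
    has_api_endpoints: bool = False
    has_table: bool = False
    has_code: bool = False
    db_tables: list[str] = field(default_factory=list)
    api_endpoints: list[str] = field(default_factory=list)

chunk = Chunk(
    module="Stock & Inventory",
    language="en",
    breadcrumb="Stock & Inventory > Projeksiyon Data Flow",
    section_id="projeksiyon-data-flow",
    text="The core of Projeksiyon lives in ProjeksiyonService...",
    has_code=True,
    db_tables=["half_product_stock"],
)
print(asdict(chunk)["module"])  # → Stock & Inventory
```

Serializing a list of these with `asdict` gives exactly the kind of records that land in all_chunks.json.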
Quality Assurance: 3 Test Levels
| TEST | FILE | CHECKS | RESULT |
|---|---|---|---|
| L1: Data Quality | test_data_quality.py | Schema, types, ranges, duplicates | ALL PASS |
| L2: Deep Quality | test_data_quality_l2.py | Metadata coherence, cross-references | ALL PASS |
| Final Boss | test_final_boss.py | 66 ground-truth assertions | 66/66 PASS |
4. BM25 KEYWORD INDEX
Okapi BM25 for keyword-level matching. Critical for exact terms like table names (half_product_stock), API paths, and Turkish technical terms that embedding models may miss.
Custom Tokenizer
Built a bilingual tokenizer that:
- Preserves underscores (critical for table_names and api_endpoints)
- Handles Turkish characters (ç, ğ, ı, ö, ş, ü)
- Removes EN + TR stop words (70+ words combined)
- Minimum token length: 2 characters
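The tokenizer's rules above fit in a few lines. A minimal sketch, with an abbreviated stop-word set standing in for the real 70+ word EN + TR list:

```python
import re

# Abbreviated stand-in for the combined EN + TR stop-word list (70+ words).
STOP_WORDS = {"the", "is", "a", "of", "and", "ve", "bir", "bu", "için", "ile"}

def tokenize(text: str) -> list[str]:
    # \w+ matches Unicode word characters (ç, ğ, ı, ö, ş, ü included) plus
    # underscores, so identifiers like half_product_stock stay single tokens.
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]

print(tokenize("The half_product_stock tablosu ve projeksiyon"))
# → ['half_product_stock', 'tablosu', 'projeksiyon']
```

Feeding these token lists into rank_bm25's BM25Okapi gives the keyword index; because underscores survive tokenization, an exact query for a table name ranks its chunk first.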
5. CHROMADB VECTOR STORE
Semantic search using BAAI/bge-m3 (multilingual, 1024-dimensional embeddings) stored in a persistent ChromaDB collection with cosine similarity.
The $contains Bug
ChromaDB 1.5.x's $contains operator does NOT perform substring matching despite its name. Discovered this through test failures. Fixed by implementing Python-side substring resolution in _resolve_modules() that converts to $in with exact module names.
6. HYBRID SEARCH & RECIPROCAL RANK FUSION
BM25 catches exact keywords. ChromaDB catches semantic meaning. Neither alone is sufficient. RRF merges their ranked lists into a single ranking without needing to normalize their incompatible score scales.
RRF Formula
score(d) = Σ 1 / (k + rank_i(d)) where k = 60
Documents that appear in both rankings get boosted. Documents that appear in only one still contribute. The k=60 constant (from Cormack et al., 2009) prevents top-1 results from dominating.
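The formula is a few lines of code. A minimal sketch, assuming 1-based ranks (the k=60 constant matches the text):

```python
# Reciprocal Rank Fusion: score(d) = Σ 1 / (k + rank_i(d)), k = 60.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["c12", "c07", "c33"]    # keyword ranking
vector_top = ["c07", "c45", "c12"]  # semantic ranking
print(rrf_fuse([bm25_top, vector_top]))  # → ['c07', 'c12', 'c45', 'c33']
```

Note how c07 (ranks 2 and 1) edges out c12 (ranks 1 and 3): appearing high in both lists beats topping just one, which is exactly the boosting behavior described above.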
Retrieval Final Boss Test
A 6-phase test suite validating the full BM25 + ChromaDB pipeline:
| PHASE | DESCRIPTION | RESULT |
|---|---|---|
| 1. BM25 Ground Truth | Known queries must find known chunks | PASS |
| 2. ChromaDB Semantic | Meaning-based queries find relevant docs | PASS |
| 3. Cross-Engine Consistency | Both engines agree on top results | PASS |
| 4. Filter Integrity | Language/module filters work correctly | PASS |
| 5. Edge Cases | Empty queries, special chars, long queries | PASS |
| 6. Latency | Search completes under 500ms | PASS |
7. CROSS-ENCODER RERANKER KILLED
We built a cross-encoder reranking stage using BAAI/bge-reranker-v2-m3 (~1.1GB). It sees query+document together for more accurate scoring than bi-encoder similarity alone.
A/B Test Results
| METRIC | WITHOUT RERANKER | WITH RERANKER |
|---|---|---|
| Retrieval latency | ~150ms | ~7,000ms |
| Quality (manual eval) | Good | Slightly better |
| RAM usage | ~2.3GB | ~3.4GB (+1.1GB) |
The reranker code remains in retrieve/reranker.py for future use on a server with proper cooling. For interactive local use: killed.
8. THE GRAND MODEL TOURNAMENT
Before choosing the generation model, we ran a rigorous tournament: 9 models, 29 test cases, covering hallucination resistance, Turkish language, precision, instruction following, multi-hop reasoning, and adversarial prompts.
Contestants
| MODEL | SIZE | PASSED | AVG LATENCY | TOK/S |
|---|---|---|---|---|
| gemma3:4b | 3.3 GB | 27/29 | 4,831ms | 62 |
| qwen3:4b | 2.6 GB | 26/29 | 3,773ms | 82 |
| phi4-mini | 2.5 GB | 25/29 | 4,226ms | 74 |
| qwen3:1.7b | 1.1 GB | 25/29 | 2,175ms | 110 |
| qwen3:0.6b ← CHOSEN | 522 MB | 24/29 | 1,488ms | 125 |
| qwen2.5:3b | 1.9 GB | 23/29 | 3,432ms | 82 |
| llama3.2:3b | 2.0 GB | 22/29 | 2,861ms | 94 |
| gemma3:1b | 815 MB | 21/29 | 1,220ms | 143 |
| llama3.2:1b | 1.3 GB | 19/29 | 1,306ms | 133 |
Why qwen3:0.6b?
- 24/29 tests passed — only 3 fewer than the 4B champion, at 1/6th the size
- 522MB — fits in memory alongside the ERP system and embedding model
- 125 tok/s — fastest quality model in the tournament
- Chain-of-thought — uses <think> tokens internally, punching above its weight
Test Categories (29 tests)
| CATEGORY | COUNT | WHAT IT TESTS |
|---|---|---|
| HALL (Hallucination) | 3 | Refuses to answer when context lacks info |
| TR (Turkish) | 3 | Turkish input/output without special chars |
| ADV (Adversarial) | 3 | Prompt injection, override attempts |
| NUM (Precision) | 3 | Exact numbers, counts, specific values |
| LONG (Long Context) | 3 | Multi-paragraph reasoning |
| INST (Instructions) | 3 | Format compliance (list, single-line, etc.) |
| HOP (Multi-hop) | 3 | Chaining facts across context sections |
| AMB (Ambiguity) | 3 | Contradictory info, missing data handling |
| CODE (Technical) | 3 | SQL, API path extraction, code understanding |
| DEG (Degenerate) | 2 | Empty context, gibberish input |
9. GENERATION LAYER
The generation layer wraps Ollama's API with thermal-aware settings optimized for the M4 MacBook Air.
LLM Configuration
| PARAMETER | VALUE | REASON |
|---|---|---|
| model | qwen3:0.6b | Best quality-to-size ratio from tournament |
| temperature | 0.1 | Low creativity, high accuracy for RAG |
| num_predict | 800 | Room for thinking tokens + answer |
| num_ctx | 2048 | We only use ~1000 tokens input; reduces CPU load |
The Thinking Token Discovery
qwen3 models use internal <think>...</think> chains. Ollama strips these from visible output but still counts them in eval_count. With num_predict=400, some answers were empty because all tokens went to thinking.
Solution: increase to 800. The model self-regulates — simple questions use ~150 tokens total, complex ones use ~400. The higher ceiling only activates when needed.
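Putting the configuration together, the generation call can be sketched against Ollama's /api/generate HTTP endpoint. The option names and endpoint shape follow Ollama's API; the helper functions and prompt layout are illustrative assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(question: str, context: str, system: str) -> dict:
    return {
        "model": "qwen3:0.6b",
        "system": system,
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "stream": False,
        "options": {
            "temperature": 0.1,  # low creativity, high accuracy for RAG
            "num_ctx": 2048,     # small KV cache: less heat on the fanless M4
            "num_predict": 800,  # headroom for <think> tokens plus the answer
        },
    }

def ask(question: str, context: str, system: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(question, context, system)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:       # requires a running Ollama
        return json.loads(resp.read())["response"]  # <think> already stripped

print(build_payload("q", "ctx", "sys")["options"]["num_predict"])  # → 800
```

Because Ollama strips the thinking tokens from `response` but counts them in `eval_count`, the 800-token ceiling in `num_predict` is what keeps complex answers from coming back empty.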
10. PROMPT ENGINEERING
Three iterations of system prompt design, each driven by test failures.
Iteration 1: Developer Mode (Abandoned)
"Answer from context only. If not found, say 'Not in context.' Cite sources as [Module > Section]. Be concise."
Problem: Answers were too technical. Users aren't developers.
Iteration 2: Heavy Rules (Failed)
"You are a friendly ERP helper for office workers.
Rules:
1) Explain step-by-step: which screen, which button, what to type.
2) NEVER say: API, endpoint, database, JWT, token, backend...
3) If context mentions '/api/...' translate it..."
Problem: Catastrophic. The 0.6B model couldn't handle the complex system prompt. Three different questions all returned "Go to Suppliers page and click Delete" — the model latched onto a template and repeated it regardless of the question.
Iteration 3: Slim Prompt (Current)
"You help office workers use the ERP system. Use simple language. No technical jargon. If not found, say 'I don't have information about that.'"
Key insight: For a 0.6B model, less instruction = better output. A heavy prompt chokes the tiny brain. A light one lets it use its capacity for actual reasoning.
Context Format
--- [Stock & Inventory > Projeksiyon Data Flow] (en) ---
The core of Projeksiyon lives in ProjeksiyonService.get_projeksiyon().
It reads every non-cancelled Work Card and Material Order...
--- [Hammadde > Supplier Management] (tr) ---
DELETE /api/suppliers/{id}/hard-delete endpoint...
Each chunk gets a lean [Module > Section] (lang) tag. Chunks ordered by relevance (best first) to combat the model's last-chunk-blindness.
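The context builder itself is trivial. A minimal sketch, with chunk dicts mirroring the metadata fields named in the ingestion section:

```python
# Lean [Module > Section] (lang) tags, best chunk first.
def build_context(chunks: list[dict]) -> str:
    parts = []
    for c in chunks:  # chunks arrive already sorted by fused relevance
        parts.append(f"--- [{c['breadcrumb']}] ({c['language']}) ---\n{c['text']}")
    return "\n".join(parts)

chunks = [
    {"breadcrumb": "Stock & Inventory > Projeksiyon Data Flow", "language": "en",
     "text": "The core of Projeksiyon lives in ProjeksiyonService..."},
    {"breadcrumb": "Hammadde > Supplier Management", "language": "tr",
     "text": "DELETE /api/suppliers/{id}/hard-delete endpoint..."},
]
print(build_context(chunks).splitlines()[0])
# → --- [Stock & Inventory > Projeksiyon Data Flow] (en) ---
```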
11. FALLBACK RETRY MECHANISM
When the model says "I don't have information about that," the system automatically retries with 5 chunks instead of 3. This catches cases where the answer lives in chunk #4 or #5.
Detection Patterns
NO_CONTEXT_PATTERNS = [
"not in context",
"don't have information",
"bilgim yok",
"bağlamda .* yok",
"bulunamadı",
...
]
Why Not Always Use 5 Chunks?
A/B testing showed that 5 chunks reduces quality for the 0.6B model:
| METRIC | 3 CHUNKS | 5 CHUNKS |
|---|---|---|
| Correct answers | 9/10 | 8/10 |
| Hallucinations | 0 | 2 |
| Avg speed | 3.7s | 4.3s |
More context = more confusion for a tiny model. The irrelevant chunks become noise. With 5 chunks, the model hallucinated on two questions that it answered correctly (or honestly refused) with 3 chunks.
The retry mechanism gives us the best of both worlds: precision-first with 3 chunks, recall as fallback with 5.
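The whole mechanism can be sketched in a dozen lines. `retrieve` and `generate` below are hypothetical stand-ins for the real pipeline functions, and the pattern list is a subset of the NO_CONTEXT_PATTERNS shown above:

```python
import re

# Subset of the real refusal-detection patterns (EN + TR).
NO_CONTEXT_PATTERNS = ["not in context", "don't have information", "bilgim yok"]

def looks_like_refusal(answer: str) -> bool:
    low = answer.lower()
    return any(re.search(p, low) for p in NO_CONTEXT_PATTERNS)

def answer_with_retry(question, retrieve, generate):
    # Precision-first pass with 3 chunks.
    answer = generate(question, retrieve(question, k=3))
    if looks_like_refusal(answer):
        # Recall fallback: the answer may live in chunk #4 or #5.
        answer = generate(question, retrieve(question, k=5))
    return answer

# Tiny demo with stubbed pipeline functions:
fake_retrieve = lambda q, k: [f"chunk{i}" for i in range(k)]
fake_generate = lambda q, ctx: ("I don't have information about that."
                                if len(ctx) < 5 else "Found it in chunk #5.")
print(answer_with_retry("q", fake_retrieve, fake_generate))
# → Found it in chunk #5.
```

The retry only fires on a detected refusal, which is why it adds latency to roughly 10% of queries and leaves the fast 3-chunk path untouched for the rest.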
12. BENCHMARKS VS OPENAI
Same 10 questions, same 3 chunks of context, same system prompt. Local qwen3:0.6b vs OpenAI cloud models.
| MODEL | CORRECT | HALLUCINATIONS | AVG LATENCY | COST/QUERY | OFFLINE |
|---|---|---|---|---|---|
| qwen3:0.6b (local) | 8/10 | 1 | 3,491ms | $0 | YES |
| gpt-4.1-nano | 7/10 | 0 | 2,087ms | ~$0.0002 | NO |
| gpt-4o-mini | 8/10 | 0 | 2,564ms | ~$0.0002 | NO |
| gpt-4.1-mini | 8/10 | 0 | 3,039ms | ~$0.0003 | NO |
Key Findings
- Quality is tied. A 522MB local model matches GPT-4o-mini at 8/10 correct answers.
- Different failure modes: Cloud models are more conservative (refuse when uncertain). Qwen is more aggressive (infers, but occasionally hallucinates).
- Latency is comparable. Local 3.5s vs cloud 2-3s. Network overhead vs faster compute — essentially the same user experience.
- The killer difference: Local is free, offline, private. ERP data never leaves the machine.
GPT-5 series models (reasoning models with internal chain-of-thought) were tested but required 2000+ completion tokens and 5-10s per query. Not practical for RAG.
13. THERMAL MANAGEMENT
The M4 MacBook Air has passive cooling (no fan). Running the embedding model + LLM + ERP system pushes CPU to 99°C. Three mitigations:
| FIX | IMPACT |
|---|---|
| num_ctx: 2048 (was 32K default) | Massive reduction in KV cache memory and compute |
| num_predict: 800 (was 2000) | Model generates less = less sustained GPU load |
| 1s sleep between batch queries | Lets passive cooling catch up between requests |
For production use, this should run on the ERP server (with proper cooling), not the development laptop.
14. COMPLETE CODEBASE
15. FINAL RESULTS & FINDINGS
What We Learned
- Less is more for small models. A complex system prompt destroys a 0.6B model. A slim one lets it think. 3 chunks beat 5 chunks. Shorter context = fewer hallucinations.
- Chain-of-thought is the secret weapon. qwen3:0.6b's internal thinking is why it competes with 3B+ models. Killing it (via /nothink) would cripple quality.
- Hybrid search is non-negotiable. BM25 alone misses semantic matches. Vector search alone misses exact keywords. RRF fusion combines them cleanly.
- The reranker is a trap. Sounds great in theory. In practice: 90% of pipeline latency for marginal quality gain. Killed.
- Retry-on-refusal is cheap and effective. Only triggers ~10% of queries. Adds 3.5s only when needed. Recovers answers that would otherwise be "I don't know."
- Local matches cloud. A 522MB model on a laptop ties with GPT-4o-mini on answer quality. The engineering matters more than the model size.
- Thermal awareness is real engineering. On fanless hardware, num_ctx and num_predict settings directly affect whether the machine throttles.
What's Next
- FastAPI service for the ERP frontend to call
- WebSocket streaming for real-time answer display
- User feedback loop to improve retrieval
- Deploy to the ERP server (proper cooling, dedicated resources)