ACTIVE RESEARCH • FEB 2026

LOCAL RAG SYSTEM FOR ERP

Building a fully local, private RAG (Retrieval-Augmented Generation) system that answers questions about the Solen ERP using a 522MB language model. Zero cloud dependency. Zero cost. Runs on a MacBook Air.

CONTENTS
1. System Overview
2. Architecture & Pipeline
3. Data Ingestion
4. BM25 Keyword Index
5. ChromaDB Vector Store
6. Hybrid Search & RRF
7. Cross-Encoder Reranker (Killed)
8. The Grand Model Tournament
9. Generation Layer
10. Prompt Engineering
11. Fallback Retry Mechanism
12. Benchmarks vs OpenAI
13. Thermal Management
14. Complete Codebase
15. Final Results & Findings

1. SYSTEM OVERVIEW

The ERP system has 8 modules, 200+ API endpoints, 50+ database tables, and bilingual documentation (EN/TR). We built a RAG system that lets any user — technical or not — ask questions in natural language and get accurate, sourced answers.

522MB
MODEL SIZE (qwen3:0.6b)
~3.5s
AVG RESPONSE TIME
106ms
AVG RETRIEVAL TIME
$0
COST PER QUERY
309
INDEXED CHUNKS
~90
TOKENS/SEC

2. ARCHITECTURE & PIPELINE

Q
User Query
Any language, any charset. "projeksiyon nasil calisiyor" ("how does projection work", typed without Turkish diacritics)
1
BM25 Keyword Search
rank_bm25 · custom bilingual tokenizer · 309 documents
~20ms
↓ parallel
2
ChromaDB Vector Search
BAAI/bge-m3 embeddings · 1024-dim · cosine similarity
~80ms
3
Reciprocal Rank Fusion (RRF)
k=60 · merges keyword + semantic rankings
<1ms
4
Cross-Encoder Reranker
KILLED — 7s latency for marginal quality gain
REMOVED
5
Top-3 Chunks → Prompt
Lean context format · [Module > Section] tags · best-first ordering
6
qwen3:0.6b Generation
Chain-of-thought enabled · num_ctx=2048 · num_predict=800
~3.5s
↓ "no info" detected?
R
Fallback Retry with 5 Chunks
Automatic · only triggers when model says "I don't know"
+3.5s
A
Answer + Sources
User-friendly language · no jargon · breadcrumb citations

3. DATA INGESTION

The ERP documentation exists as HTML files — one per module, bilingual (EN/TR). The ingestion pipeline parses these into structured chunks with rich metadata.

Ingestion Pipeline

  1. HTML Parsing — html_parser.py extracts sections from the documentation HTML files, preserving heading hierarchy, code blocks, tables, and API endpoint patterns.
  2. Chunking — chunker.py splits sections into chunks (target ~500 tokens each). Each chunk carries: module, language, breadcrumb, section_id, has_api_endpoints, has_table, has_code, db_tables, api_endpoints.
  3. Output — 309 chunks saved as all_chunks.json with a manifest.json summarizing the ingestion.
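A chunk record implied by this schema might look like the sketch below. Only the field names come from the pipeline description above; the values are invented for illustration, and validate_chunk is a hypothetical helper in the spirit of the L1 data-quality test.

```python
# Illustrative chunk record. Field names match the ingestion schema;
# the values are made up for this example.
example_chunk = {
    "module": "Stock & Inventory",
    "language": "en",
    "breadcrumb": "Stock & Inventory > Projeksiyon Data Flow",
    "section_id": "stock-projeksiyon-data-flow",
    "text": "The core of Projeksiyon lives in ProjeksiyonService...",
    "has_api_endpoints": True,
    "has_table": False,
    "has_code": True,
    "db_tables": ["half_product_stock"],
    "api_endpoints": ["/api/suppliers/{id}/hard-delete"],
}

REQUIRED_FIELDS = {
    "module", "language", "breadcrumb", "section_id",
    "has_api_endpoints", "has_table", "has_code",
    "db_tables", "api_endpoints",
}

def validate_chunk(chunk: dict) -> bool:
    """L1-style check: every required metadata field is present."""
    return REQUIRED_FIELDS.issubset(chunk)
```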

Quality Assurance: 3 Test Levels

TEST | FILE | CHECKS | RESULT
L1: Data Quality | test_data_quality.py | Schema, types, ranges, duplicates | ALL PASS
L2: Deep Quality | test_data_quality_l2.py | Metadata coherence, cross-references | ALL PASS
Final Boss | test_final_boss.py | 66 ground-truth assertions | 66/66 PASS

4. BM25 KEYWORD INDEX

Okapi BM25 for keyword-level matching. Critical for exact terms like table names (half_product_stock), API paths, and Turkish technical terms that embedding models may miss.

Custom Tokenizer

Built a bilingual tokenizer that handles English and Turkish alike: it matches ASCII-folded Turkish queries (e.g. "calisiyor" for "çalışıyor") and keeps technical identifiers such as half_product_stock and API paths intact.

309
DOCUMENTS
8,447
VOCABULARY SIZE
~20ms
SEARCH LATENCY
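A minimal sketch of the kind of tokenizer this implies. The exact rules of the project's tokenizer aren't shown in this writeup, so the diacritic-folding table and the token regex here are assumptions based on the queries it has to serve:

```python
import re

# Fold Turkish diacritics to ASCII so "çalışıyor" matches "calisiyor".
TR_FOLD = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosucgiosu")

def tokenize(text: str) -> list[str]:
    """Fold diacritics, lowercase, and keep snake_case identifiers
    and API path segments together as single tokens."""
    text = text.translate(TR_FOLD).lower()
    return re.findall(r"[a-z0-9_/{}-]+", text)

# Feeding the index would then look like (rank_bm25):
#   from rank_bm25 import BM25Okapi
#   bm25 = BM25Okapi([tokenize(doc) for doc in documents])
#   scores = bm25.get_scores(tokenize("projeksiyon nasil calisiyor"))
```

The point of the identifier-friendly regex is that table names like half_product_stock and paths like /api/suppliers/{id}/hard-delete survive as exact, searchable tokens.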

5. CHROMADB VECTOR STORE

Semantic search using BAAI/bge-m3 (multilingual, 1024-dimensional embeddings) stored in a persistent ChromaDB collection with cosine similarity.

The $contains Bug

ChromaDB 1.5.x's $contains operator does NOT perform substring matching despite its name. Discovered this through test failures. Fixed by implementing Python-side substring resolution in _resolve_modules() that converts to $in with exact module names.

1024
EMBEDDING DIMENSIONS
~80ms
SEARCH LATENCY
~2.3GB
EMBEDDING MODEL RAM

6. HYBRID SEARCH & RECIPROCAL RANK FUSION

BM25 catches exact keywords. ChromaDB catches semantic meaning. Neither alone is sufficient. RRF merges their ranked lists into a single ranking without needing to normalize their incompatible score scales.

RRF Formula

score(d) = Σ 1 / (k + rank_i(d))    where k = 60

Documents that appear in both rankings get boosted. Documents that appear in only one still contribute. The k=60 constant (from Cormack et al., 2009) prevents top-1 results from dominating.
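The formula is only a few lines of Python. This sketch fuses two ranked lists of chunk IDs exactly as described:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["c1", "c2", "c3"]    # keyword ranking
vector_top = ["c2", "c4", "c1"]  # semantic ranking
# "c2" ranks high in both lists, so fusion puts it first.
```

Note that no score normalization is needed: RRF only looks at ranks, which is why it merges BM25's unbounded scores and cosine similarities cleanly.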

Retrieval Final Boss Test

A 6-phase test suite validating the full BM25 + ChromaDB pipeline:

PHASE | DESCRIPTION | RESULT
1. BM25 Ground Truth | Known queries must find known chunks | PASS
2. ChromaDB Semantic | Meaning-based queries find relevant docs | PASS
3. Cross-Engine Consistency | Both engines agree on top results | PASS
4. Filter Integrity | Language/module filters work correctly | PASS
5. Edge Cases | Empty queries, special chars, long queries | PASS
6. Latency | Search completes under 500ms | PASS

7. CROSS-ENCODER RERANKER (KILLED)

We built a cross-encoder reranking stage using BAAI/bge-reranker-v2-m3 (~1.1GB). It sees query+document together for more accurate scoring than bi-encoder similarity alone.

Killed after A/B testing showed it consumed over 90% of total retrieval time for a marginal quality improvement. On a fanless MacBook Air, 7 seconds per query is unacceptable.

A/B Test Results

METRIC | WITHOUT RERANKER | WITH RERANKER
Retrieval latency | ~150ms | ~7,000ms
Quality (manual eval) | Good | Slightly better
RAM usage | ~2.3GB | ~3.4GB (+1.1GB)

The reranker code remains in retrieve/reranker.py for future use on a server with proper cooling. For interactive local use: killed.
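For reference, the shape of the stage that was cut. The scorer is injected here so the sketch runs without loading the ~1.1GB model; with sentence-transformers, the scorer would be CrossEncoder("BAAI/bge-reranker-v2-m3").predict. The overlap_scorer stand-in is purely illustrative:

```python
from typing import Callable, Sequence

def rerank(query: str, chunks: Sequence[str],
           score_pairs: Callable[[list[tuple[str, str]]], list[float]],
           top_k: int = 3) -> list[str]:
    """Cross-encoder reranking: score each (query, chunk) pair jointly,
    then keep the top_k chunks. This is the ~7s stage that was removed."""
    scores = score_pairs([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Dummy scorer for illustration only: word overlap with the query.
def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]
```

The joint scoring is what makes cross-encoders more accurate than bi-encoder similarity, and also what makes them slow: every candidate needs a full forward pass with the query attached.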

8. THE GRAND MODEL TOURNAMENT

Before choosing the generation model, we ran a rigorous tournament: 9 models, 29 test cases, covering hallucination resistance, Turkish language, precision, instruction following, multi-hop reasoning, and adversarial prompts.

Contestants

MODEL | SIZE | PASSED | AVG LATENCY | TOK/S
gemma3:4b | 3.3 GB | 27/29 | 4,831ms | 62
qwen3:4b | 2.6 GB | 26/29 | 3,773ms | 82
phi4-mini | 2.5 GB | 25/29 | 4,226ms | 74
qwen3:1.7b | 1.1 GB | 25/29 | 2,175ms | 110
qwen3:0.6b ← CHOSEN | 522 MB | 24/29 | 1,488ms | 125
qwen2.5:3b | 1.9 GB | 23/29 | 3,432ms | 82
llama3.2:3b | 2.0 GB | 22/29 | 2,861ms | 94
gemma3:1b | 815 MB | 21/29 | 1,220ms | 143
llama3.2:1b | 1.3 GB | 19/29 | 1,306ms | 133

Why qwen3:0.6b?

It passed 24/29 tests at just 522 MB. Every model that scored higher is 2-6× larger and 1.5-3× slower; every model that runs faster scored at least three fewer passes. On fanless hardware, that quality-to-size ratio made it the clear winner.

Test Categories (29 tests)

CATEGORY | COUNT | WHAT IT TESTS
HALL (Hallucination) | 3 | Refuses to answer when context lacks info
TR (Turkish) | 3 | Turkish input/output without special chars
ADV (Adversarial) | 3 | Prompt injection, override attempts
NUM (Precision) | 3 | Exact numbers, counts, specific values
LONG (Long Context) | 3 | Multi-paragraph reasoning
INST (Instructions) | 3 | Format compliance (list, single-line, etc.)
HOP (Multi-hop) | 3 | Chaining facts across context sections
AMB (Ambiguity) | 3 | Contradictory info, missing data handling
CODE (Technical) | 3 | SQL, API path extraction, code understanding
DEG (Degenerate) | 2 | Empty context, gibberish input

9. GENERATION LAYER

The generation layer wraps Ollama's API with thermal-aware settings optimized for the M4 MacBook Air.

LLM Configuration

PARAMETER | VALUE | REASON
model | qwen3:0.6b | Best quality-to-size ratio from tournament
temperature | 0.1 | Low creativity, high accuracy for RAG
num_predict | 800 | Room for thinking tokens + answer
num_ctx | 2048 | We only use ~1,000 input tokens; reduces CPU load
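These settings map directly onto the options field of Ollama's /api/generate endpoint. A sketch of the request, using only the standard library; build_request and generate are illustrative names, not the project's actual wrapper:

```python
import json
import urllib.request

def build_request(prompt: str) -> dict:
    """Request body for Ollama's /api/generate with the settings above."""
    return {
        "model": "qwen3:0.6b",
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.1,   # low creativity, high accuracy
            "num_predict": 800,   # room for <think> tokens + answer
            "num_ctx": 2048,      # small KV cache keeps the fanless M4 cool
        },
    }

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```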

The Thinking Token Discovery

qwen3 models use internal <think>...</think> chains. Ollama strips these from visible output but still counts them in eval_count. With num_predict=400, some answers were empty because all tokens went to thinking.

Solution: increase to 800. The model self-regulates — simple questions use ~150 tokens total, complex ones use ~400. The higher ceiling only activates when needed.

167
AVG TOKENS (SIMPLE Q)
400
AVG TOKENS (COMPLEX Q)
Self-regulates
MODEL STOPS WHEN DONE

10. PROMPT ENGINEERING

Three iterations of system prompt design, each driven by test failures.

Iteration 1: Developer Mode (Abandoned)

"Answer from context only. If not found, say 'Not in context.' Cite sources as [Module > Section]. Be concise."

Problem: Answers were too technical. Users aren't developers.

Iteration 2: Heavy Rules (Failed)

"You are a friendly ERP helper for office workers.
Rules:
1) Explain step-by-step: which screen, which button, what to type.
2) NEVER say: API, endpoint, database, JWT, token, backend...
3) If context mentions '/api/...' translate it..."

Problem: Catastrophic. The 0.6B model couldn't handle the complex system prompt. Three different questions all returned "Go to Suppliers page and click Delete" — the model latched onto a template and repeated it regardless of the question.

Iteration 3: Slim Prompt (Current)

"You help office workers use the ERP system. Use simple language. No technical jargon. If not found, say 'I don't have information about that.'"

Key insight: For a 0.6B model, less instruction = better output. A heavy prompt chokes the tiny brain. A light one lets it use its capacity for actual reasoning.

Context Format

--- [Stock & Inventory > Projeksiyon Data Flow] (en) ---
The core of Projeksiyon lives in ProjeksiyonService.get_projeksiyon().
It reads every non-cancelled Work Card and Material Order...

--- [Hammadde > Supplier Management] (tr) ---
DELETE /api/suppliers/{id}/hard-delete endpoint...

Each chunk gets a lean [Module > Section] (lang) tag. Chunks are ordered by relevance, best first, to counter the model's last-chunk blindness.
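The formatting step above is a one-liner per chunk. A sketch, assuming chunks carry the breadcrumb and language fields from the ingestion schema (build_context is an illustrative name):

```python
def build_context(chunks: list[dict]) -> str:
    """Render retrieved chunks best-first with lean
    [Module > Section] (lang) header tags."""
    blocks = []
    for c in chunks:  # chunks arrive already sorted by RRF score
        header = f"--- [{c['breadcrumb']}] ({c['language']}) ---"
        blocks.append(f"{header}\n{c['text']}")
    return "\n\n".join(blocks)
```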

11. FALLBACK RETRY MECHANISM

When the model says "I don't have information about that," the system automatically retries with 5 chunks instead of 3. This catches cases where the answer lives in chunk #4 or #5.

Detection Patterns

NO_CONTEXT_PATTERNS = [
    "not in context",
    "don't have information",
    "bilgim yok",
    "bağlamda .* yok",
    "bulunamadı",
    ...
]
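Detection is a case-insensitive regex scan over the model's answer. A runnable sketch with the Turkish patterns translated in comments (is_refusal is an illustrative name; the full pattern list is abbreviated above):

```python
import re

NO_CONTEXT_PATTERNS = [
    "not in context",
    "don't have information",
    "bilgim yok",           # "I have no information"
    r"bağlamda .* yok",     # "there is no ... in the context" (regex)
    "bulunamadı",           # "was not found"
]

def is_refusal(answer: str) -> bool:
    """True when the model admitted it couldn't answer,
    which triggers the 5-chunk retry."""
    return any(re.search(p, answer, re.IGNORECASE) for p in NO_CONTEXT_PATTERNS)
```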

Why Not Always Use 5 Chunks?

A/B testing showed that 5 chunks reduces quality for the 0.6B model:

METRIC | 3 CHUNKS | 5 CHUNKS
Correct answers | 9/10 | 8/10
Hallucinations | 0 | 2
Avg speed | 3.7s | 4.3s

More context = more confusion for a tiny model. The irrelevant chunks become noise. With 5 chunks, the model hallucinated on two questions that it answered correctly (or honestly refused) with 3 chunks.

The retry mechanism gives us the best of both worlds: precision-first with 3 chunks, recall as fallback with 5.
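The retry loop itself is only a few lines. Retrieval, generation, and refusal detection are injected here so the sketch runs without the real stack; the function names are illustrative:

```python
from typing import Callable

def answer_with_retry(question: str,
                      retrieve: Callable[[str, int], list[str]],
                      generate: Callable[[str, list[str]], str],
                      is_refusal: Callable[[str], bool]) -> str:
    """Precision-first: answer from 3 chunks; if the model refuses,
    retry once with 5 chunks as the recall fallback."""
    answer = generate(question, retrieve(question, 3))
    if is_refusal(answer):
        answer = generate(question, retrieve(question, 5))
    return answer
```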

12. BENCHMARKS VS OPENAI

Same 10 questions, same 3 chunks of context, same system prompt. Local qwen3:0.6b vs OpenAI cloud models.

MODEL | CORRECT | HALLUCINATIONS | AVG LATENCY | COST/QUERY | OFFLINE
qwen3:0.6b (local) | 8/10 | 1 | 3,491ms | $0 | YES
gpt-4.1-nano | 7/10 | 0 | 2,087ms | ~$0.0002 | NO
gpt-4o-mini | 8/10 | 0 | 2,564ms | ~$0.0002 | NO
gpt-4.1-mini | 8/10 | 0 | 3,039ms | ~$0.0003 | NO

Key Findings

GPT-5 series models (reasoning models with internal chain-of-thought) were tested but required 2000+ completion tokens and 5-10s per query. Not practical for RAG.

13. THERMAL MANAGEMENT

The M4 MacBook Air has passive cooling (no fan). Running the embedding model + LLM + ERP system pushes CPU to 99°C. Three mitigations:

FIX | IMPACT
num_ctx: 2048 (was 32K default) | Massive reduction in KV cache memory and compute
num_predict: 800 (was 2000) | Model generates less = less sustained GPU load
1s sleep between batch queries | Lets passive cooling catch up between requests

For production use, this should run on the ERP server (with proper cooling), not the development laptop.

14. COMPLETE CODEBASE

erp_rag/
├── __init__.py
├── cli.py                          CLI: ask + chat modes
├── config.yaml
├── ingest_cli.py
│
├── generate/                       Generation layer
│   ├── answerer.py                 Full RAG pipeline with retry
│   ├── llm.py                      Ollama LLM abstraction
│   ├── prompts.py                  User-friendly prompt templates
│   ├── test_model_tournament.py    9 models × 29 tests
│   └── test_qwen_stress.py         Stress test suite
│
├── retrieve/                       Retrieval layer
│   ├── hybrid_search.py            BM25 + ChromaDB + RRF
│   ├── pipeline.py                 Full pipeline orchestration
│   └── reranker.py                 Cross-encoder (disabled)
│
├── index/                          Search indices
│   ├── bm25_index.py               BM25Okapi with bilingual tokenizer
│   ├── vector_store.py             ChromaDB + bge-m3 embeddings
│   ├── test_bm25.py
│   ├── test_chroma.py
│   └── test_final_boss_retrieval.py
│
├── ingest/                         Data ingestion
│   ├── chunker.py                  HTML → structured chunks
│   ├── html_parser.py              Section extraction
│   ├── test_data_quality.py
│   ├── test_data_quality_l2.py
│   └── test_final_boss.py          66 ground-truth assertions
│
└── data/                           Persisted data
    ├── tournament_results.json
    ├── chunks/
    │   ├── manifest.json
    │   └── all_chunks.json         309 chunks
    ├── bm25_index.pkl
    └── chroma_db/                  Persistent vector store

15. FINAL RESULTS & FINDINGS

9/10
CORRECT ANSWERS (USER MODE)
0
HALLUCINATIONS (3 CHUNKS)
~3.5s
AVG TOTAL LATENCY
~3GB
TOTAL RAM USAGE

What We Learned

  1. Less is more for small models. A complex system prompt destroys a 0.6B model. A slim one lets it think. 3 chunks beat 5 chunks. Shorter context = fewer hallucinations.
  2. Chain-of-thought is the secret weapon. qwen3:0.6b's internal thinking is why it competes with 3B+ models. Killing it (via /nothink) would cripple quality.
  3. Hybrid search is non-negotiable. BM25 alone misses semantic matches. Vector search alone misses exact keywords. RRF fusion combines them cleanly.
  4. The reranker is a trap. Sounds great in theory. In practice: 90% of pipeline latency for marginal quality gain. Killed.
  5. Retry-on-refusal is cheap and effective. Only triggers ~10% of queries. Adds 3.5s only when needed. Recovers answers that would otherwise be "I don't know."
  6. Local matches cloud. A 522MB model on a laptop ties with GPT-4o-mini on answer quality. The engineering matters more than the model size.
  7. Thermal awareness is real engineering. On fanless hardware, num_ctx and num_predict settings directly affect whether the machine throttles.

What's Next