Fine-tuned Qwen3.5-2B with RAG-Engram architecture improves grounded answer accuracy from 50% to 93% at 8K context

OpenClawRadar · Published: March 27, 2026

Fine-tuning approach for improved RAG performance

A developer has fine-tuned Qwen3.5-2B to address the 'lost in the middle' phenomenon and the hallucinations that small language models show when the context window is filled with roughly 8K tokens of retrieved data. Combined with a custom architecture called RAG-Engram, the model raised the share of correct answers at 8K tokens from 50% to 93% across 14 real-world queries.

Architecture details

RAG-Engram is a two-level system built around Qwen3.5-2B's hybrid Gated DeltaNet architecture:

  • Level 1, Static Engram Table: 135K pre-computed entity embeddings (Indian proper nouns, government schemes, Hindi phrases, financial terms) stored in CPU RAM. This relieves the model's attention of having to reconstruct known entities.
  • Level 2, Dynamic Chunk Navigation: At inference time, a lightweight spaCy extractor (~15MB) scans the retrieved chunks, builds a pointer map of where key entities appear, and generates an attention bias matrix. The bias is added to the Q·K^T scores before softmax at layers 3 and 15 (the full-attention layers in the hybrid architecture; the other 18 layers are Gated DeltaNet layers, which have no softmax attention).

The approach tells attention heads where to look instead of having the model blindly scan 8,000 tokens hoping to find answers.
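
The Level 2 bias step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the developer's implementation: whitespace tokens stand in for the spaCy extractor, and the function names, the `boost` value, the 1/sqrt(d) scaling, and the causal mask are all hypothetical choices made for the example.

```python
import numpy as np

def build_pointer_map(tokens, entities):
    """Map each entity (lowercased) to the token positions where it appears."""
    wanted = {e.lower() for e in entities}
    pointer_map = {}
    for pos, tok in enumerate(tokens):
        key = tok.lower()
        if key in wanted:
            pointer_map.setdefault(key, []).append(pos)
    return pointer_map

def build_attention_bias(seq_len, pointer_map, boost=2.0):
    """Additive bias (seq_len x seq_len): every query row gets `boost`
    on the key columns where an entity was found. `boost` is an assumed value."""
    bias = np.zeros((seq_len, seq_len), dtype=np.float32)
    for positions in pointer_map.values():
        bias[:, positions] += boost
    return bias

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(q, k, v, bias):
    """Single-head causal attention with the bias added to the scores before softmax."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # queries cannot see later tokens
    weights = softmax(scores)
    return weights @ v, weights

# Toy usage: one retrieved sentence, one known entity.
tokens = "PM Kisan pays 6000 rupees yearly".split()
pointer_map = build_pointer_map(tokens, ["Kisan"])
bias = build_attention_bias(len(tokens), pointer_map)
```

With identical query and key vectors, the bias alone shifts attention weight toward the entity position, which is the "tell the heads where to look" effect described above. A real integration would apply the bias only at the full-attention layers.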

Training specifications

  • Base model: Qwen3.5-2B-Base
  • Method: LoRA (r=16, alpha=16) via Unsloth
  • Data: 2,168 examples distilled from DeepSeek V3 across MS MARCO, TyDi QA, NQ Open, MLQA Hindi, IndicQA, Dolly-15K
  • Training time: 15 minutes on Modal (single GPU)
  • Train/val loss: 1.369 / 1.385 (the small gap suggests no significant overfitting)

The supervised fine-tuning teaches the model to answer in a specific conversational style (markdown, bold key insights, source grounding), while the Engram bias handles attention navigation at long contexts.
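
To make the LoRA settings concrete: with r=16 and alpha=16, the standard LoRA scaling factor alpha/r equals 1, so the low-rank update is added to the frozen weight unscaled. The sketch below shows that arithmetic in NumPy. The 1024×1024 layer size is a hypothetical stand-in, not the model's real dimensions, and this is not the Unsloth API.

```python
import numpy as np

def lora_effective_weight(W, A, B, r, alpha):
    """Merged weight of a LoRA-adapted linear layer: W + (alpha / r) * B @ A."""
    scale = alpha / r
    return W + scale * (B @ A)

rng = np.random.default_rng(0)

# Hypothetical layer size for illustration; r and alpha come from the article.
d_out, d_in, r, alpha = 1024, 1024, 16, 16

W = rng.normal(size=(d_out, d_in))          # frozen base weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialised

W_eff = lora_effective_weight(W, A, B, r, alpha)

scale = alpha / r                 # 1.0 for r=16, alpha=16
trainable = A.size + B.size       # parameters LoRA actually updates
fraction = trainable / W.size     # share of the layer's parameters that train
```

Because B starts at zero, the adapted layer initially behaves exactly like the base layer, and only the two small matrices are trained. That is consistent with a short training run on a single GPU.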

Evaluation results

Evaluation was conducted by Claude Opus 4.6 using Google search result chunks padded to 8K tokens:

  • Vanilla Qwen3.5-2B: 50% correct answers at 8K tokens, 14% failures/refusals
  • Drissy (the fine-tuned model) + RAG-Engram: 93% correct answers at 8K tokens, 0% failures/refusals

On this 14-query test, the combination eliminated failures and refusals entirely. The developer reports that the whole project, from spec to Hugging Face release, took about two weeks and cost less than a coffee.
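
Since the evaluation used only 14 queries, it can help to translate the percentages back into counts. The sketch below assumes each reported percentage is a rounded fraction of the same 14 queries. The article does not state the raw counts, so the results are inferred rather than reported.

```python
TOTAL_QUERIES = 14  # from the article

def implied_count(percent, total=TOTAL_QUERIES):
    """Nearest whole number of queries matching a reported percentage."""
    return round(percent / 100 * total)

# Percentages as reported in the article.
reported = {
    "vanilla": {"correct": 50, "failed_or_refused": 14},
    "drissy_rag_engram": {"correct": 93, "failed_or_refused": 0},
}

# Convert each reported percentage to an implied query count.
counts = {
    system: {label: implied_count(pct) for label, pct in results.items()}
    for system, results in reported.items()
}

for system, result in counts.items():
    print(system, result)
```

Under that assumption, the figures correspond to 7 versus 13 correct answers and 2 versus 0 failures or refusals. The article does not say how the remaining answers were classified.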

Model availability

The fine-tuned model is available as:

  • Model: drissea-ai/drissy-qwen3.5-2b
  • GGUF: drissea-ai/drissy-qwen3.5-2b-GGUF

📖 Read the full source: r/LocalLLaMA
