Fine-tuned Qwen3.5-2B with RAG-Engram architecture improves grounded answer accuracy from 50% to 93% at 8K context

Fine-tuning approach for improved RAG performance
A developer has created a fine-tuned version of Qwen3.5-2B that addresses the 'lost in the middle' phenomenon and hallucinations in small language models when context windows are saturated with approximately 8K tokens of retrieved data. The custom architecture, called RAG-Engram, improved correct answers at 8K tokens from 50% to 93% across 14 real-world queries.
Architecture details
RAG-Engram is a two-level system built around Qwen3.5-2B's hybrid Gated DeltaNet architecture:
- Level 1 — Static Engram Table: 135K pre-computed entity embeddings (Indian proper nouns, government schemes, Hindi phrases, financial terms) stored in CPU RAM. This frees up the model's attention from having to reconstruct known entities.
- Level 2 — Dynamic Chunk Navigation: At inference time, a lightweight spaCy extractor (~15MB) scans the retrieved chunks, builds a pointer map of where key entities appear, and generates an attention bias matrix. The bias is added to the Q·K^T scores before softmax at layers 3 and 15 (the full-attention layers in the hybrid architecture; the other 18 layers are Gated DeltaNet, which has no softmax attention).
The approach tells attention heads where to look instead of having the model blindly scan 8,000 tokens hoping to find answers.
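The Level 2 step can be illustrated with a minimal sketch. This is not the developer's code: a plain whitespace tokenizer and exact-match lookup stand in for the spaCy extractor, and the function names, the toy sentence, and the `boost` value are all assumptions made for illustration. It only shows the general idea of adding a bias to raw attention scores before softmax.

```python
import math

def build_pointer_map(tokens, entities):
    """Map each entity to the token positions where it appears (case-insensitive)."""
    wanted = {e.lower() for e in entities}
    pointers = {}
    for pos, tok in enumerate(tokens):
        key = tok.lower()
        if key in wanted:
            pointers.setdefault(key, []).append(pos)
    return pointers

def build_bias(num_queries, num_keys, pointers, boost=2.0):
    """Additive bias: `boost` (hypothetical value) on entity positions, 0 elsewhere."""
    hot = {p for positions in pointers.values() for p in positions}
    row = [boost if k in hot else 0.0 for k in range(num_keys)]
    return [row[:] for _ in range(num_queries)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(scores, bias):
    """Add the bias to raw Q.K^T scores, then apply a row-wise softmax."""
    return [softmax([s + b for s, b in zip(srow, brow)])
            for srow, brow in zip(scores, bias)]

# Toy input: one query row with uniform raw scores over nine key tokens.
tokens = "chunk one mentions Delhi and chunk two mentions Mumbai".split()
pointers = build_pointer_map(tokens, ["Delhi", "Mumbai"])
scores = [[0.0] * len(tokens)]
weights = biased_attention(scores, build_bias(1, len(tokens), pointers))
```

With uniform raw scores, the positions holding "Delhi" and "Mumbai" end up with more attention weight than the rest, which is the "where to look" effect described above.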
Training specifications
- Base model: Qwen3.5-2B-Base
- Method: LoRA (r=16, alpha=16) via Unsloth
- Data: 2,168 examples distilled from DeepSeek V3 across MS MARCO, TyDi QA, NQ Open, MLQA Hindi, IndicQA, Dolly-15K
- Training time: 15 minutes on Modal (single GPU)
- Train/Val loss: 1.369 / 1.385 (the small gap suggests no overfitting)
The supervised fine-tuning teaches the model to answer in a specific conversational style (markdown, bold key insights, source grounding), while the Engram bias handles attention navigation at long contexts.
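For readers unfamiliar with the LoRA settings, here is a small sketch of what r=16 and alpha=16 imply under the standard LoRA formulation, where the low-rank update is scaled by alpha / r. It assumes the plain (not rank-stabilized) scaling, and the 2048 x 2048 layer shape is a hypothetical example, not a dimension taken from the source.

```python
def lora_scaling(alpha, r):
    """Standard LoRA scales the low-rank update B @ A by alpha / r."""
    return alpha / r

def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

scale = lora_scaling(alpha=16, r=16)

# Hypothetical square projection, used only to show the order of magnitude.
adapter_params = lora_param_count(d_in=2048, d_out=2048, r=16)
full_params = 2048 * 2048
```

With alpha equal to r the scaling factor is 1, so the adapter output is added to the frozen weights unscaled. For the hypothetical layer above, the adapter trains 65,536 parameters against 4,194,304 in the full matrix.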
Evaluation results
Evaluation was conducted by Claude Opus 4.6 using Google search result chunks padded to 8K tokens:
- Vanilla Qwen3.5-2B: 50% correct answers at 8K tokens, 14% failures/refusals
- Drissy (the fine-tuned model) + RAG-Engram: 93% correct answers at 8K tokens, 0% failures/refusals
The combination eliminated 'lost in the middle' failures completely. The developer reports the entire project from spec to HuggingFace took about 2 weeks and cost less than a coffee.
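Because the evaluation set is small, it helps to convert the percentages back into query counts. The sketch below assumes every reported percentage is measured over the same 14 queries mentioned earlier; the source does not state this explicitly.

```python
TOTAL_QUERIES = 14  # from the article; assumed to be the denominator throughout

def pct_to_count(pct, total=TOTAL_QUERIES):
    """Nearest whole number of queries matching a reported percentage."""
    return round(pct / 100 * total)

def count_to_pct(count, total=TOTAL_QUERIES):
    """Percentage, rounded to a whole number, for a given query count."""
    return round(100 * count / total)

baseline_correct = pct_to_count(50)    # vanilla model, correct answers
baseline_failures = pct_to_count(14)   # vanilla model, failures/refusals
tuned_correct = pct_to_count(93)       # fine-tuned model with RAG-Engram
```

Under that assumption, the jump from 50% to 93% is a move from 7 to 13 correct answers out of 14, and the 14% failure rate is 2 queries, so each query shifts the score by about 7 percentage points.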
Model availability
The fine-tuned model is available as:
- Model: drissea-ai/drissy-qwen3.5-2b
- GGUF: drissea-ai/drissy-qwen3.5-2b-GGUF
📖 Read the full source: r/LocalLLaMA