Toroidal Logit Bias: Simple Inference-Time Trick Reduces Hallucination by Up to 40%

Researchers have developed a simple logit bias method that reduces factual hallucination without fine-tuning or RAG. The technique can be applied to any local model at inference time.
How It Works
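The method maps token IDs to a 12x12 torus (a donut-shaped surface, so positions wrap around at the edges), then boosts the logits of tokens that are "near" recent tokens in that toroidal space. Only the first N token IDs in the vocabulary are biased, with N between roughly 1,000 and 3,000 depending on the model; applying the bias to the full vocabulary degrades performance.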
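The description above can be turned into a minimal sketch. The article does not spell out the ID-to-torus mapping, the distance metric, or how many recent tokens count, so the modulo mapping, the wrap-around Manhattan distance, and the `window` parameter below are all assumptions made for illustration, not the researchers' actual code.

```python
GRID = 12  # torus side length (12x12, as stated in the article)


def torus_coords(token_id, grid=GRID):
    """ASSUMED mapping: wrap the token ID onto a grid x grid torus."""
    cell = token_id % (grid * grid)
    return cell // grid, cell % grid


def torus_distance(a, b, grid=GRID):
    """ASSUMED metric: Manhattan distance with wrap-around on both axes."""
    (r1, c1), (r2, c2) = torus_coords(a, grid), torus_coords(b, grid)
    dr, dc = abs(r1 - r2), abs(c1 - c2)
    return min(dr, grid - dr) + min(dc, grid - dc)


def toroidal_bias(logits, recent_tokens, alpha=0.3, radius=2.0,
                  n_biased=1440, window=8):
    """Return a copy of `logits` with `alpha` added to every token ID below
    `n_biased` that lies within `radius` of any of the last `window` tokens.
    `window` is a hypothetical parameter; the article only says "recent"."""
    biased = list(logits)
    recent = recent_tokens[-window:]
    for tok in range(min(n_biased, len(biased))):
        if any(torus_distance(tok, r) <= radius for r in recent):
            biased[tok] += alpha
    return biased


# Usage: a flat 2000-token vocabulary with token 13 as the only recent token.
out = toroidal_bias([0.0] * 2000, recent_tokens=[13])
print(out[14], out[78], out[1453])  # neighbor, far token, beyond n_biased
```

Because of the wrap-around, token 0 and token 11 are neighbors even though their IDs are far apart; that is the only thing separating a torus from a flat 12x12 grid in this sketch.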
Results
- Qwen 2.5-7B: 40% fewer factual errors
- OLMo 1.7-7B: 15.4% fewer factual errors
- TruthfulQA (817 prompts): +6.8% improvement on Qwen
- Performance cost: ~5% slower generation
Implementation
The core logic is approximately 30 lines of Python. Each model requires its own hyperparameters: Qwen works best with alpha=0.3, radius=2.0, N=1440, while OLMo needs alpha=0.2, radius=3.0, N=3000. Here N is the number of leading token IDs that receive the bias.
Demo: huggingface.co/spaces/paraxiom-research/topological-coherence
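One way to keep the per-model settings listed above in a single place is a small preset table. The values come from the article; the class name, the lookup keys, and the reading of alpha as the boost size and radius as the neighborhood size are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToroidalBiasConfig:
    alpha: float   # assumed meaning: size of the logit boost
    radius: float  # assumed meaning: neighborhood size on the torus
    n_biased: int  # N: number of leading token IDs that are biased


# Values as reported in the article; the dictionary keys are hypothetical.
PRESETS = {
    "qwen2.5-7b": ToroidalBiasConfig(alpha=0.3, radius=2.0, n_biased=1440),
    "olmo-1.7-7b": ToroidalBiasConfig(alpha=0.2, radius=3.0, n_biased=3000),
}


def config_for(model_name):
    """Return the preset whose key appears in `model_name` (case-insensitive)."""
    key = model_name.lower()
    for preset_name, cfg in PRESETS.items():
        if preset_name in key:
            return cfg
    raise KeyError(f"no preset for {model_name!r}; tune alpha, radius, N first")


print(config_for("Qwen/Qwen2.5-7B-Instruct"))
```

Raising an error for unknown models, instead of falling back to a default, reflects the article's point that the hyperparameters do not transfer between models.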
Why This Matters
Factual hallucination remains a major obstacle to deploying reliable AI models and agents. A method that improves factual accuracy at inference time, without retraining, could make AI applications more trustworthy across domains ranging from customer service to content generation. The reported gains vary widely by model (40% on Qwen versus 15.4% on OLMo), so the benefit on other models should be measured, not assumed.
Key Takeaways
- Factual errors drop by 40% on Qwen 2.5-7B and by 15.4% on OLMo 1.7-7B.
- The bias is applied at inference time, so no fine-tuning or retrieval setup is required.
- Each model needs its own hyperparameters (alpha, radius, N) for best results.
- The trade-off is speed: generation is about 5% slower.
Getting Started
To try the toroidal logit bias method, start with the code repository on GitHub referenced in the source post. Use the hyperparameters (alpha, radius, N) reported for your model, or tune them yourself if your model is not listed. The bias then slots into an existing inference pipeline as one extra step that adjusts the logits before each token is chosen. For a hands-on look, the demo linked above shows the method in action.
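As a rough picture of where such a step sits in a pipeline, here is a toy greedy decoding loop with an optional logit hook. Everything in it is a stand-in: `toy_model` replaces a real forward pass and `toy_bias` replaces the toroidal bias, so only the placement of the hook carries over to a real setup.

```python
def greedy_decode(next_logits, prompt_ids, max_new_tokens, bias_fn=None):
    """Greedy decoding; `bias_fn(logits, ids)` may adjust logits each step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        if bias_fn is not None:
            logits = bias_fn(logits, ids)  # the inference-time bias goes here
        ids.append(max(range(len(logits)), key=logits.__getitem__))
    return ids


def toy_model(ids):
    """Hypothetical stand-in for a model: fixed logits over 4 tokens."""
    return [0.0, 1.0, 0.9, 0.0]


def toy_bias(logits, ids):
    """Hypothetical stand-in for the toroidal bias: boost token 2 by 0.3."""
    out = list(logits)
    out[2] += 0.3
    return out


print(greedy_decode(toy_model, [0], 2))                    # unbiased
print(greedy_decode(toy_model, [0], 2, bias_fn=toy_bias))  # biased
```

In Hugging Face Transformers, the equivalent hook is a `LogitsProcessor` subclass passed to `generate` through the `logits_processor` argument; whether the original code uses that route is not stated in the article.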
📖 Read the full source: r/LocalLLaMA