Reasoning Guard: Proxy-Level Loop Detection for Local LLM Inference

A developer running Qwen3.6 MoE behind a vLLM proxy hit a common reliability issue: runaway reasoning loops where the model repeats itself inside a reasoning block, burning tokens and stalling agents. At 180+ tokens/sec, even a 20–30 second loop wastes GPU time and blocks client requests. They built a lightweight guard that lives in the proxy layer and enforces deterministic checks on the streaming output before it reaches the client.
Architecture
Client → Proxy → vLLM → Model
The proxy intercepts the streaming response as it leaves vLLM. It does not modify model weights, call a second LLM, or use embeddings or semantic analysis. All checks are cheap and deterministic.
What It Checks
- Reasoning token caps (configurable per effort level)
- Repeated paragraph detection
- Sliding-window n-gram repetition
- Repeated sentence fingerprinting
- Fuzzy opening-pattern detection (catches loops like “Actually, I think I’ve found it…”)
- Cut-and-continue recovery path
Recovery Flow
When the guard triggers, it:
- Stops the upstream stream
- Captures the reasoning produced so far
- Reissues the request with that reasoning baked in as prior assistant context
- Disables thinking for the continuation
- Merges phase 1 and phase 2 usage stats
Because vLLM prefix caching is already active, the continuation is effectively seamless. Phase 2 usually resumes with ~50–100ms TTFT, so the client sees reasoning flow directly into the final answer instead of hanging.
Observability
The proxy logs each trigger with:
- Whether the guard fired
- Trigger reason
- Token cap used
- Reasoning token count
- Merged total usage
- Stream-end metadata
Result
Before: occasional 2000+ token reasoning blocks that went nowhere. After: the model still reasons when useful, but runaway thinking gets cut and redirected into an answer. The author describes it as a “proxy-level seatbelt for local LLM inference.”
No model surgery, no extra LLM calls — just stream interception, token counting, loop detection, and a clean recovery path. The guard has been validated end-to-end through the live proxy against real trace logs.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Indie Developer Unveils 'Ideanator' CLI Tool for Structuring Vague Ideas with Local LLMs
Ideanator is a CLI tool designed by a self-taught 19-year-old developer using local LLMs like Ollama/MLX. It structures vague ideas into well-defined concepts, completely offline.

ARP: Stateless WebSocket Relay for Autonomous Agent Communication
ARP (Agent Relay Protocol) is a stateless WebSocket relay for autonomous agent communication featuring Ed25519 identity, HPKE encryption per RFC 9180, binary TLV framing, and 33 bytes overhead per message. No accounts or registration required—just generate a keypair and connect.

Claude Code: How to Connect Your AI-Built Frontend to a Real Backend
Claude Code builds polished frontends but often uses hardcoded data. Here are four ways to connect it to real backends: raw APIs, SDKs, CLIs, and MCP.

ddash: Mermaid Diagram Tool with URL-Based Storage and Claude Code Integration
ddash is a free Mermaid diagram tool where the entire diagram is compressed into the URL hash, requiring no backend, accounts, or storage. It includes a Claude Code skill that lets you generate and open diagrams directly during conversations with commands like /diagram the auth flow.