LLM Reasoning Loop Detection: Proxy-Level Guard Recovery

A developer running Qwen3.6 MoE behind a vLLM proxy hit a common reliability issue: runaway reasoning loops where the model repeats itself inside a reasoning block, burning tokens and stalling agents. At 180+ tokens/sec, even a 20–30 second loop wastes GPU time and blocks client requests. They built a lightweight guard that lives in the proxy layer and enforces deterministic checks on the streaming output before it reaches the client.

Architecture

Client → Proxy → vLLM → Model

The proxy intercepts the streaming response as it leaves vLLM. It does not modify model weights, call a second LLM, or use embeddings or semantic analysis. All checks are cheap and deterministic.

What It Checks

Reasoning token caps (configurable per effort level)
Repeated paragraph detection
Sliding-window n-gram repetition
Repeated sentence fingerprinting
Fuzzy opening-pattern detection (catches loops like “Actually, I think I’ve found it…”)
Cut-and-continue recovery path

Recovery Flow

When the guard triggers, it:

Stops the upstream stream
Captures the reasoning produced so far
Reissues the request with that reasoning baked in as prior assistant context
Disables thinking for the continuation
Merges phase 1 and phase 2 usage stats

Because vLLM prefix caching is already active, the continuation is effectively seamless. Phase 2 usually resumes with ~50–100ms TTFT, so the client sees reasoning flow directly into the final answer instead of hanging.

Observability

The proxy logs each trigger with:

Whether the guard fired
Trigger reason
Token cap used
Reasoning token count
Merged total usage
Stream-end metadata

Result

Before: occasional 2000+ token reasoning blocks that went nowhere. After: the model still reasons when useful, but runaway thinking gets cut and redirected into an answer. The author describes it as a “proxy-level seatbelt for local LLM inference.”

No model surgery, no extra LLM calls — just stream interception, token counting, loop detection, and a clean recovery path. The guard has been validated end-to-end through the live proxy against real trace logs.

📖 Read the full source: r/LocalLLaMA