Local Claude Code Setup with Qwen3.5 27B via llama.cpp

✍️ OpenClawRadar | 📅 Published: April 14, 2026 | 🔗 Source

Local Claude Code Configuration

A developer documented a setup for running Claude Code fully offline against a local LLM served by llama.cpp. The model is Qwen3.5 27B in Unsloth's UD-Q4_K_XL quantization, running on Arch Linux with AMD Strix Halo hardware.

Environment Configuration

To point Claude Code at the local server, disable telemetry, and keep it fully offline, the following environment variables were set in ~/.bashrc:

export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_API_KEY="not-set"
export ANTHROPIC_AUTH_TOKEN="not-set"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ENABLE_TELEMETRY=0
export DISABLE_AUTOUPDATER=1
export DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=32768

The developer noted that setting these values in .claude/settings.json is more stable and easier to control than exporting them as environment variables.
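As an illustration of the settings-file approach, here is a minimal Python sketch that renders the same variables as a JSON document. It assumes the settings file accepts a top-level "env" object mapping variable names to string values. The article does not show the developer's actual file, so that layout and the names OFFLINE_ENV and build_settings are illustrative only.

```python
import json

# The variables exported in ~/.bashrc above, kept as strings because
# environment variables are always strings.
OFFLINE_ENV = {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
    "ANTHROPIC_API_KEY": "not-set",
    "ANTHROPIC_AUTH_TOKEN": "not-set",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_ENABLE_TELEMETRY": "0",
    "DISABLE_AUTOUPDATER": "1",
    "DISABLE_TELEMETRY": "1",
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "4096",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "32768",
}


def build_settings(env: dict[str, str]) -> str:
    """Render a settings document with the variables under an "env" key.

    The top-level "env" layout is an assumption, not taken from the article.
    """
    return json.dumps({"env": env}, indent=2)


if __name__ == "__main__":
    # Print the JSON so it can be reviewed before being saved by hand.
    print(build_settings(OFFLINE_ENV))
```

The script only prints the JSON rather than writing to disk, so an existing settings file is never overwritten by accident.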

llama.cpp Server Configuration

The llama.cpp server was launched with these parameters:

ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
--model models/Qwen3.5-27B-Q4_K_M.gguf \
--alias "qwen3.5-27b" \
--port 8001 --ctx-size 65536 --n-gpu-layers 999 \
--flash-attn on --jinja --threads 8 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
--cache-type-k q8_0 --cache-type-v q8_0

The ROCBLAS_USE_HIPBLASLT=1 environment variable was required on Strix Halo hardware, and the developer stressed researching your own hardware so the llama.cpp setup can be tuned for it.
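Before starting Claude Code, it can help to confirm that the server on port 8001 has finished loading the model. The sketch below polls llama-server's /health endpoint, which is expected to answer 200 once the model is ready and 503 while it is still loading. The helper names fetch_status and server_ready are hypothetical and not part of the original setup.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example 503 while loading) arrive as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        # Nothing is listening, or the connection timed out.
        return 0


def server_ready(base_url: str = "http://127.0.0.1:8001", fetch=fetch_status) -> bool:
    """True when the server's /health endpoint answers with HTTP 200."""
    return fetch(base_url.rstrip("/") + "/health") == 200


if __name__ == "__main__":
    print("ready" if server_ready() else "not ready")
```

The fetch parameter is injectable so the readiness logic can be exercised without a running server.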

Performance Benchmarks

Seven runs were conducted with the following results:

  • Run 1 (File operations): 1m44s, 9.71 t/s, 23K context, correct output
  • Run 2 (Git clone + code read): 2m31s, 9.56 t/s, 32.5K context, excellent quality
  • Run 3 (7-day plan + guide): 4m57s, 8.37 t/s, 37.9K context, excellent quality
  • Run 4 (Skills assessment): 4m36s, 8.46 t/s, 40K context, very good quality (web search broken)
  • Run 5 (Write Python script): 10m25s, 7.54 t/s, 60.4K context, good quality (7/10)
  • Run 6 (Code review + fix): 9m29s, 7.42 t/s, 65,535-token context (crashed), very good quality (8.5/10)
  • Run 7 (/compact command): ~10m, ~8.07 t/s, 66,680-token context (failed), quality N/A
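To make the trend in these runs easier to see, the short Python sketch below computes each run's slowdown relative to run 1. The figures are copied from the list above. Run 7 is left out because its numbers are approximate and the run failed. The names RUNS and slowdown_pct are illustrative.

```python
# (run number, reported context size in tokens, tokens per second)
RUNS = [
    (1, 23_000, 9.71),
    (2, 32_500, 9.56),
    (3, 37_900, 8.37),
    (4, 40_000, 8.46),
    (5, 60_400, 7.54),
    (6, 65_535, 7.42),
]


def slowdown_pct(baseline_tps: float, tps: float) -> float:
    """Percentage drop in generation speed relative to the baseline."""
    return (baseline_tps - tps) / baseline_tps * 100


if __name__ == "__main__":
    baseline = RUNS[0][2]
    for run, ctx, tps in RUNS:
        drop = slowdown_pct(baseline, tps)
        print(f"Run {run}: {ctx:>6} ctx, {tps:.2f} t/s, {drop:5.1f}% slower than run 1")
```

Note that these are per-run averages from a single session, so the percentages describe this one setup rather than a general scaling law.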

Key Findings

  • Generation speed degraded approximately 24% across the context range: from 9.71 t/s at 23K context down to 7.42 t/s at 65K context
  • The Claude Code system prompt consumes 22,870 tokens (35% of the 65K budget)
  • Auto-compaction was completely broken: Claude Code assumed a 200K context window, which put its 95% threshold at 190K, while the server's 65K limit was reached at only 33% of that assumed window
  • The /compact command needs output headroom: with max output capped at 4096 tokens the compaction summary could not fit, and it needed 16K+ tokens
  • Web search functionality is broken without Anthropic connectivity; potential solutions include SearXNG via MCP
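The context-budget arithmetic behind these findings can be checked with a few lines of Python. This is a rough sketch: the constants come from the article, but the helper names are hypothetical, and treating the output reserve as a simple subtraction from the window is a simplifying assumption rather than a description of how Claude Code or llama.cpp account for tokens internally.

```python
ASSUMED_WINDOW = 200_000   # context window Claude Code assumed
SERVER_WINDOW = 65_536     # --ctx-size passed to llama-server
SYSTEM_PROMPT = 22_870     # tokens used by the Claude Code system prompt
COMPACT_TRIGGER_PCT = 95   # auto-compaction threshold, in percent


def compact_threshold(window: int, pct: int = COMPACT_TRIGGER_PCT) -> int:
    """Token count at which auto-compaction would trigger for a window."""
    return window * pct // 100


def usable_tokens(window: int, system_prompt: int, output_reserve: int) -> int:
    """Tokens left for conversation after the system prompt and output reserve."""
    return window - system_prompt - output_reserve


if __name__ == "__main__":
    print("threshold Claude Code used:", compact_threshold(ASSUMED_WINDOW))
    print("threshold matching the server:", compact_threshold(SERVER_WINDOW))
    print(f"system prompt share: {SYSTEM_PROMPT / SERVER_WINDOW:.0%}")
    print(f"server window vs assumed: {SERVER_WINDOW / ASSUMED_WINDOW:.0%}")
    print("usable with 4K output reserve:", usable_tokens(SERVER_WINDOW, SYSTEM_PROMPT, 4_096))
    print("usable with 16K output reserve:", usable_tokens(SERVER_WINDOW, SYSTEM_PROMPT, 16_384))
```

Under these assumptions, reserving 16K tokens for a compaction summary leaves roughly 26K tokens of working context on a 65K server, which shows how tight the budget is once the system prompt is counted.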

📖 Read the full source: r/LocalLLaMA
