Local Claude Code Setup with Qwen3.5 27B via llama.cpp

Local Claude Code Configuration
A developer documented their setup for running Claude Code completely offline using a local LLM with llama.cpp. The system uses Qwen3.5 27B quantized with unsloth/UD-Q4_K_XL on Arch Linux with Strix Halo hardware.
Environment Configuration
To disable telemetry and make Claude Code fully offline, the following environment variables were set in ~/.bashrc:
export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_API_KEY="not-set"
export ANTHROPIC_AUTH_TOKEN="not-set"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ENABLE_TELEMETRY=0
export DISABLE_AUTOUPDATER=1
export DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=32768
The developer noted that configuring these values in .claude/settings.json is more stable and easier to control than setting environment variables.
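As a rough illustration of the settings-file approach, the sketch below renders the same variables as JSON. It assumes that Claude Code reads environment overrides from an `env` object in settings.json; the helper name `render_settings` is hypothetical, and the source does not show the developer's actual file.

```python
import json

# Values copied from the exports above. The "env" key is an assumption
# about how settings.json passes environment variables to Claude Code.
OFFLINE_ENV = {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
    "ANTHROPIC_API_KEY": "not-set",
    "ANTHROPIC_AUTH_TOKEN": "not-set",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_ENABLE_TELEMETRY": "0",
    "DISABLE_AUTOUPDATER": "1",
    "DISABLE_TELEMETRY": "1",
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "4096",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "32768",
}


def render_settings(env: dict) -> str:
    """Return settings.json text with every value serialized as a string."""
    return json.dumps({"env": {k: str(v) for k, v in env.items()}}, indent=2)


if __name__ == "__main__":
    # Print the JSON so it can be reviewed before being saved by hand.
    print(render_settings(OFFLINE_ENV))
```

Printing rather than writing the file keeps the sketch from overwriting an existing configuration.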
llama.cpp Server Configuration
The llama.cpp server was launched with these parameters:
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  --model models/Qwen3.5-27B-Q4_K_M.gguf \
  --alias "qwen3.5-27b" \
  --port 8001 --ctx-size 65536 --n-gpu-layers 999 \
  --flash-attn on --jinja --threads 8 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --cache-type-k q8_0 --cache-type-v q8_0
Setting the ROCBLAS_USE_HIPBLASLT=1 environment variable was required on Strix Halo hardware. The developer stressed researching your specific hardware and tuning the llama.cpp setup for it.
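Before pointing Claude Code at the local endpoint, it can help to confirm the server has finished loading the model. The following is a minimal sketch, assuming llama-server answers on its `/health` endpoint at the port used above; the function names `health_url` and `server_ready` are illustrative and not part of the original write-up.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health-check URL from the server's base URL."""
    return base_url.rstrip("/") + "/health"


def server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False


if __name__ == "__main__":
    base = "http://127.0.0.1:8001"  # same address as ANTHROPIC_BASE_URL
    print("ready" if server_ready(base) else "not ready")
```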
Performance Benchmarks
Seven runs were conducted with the following results:
- Run 1 (File operations): 1m44s, 9.71 tokens/second, 23K context, correct output
- Run 2 (Git clone + code read): 2m31s, 9.56 t/s, 32.5K context, excellent quality
- Run 3 (7-day plan + guide): 4m57s, 8.37 t/s, 37.9K context, excellent quality
- Run 4 (Skills assessment): 4m36s, 8.46 t/s, 40K context, very good quality (web search broken)
- Run 5 (Write Python script): 10m25s, 7.54 t/s, 60.4K context, good quality (7/10)
- Run 6 (Code review + fix): 9m29s, 7.42 t/s, 65,535 context (CRASH), very good quality (8.5/10)
- Run 7 (/compact command): ~10m, ~8.07 t/s, 66,680 context (failed), N/A quality
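To put a number on how speed falls as context grows, the runs above can be fitted with a straight line. This is only a sketch: it uses the six runs with exact figures (Run 7 is omitted because its values are approximate), and a linear fit over six points is a rough summary rather than a model of the hardware.

```python
# (context in thousands of tokens, generation speed in tokens/second)
# taken from Runs 1-6 above.
RUNS = [
    (23.0, 9.71),
    (32.5, 9.56),
    (37.9, 8.37),
    (40.0, 8.46),
    (60.4, 7.54),
    (65.535, 7.42),
]


def least_squares(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    slope, intercept = least_squares(RUNS)
    # slope is in tokens/second per 1K tokens of context
    print(f"speed changes by {slope * 10:.2f} t/s per 10K tokens of context")
    print(f"extrapolated speed at empty context: {intercept:.2f} t/s")
```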
Key Findings
- Generation speed degraded approximately 24% across the context range: from 9.71 t/s at 23K context down to 7.42 t/s at 65K context
- Claude Code system prompt consumes 22,870 tokens (35% of the 65K budget)
- Auto-compaction was completely broken: Claude Code assumed a 200K context window, so its 95% compaction threshold sat at 190K tokens. The server's 65K limit was reached at about 33% of the window Claude Code believed it had, long before compaction could trigger
- The /compact command needs output headroom: with a 4096-token output cap the compaction summary could not fit; it required 16K+ output tokens
- Web search functionality is broken without Anthropic connectivity; potential solutions include SearXNG via MCP
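The percentages in these findings follow from simple arithmetic, reproduced in the sketch below. The constants come from the text above; `compact_trigger` is a hypothetical helper showing one way to choose a compaction point that leaves room for the summary, not a description of Claude Code's internal logic.

```python
CTX_SIZE = 65_536              # --ctx-size passed to llama-server
SYSTEM_PROMPT_TOKENS = 22_870  # measured size of the Claude Code system prompt
ASSUMED_WINDOW = 200_000       # window Claude Code assumed
COMPACT_PERCENT = 95           # auto-compaction threshold, in percent


def pct(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def compact_trigger(ctx_size: int, summary_headroom: int) -> int:
    """Hypothetical: latest context size at which a summary still fits."""
    return ctx_size - summary_headroom


speed_drop = pct(9.71 - 7.42, 9.71)                        # Run 1 vs Run 6
prompt_share = pct(SYSTEM_PROMPT_TOKENS, CTX_SIZE)         # share of the real window
assumed_trigger = ASSUMED_WINDOW * COMPACT_PERCENT // 100  # tokens
real_limit_share = pct(CTX_SIZE, ASSUMED_WINDOW)           # where the server gives out

if __name__ == "__main__":
    print(f"speed drop: {speed_drop:.1f}%")
    print(f"system prompt share: {prompt_share:.1f}%")
    print(f"assumed compaction trigger: {assumed_trigger} tokens")
    print(f"real limit as share of assumed window: {real_limit_share:.1f}%")
    print(f"trigger leaving 16K for the summary: {compact_trigger(CTX_SIZE, 16_384)}")
```

With 16K reserved for the summary, compaction would have to start well below the server's limit, which is why the 4096-token output cap was not enough.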
📖 Read the full source: r/LocalLLaMA