Qwen 3.5 35B Running on 8GB VRAM with llama.cpp Configuration

Local Qwen 3.5 35B Setup on Limited VRAM
A developer on r/LocalLLaMA detailed their configuration for running the Qwen 3.5 35B model locally on hardware with 8GB of VRAM. They moved from using Antigravity (with a Google AI Pro plan) to local LLMs after hitting limits with the cloud service.
Hardware and Model Specifications
The setup uses a Lenovo Legion laptop with an i9-14900HX CPU (E-cores disabled in BIOS), 32GB of DDR5 RAM, and an RTX 4060m GPU with 8GB of VRAM. The specific model is Qwen 3.5 35B A3B Heretic Opus (Q4_K_M GGUF).
Performance and llama.cpp Configuration
The developer reports approximately 700 tokens per second for prompt processing and 42 tokens per second for token generation with this setup. They shared the llama.cpp command-line arguments they settled on after testing:
-ngl 99 ^
--n-cpu-moe 40 ^
-c 192000 ^
-t 12 ^
-tb 16 ^
-b 4096 ^
--ubatch-size 2048 ^
--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--mlock
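The post lists only the arguments, not the full invocation. As a rough sketch, the same flags could be assembled into a complete command from Python. The flag values are copied from the post. The binary name (`llama-server`), the `-m` model argument, the model filename, and the helper `build_command` are assumptions added for illustration and are not stated in the source.

```python
import shlex

# Flag values copied from the post. The short descriptions are a general
# reading of llama.cpp options, not explanations given by the developer.
FLAGS = [
    ("-ngl", "99"),             # offload up to 99 layers to the GPU
    ("--n-cpu-moe", "40"),      # keep MoE expert weights of 40 layers on the CPU
    ("-c", "192000"),           # context size in tokens
    ("-t", "12"),               # threads for generation
    ("-tb", "16"),              # threads for batch / prompt processing
    ("-b", "4096"),             # logical batch size
    ("--ubatch-size", "2048"),  # physical (micro) batch size
    ("--flash-attn", "on"),     # enable flash attention
    ("--cache-type-k", "q8_0"), # quantize the K cache to q8_0
    ("--cache-type-v", "q8_0"), # quantize the V cache to q8_0
    ("--mlock", None),          # lock the model in RAM (flag takes no value)
]


def build_command(binary, model_path):
    """Return the full argument list for a llama.cpp invocation.

    `binary` and `model_path` are placeholders supplied by the caller;
    the post does not say which executable or file path was used.
    """
    cmd = [binary, "-m", model_path]
    for flag, value in FLAGS:
        cmd.append(flag)
        if value is not None:
            cmd.append(value)
    return cmd


if __name__ == "__main__":
    # Hypothetical binary and model filename, for illustration only.
    print(shlex.join(build_command("llama-server", "model.gguf")))
```

Keeping the flags in a list makes it easy to change one value at a time (for example `--n-cpu-moe` or `-c`) when testing what fits in 8GB of VRAM, and the resulting list can be passed directly to `subprocess.run`.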
Workflow Integration
For their agentic workflow, they found Cline in VSCode to be the closest alternative to Antigravity. They use kat-coder-pro for Plan mode and qwen3.5 for Act mode within this setup. The developer is seeking feedback on whether this local configuration is better than sticking with Google Gemini 3 Flash in Antigravity, noting they prioritize smooth workflow over privacy concerns.
📖 Read the full source: r/LocalLLaMA