Qwen 3.5 35B Running on 8GB VRAM with llama.cpp Configuration

✍️ OpenClawRadar📅 Published: March 27, 2026🔗 Source
Qwen 3.5 35B Running on 8GB VRAM with llama.cpp Configuration
Ad

Local Qwen 3.5 35B Setup on Limited VRAM

A developer on r/LocalLLaMA detailed their configuration for running the Qwen 3.5 35B model locally on hardware with 8GB of VRAM. They moved from using Antigravity (with a Google AI Pro plan) to local LLMs after hitting limits with the cloud service.

Hardware and Model Specifications

The setup uses a Lenovo Legion laptop with an i9-14900HX CPU (with E-cores disabled in BIOS, 32GB DDR5 RAM) and an RTX 4060m GPU with 8GB VRAM. The specific model is Qwen 3.5 35B A3B Heretic Opus (Q4_K_M GGUF).

Performance and llama.cpp Configuration

The developer reports getting approximately 700 tokens per second for prompt processing and 42 tokens per second for token generation with this setup. They provided their llama.cpp command-line arguments after testing:

-ngl 99 ^
--n-cpu-moe 40 ^
-c 192000 ^
-t 12 ^
-tb 16 ^
-b 4096 ^
--ubatch-size 2048 ^
--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--mlock
Ad

Workflow Integration

For their agentic workflow, they found Cline in VSCode to be the closest alternative to Antigravity. They use kat-coder-pro for Plan mode and qwen3.5 for Act mode within this setup. The developer is seeking feedback on whether this local configuration is better than sticking with Google Gemini 3 Flash in Antigravity, noting they prioritize smooth workflow over privacy concerns.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also