Running Gemma 4 as a Local Autonomous Agent with Claude Code on 16GB VRAM

Local Agent Setup with Gemma 4 and Claude Code
A developer documented their process of replacing Anthropic's Claude API with a local 31-billion parameter Gemma 4 model to create an autonomous coding agent with full shell access via Claude Code CLI. The goal was to enable the local LLM to not just write code in chat but autonomously interact with the terminal, create folders, read structures, and act as a proactive development agent.
Hardware and Software Stack
- OS: Windows 11
- CPU & RAM: Intel Core Ultra 9 285K CPU with 64GB system RAM
- GPUs: NVIDIA RTX 4060 (8GB) + NVIDIA RTX 3050 (8GB) = 16GB total VRAM
- Core Model: google_gemma-4-31B-it (GGUF V3)
- Software Stack:
- llama.cpp (llama-server) - latest b8672 build
- Claude Code CLI - v2.1.92
- LiteLLM + custom Python gateway (agent_router.py) to bridge Anthropic streaming chunks to OpenAI APIs
Problem 1: Tool Call Parsing Failures
Initially, Gemma 4 refused to execute tools through the custom API routing, defaulting to apologies rather than action. When forced to output system tool calls natively, Claude Code CLI threw TypeScript errors: Cannot read properties of undefined (reading 'input_tokens').
The Fix: Gemma 4 uses an invisible <thought> reasoning block before finalizing output. The agent_router.py script was expecting traditional continuous text chunks, causing it to skip sending the mandatory initial message_start Anthropic event. The developer modified the Python interception loop to explicitly extract and combine reasoning_content with standard outputs, ensuring the stream always initialized with full usage metrics. Upgrading to llama.cpp build b8672 was mandatory for proper tokenizer functionality.
Problem 2: Context Window Limitations
Claude Code v2.1.92 operates with a massive system prompt that embeds the active folder tree and system instructions, dumping 7,182 tokens into the local server upon initialization. The initial n_ctx (context window) was capped at 4096 to save VRAM, causing immediate server crashes.
The Solution: The context window was doubled to 16,384 to accommodate the initial prompt and conversation history.
Problem 3: VRAM Allocation Challenges
With a 16K context window for a 31B model, VRAM allocation became problematic. A 16K context window using default settings requires approximately 6.4 GB of KV Cache alone. Windows WDDM overhead reserves roughly 20% of GPU memory for display/background buffers, leaving only ~12.8 GB accessible out of 16GB total VRAM before CUDA_out_of_memory errors.
The initial calculation showed: Model (13 GB) + KV Cache (6.4 GB) = 19.4 GB, exceeding available VRAM.
Final Configuration
The Math & Solution: The developer abandoned the Q3_K_M model (~13.7GB) and switched to the IQ3_XS format (~12.9GB). The optimized server startup command:
bat.\llm-server\llama-server.exe -m D:\gemma4\google_gemma-4-31B-it-IQ3_XS.gguf -c 16384 -ngl 38 -ctk q8_0 -ctv q8_0 --host 127.0.0.1 --port 8080
Key flags:
-ctk q8_0 -ctv q8_0: 8-bit KV Cache quantization that halved the KV Cache footprint from 6.4 GB-c 16384: 16K context window-ngl 38: Number of GPU layers
This configuration successfully runs Gemma 4 as a local autonomous agent on 16GB VRAM, though the source notes it works "almost" perfectly with some remaining challenges.
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenClaw Agent Architecture Patterns: Multi-Agent Delegation, 5-Layer Memory, and Watchdog Systems
A developer shares practical OpenClaw architecture patterns after 7 weeks of use, including multi-agent delegation with specialized models, a 5-layer memory system with decay, and a watchdog system with three monitoring layers.

RunLobster AI Agent Integrates Business Data for Operational Insights
A developer gave RunLobster root access to their business systems including Stripe, CRM, email, and call transcripts. The agent autonomously monitors operations, flags anomalies, and provides detailed briefings based on integrated data analysis.

Open-Source Claude Code Skill for Family Logistics Coordination
A developer built Parent Helper, a Claude Code skill that coordinates family schedules, meal planning, and grocery optimization using a single markdown file and MCP integrations. The tool projects $4.3K/year grocery savings by splitting lists across stores based on price.

OpenClaw setup for college baseball score updates with Telegram alerts
A developer built an OpenClaw flow that checks ASU and GT baseball games every ~8 minutes using ESPN's college baseball scoreboard API, sending Telegram alerts only when scores, innings, or final results change to avoid spam.