Run Gemma 4 31B Locally as Autonomous Agent with Claude Code

Local Agent Setup with Gemma 4 and Claude Code

A developer documented their process of replacing Anthropic's Claude API with a local 31-billion parameter Gemma 4 model to create an autonomous coding agent with full shell access via Claude Code CLI. The goal was to enable the local LLM to not just write code in chat but autonomously interact with the terminal, create folders, read structures, and act as a proactive development agent.

Hardware and Software Stack

OS: Windows 11
CPU & RAM: Intel Core Ultra 9 285K CPU with 64GB system RAM
GPUs: NVIDIA RTX 4060 (8GB) + NVIDIA RTX 3050 (8GB) = 16GB total VRAM
Core Model: google_gemma-4-31B-it (GGUF V3)
Software Stack:
- llama.cpp (llama-server) - latest b8672 build
- Claude Code CLI - v2.1.92
- LiteLLM + custom Python gateway (agent_router.py) to bridge Anthropic streaming chunks to OpenAI APIs

Problem 1: Tool Call Parsing Failures

Initially, Gemma 4 refused to execute tools through the custom API routing, defaulting to apologies rather than action. When forced to output system tool calls natively, Claude Code CLI threw TypeScript errors: Cannot read properties of undefined (reading 'input_tokens').

The Fix: Gemma 4 uses an invisible <thought> reasoning block before finalizing output. The agent_router.py script was expecting traditional continuous text chunks, causing it to skip sending the mandatory initial message_start Anthropic event. The developer modified the Python interception loop to explicitly extract and combine reasoning_content with standard outputs, ensuring the stream always initialized with full usage metrics. Upgrading to llama.cpp build b8672 was mandatory for proper tokenizer functionality.

Problem 2: Context Window Limitations

Claude Code v2.1.92 operates with a massive system prompt that embeds the active folder tree and system instructions, dumping 7,182 tokens into the local server upon initialization. The initial n_ctx (context window) was capped at 4096 to save VRAM, causing immediate server crashes.

The Solution: The context window was doubled to 16,384 to accommodate the initial prompt and conversation history.

Problem 3: VRAM Allocation Challenges

With a 16K context window for a 31B model, VRAM allocation became problematic. A 16K context window using default settings requires approximately 6.4 GB of KV Cache alone. Windows WDDM overhead reserves roughly 20% of GPU memory for display/background buffers, leaving only ~12.8 GB accessible out of 16GB total VRAM before CUDA_out_of_memory errors.

The initial calculation showed: Model (13 GB) + KV Cache (6.4 GB) = 19.4 GB, exceeding available VRAM.

Final Configuration

The Math & Solution: The developer abandoned the Q3_K_M model (~13.7GB) and switched to the IQ3_XS format (~12.9GB). The optimized server startup command:

bat.\llm-server\llama-server.exe -m D:\gemma4\google_gemma-4-31B-it-IQ3_XS.gguf -c 16384 -ngl 38 -ctk q8_0 -ctv q8_0 --host 127.0.0.1 --port 8080

Key flags:

-ctk q8_0 -ctv q8_0: 8-bit KV Cache quantization that halved the KV Cache footprint from 6.4 GB
-c 16384: 16K context window
-ngl 38: Number of GPU layers

This configuration successfully runs Gemma 4 as a local autonomous agent on 16GB VRAM, though the source notes it works "almost" perfectly with some remaining challenges.

📖 Read the full source: r/LocalLLaMA