Running Gemma 4 as a Local Autonomous Agent with Claude Code on 16GB VRAM

✍️ OpenClawRadar📅 Published: April 16, 2026🔗 Source
Running Gemma 4 as a Local Autonomous Agent with Claude Code on 16GB VRAM
Ad

Local Agent Setup with Gemma 4 and Claude Code

A developer documented their process of replacing Anthropic's Claude API with a local 31-billion parameter Gemma 4 model to create an autonomous coding agent with full shell access via Claude Code CLI. The goal was to enable the local LLM to not just write code in chat but autonomously interact with the terminal, create folders, read structures, and act as a proactive development agent.

Hardware and Software Stack

  • OS: Windows 11
  • CPU & RAM: Intel Core Ultra 9 285K CPU with 64GB system RAM
  • GPUs: NVIDIA RTX 4060 (8GB) + NVIDIA RTX 3050 (8GB) = 16GB total VRAM
  • Core Model: google_gemma-4-31B-it (GGUF V3)
  • Software Stack:
    • llama.cpp (llama-server) - latest b8672 build
    • Claude Code CLI - v2.1.92
    • LiteLLM + custom Python gateway (agent_router.py) to bridge Anthropic streaming chunks to OpenAI APIs

Problem 1: Tool Call Parsing Failures

Initially, Gemma 4 refused to execute tools through the custom API routing, defaulting to apologies rather than action. When forced to output system tool calls natively, Claude Code CLI threw TypeScript errors: Cannot read properties of undefined (reading 'input_tokens').

The Fix: Gemma 4 uses an invisible <thought> reasoning block before finalizing output. The agent_router.py script was expecting traditional continuous text chunks, causing it to skip sending the mandatory initial message_start Anthropic event. The developer modified the Python interception loop to explicitly extract and combine reasoning_content with standard outputs, ensuring the stream always initialized with full usage metrics. Upgrading to llama.cpp build b8672 was mandatory for proper tokenizer functionality.

Ad

Problem 2: Context Window Limitations

Claude Code v2.1.92 operates with a massive system prompt that embeds the active folder tree and system instructions, dumping 7,182 tokens into the local server upon initialization. The initial n_ctx (context window) was capped at 4096 to save VRAM, causing immediate server crashes.

The Solution: The context window was doubled to 16,384 to accommodate the initial prompt and conversation history.

Problem 3: VRAM Allocation Challenges

With a 16K context window for a 31B model, VRAM allocation became problematic. A 16K context window using default settings requires approximately 6.4 GB of KV Cache alone. Windows WDDM overhead reserves roughly 20% of GPU memory for display/background buffers, leaving only ~12.8 GB accessible out of 16GB total VRAM before CUDA_out_of_memory errors.

The initial calculation showed: Model (13 GB) + KV Cache (6.4 GB) = 19.4 GB, exceeding available VRAM.

Final Configuration

The Math & Solution: The developer abandoned the Q3_K_M model (~13.7GB) and switched to the IQ3_XS format (~12.9GB). The optimized server startup command:

bat.\llm-server\llama-server.exe -m D:\gemma4\google_gemma-4-31B-it-IQ3_XS.gguf -c 16384 -ngl 38 -ctk q8_0 -ctv q8_0 --host 127.0.0.1 --port 8080

Key flags:

  • -ctk q8_0 -ctv q8_0: 8-bit KV Cache quantization that halved the KV Cache footprint from 6.4 GB
  • -c 16384: 16K context window
  • -ngl 38: Number of GPU layers

This configuration successfully runs Gemma 4 as a local autonomous agent on 16GB VRAM, though the source notes it works "almost" perfectly with some remaining challenges.

📖 Read the full source: r/LocalLLaMA

Ad

👀 See Also

Claude Code Agents Orchestrator Pipeline: Work Queues, Agent Spawning, Verification Gates
Use Cases

Claude Code Agents Orchestrator Pipeline: Work Queues, Agent Spawning, Verification Gates

A Reddit post from r/clawdbot details how Claude Code agents operate an AI-run store, handling design, marketing, QA, and ops 30 times daily. It links to Episode 9 of a blog series that explains the orchestrator pipeline in production, including issues not shown in demos.

OpenClawRadar
SDR Uses AI-Generated Video Follow-Ups to Re-engage Cold D2C Prospects
Use Cases

SDR Uses AI-Generated Video Follow-Ups to Re-engage Cold D2C Prospects

An SDR at a SaaS company targeting D2C brands reports success using AI-generated video follow-ups instead of text emails. The workflow involves writing a prompt in Claude, generating a video with Magic Hour, and optionally polishing the voiceover with ElevenLabs.

OpenClawRadar
Case Study: Using Multiple AI Agents to Build a Production C++ Library
Use Cases

Case Study: Using Multiple AI Agents to Build a Production C++ Library

A developer documented a multi-month process using four AI agents (Claude, ChatGPT, Gemini, Grok) with distinct roles to build FAT-P, a header-only C++20 library with 107 headers and zero external dependencies. The system included cross-review, governance documents written by AI, and a demerit tracker to encode failure modes.

OpenClawRadar
Using Opus 4.6 and GPT 5.4 to peer-review a memory stack design for OpenClaw
Use Cases

Using Opus 4.6 and GPT 5.4 to peer-review a memory stack design for OpenClaw

A developer used Claude Opus 4.6 to design a three-layer memory stack for OpenClaw, then had GPT 5.4 peer-review the design. The stack includes Lossless Claw for message preservation, SQLite hybrid search for keyword matching, and Mem0 Cloud for cross-session persistence.

OpenClawRadar