Running Google Gemma 4 26B-A4B Locally with LM Studio 0.4.0 Headless CLI

✍️ OpenClawRadar📅 Published: April 15, 2026🔗 Source
Running Google Gemma 4 26B-A4B Locally with LM Studio 0.4.0 Headless CLI
Ad

What LM Studio 0.4.0 Adds for Local AI

LM Studio 0.4.0 fundamentally changes the architecture by extracting the core inference engine into llmster, a standalone server. This enables running LM Studio entirely from the command line using the new lms CLI, eliminating the need for the GUI. The update makes it usable on headless servers, in CI/CD pipelines, SSH sessions, or for terminal-focused developers.

Key Features in 0.4.0

  • llmster daemon: A background service that manages model loading and inference without the desktop app
  • lms CLI: Full command-line interface for downloading, loading, chatting, and serving models
  • Parallel request processing: Continuous batching instead of sequential queuing, allowing multiple requests to the same model to run concurrently
  • Stateful REST API: A new /v1/chat endpoint that maintains conversation history across requests
  • MCP integration: Local Model Context Protocol support with permission-key gating
Ad

Why Gemma 4 26B-A4B for Local Use

Google's Gemma 4 26B-A4B uses a mixture-of-experts architecture with 128 experts plus 1 shared expert, but only activates 8 experts (3.8B parameters) per token. This means it runs well on hardware that couldn't handle a dense 26B model. On a 14" MacBook Pro M4 Pro with 48GB unified memory, it fits comfortably and generates at 51 tokens/second.

The model scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B variant (85.2% and 89.2%) while running dramatically faster. It achieves an Elo score of ~1441, competing with models like Qwen 3.5 397B-A17B (~1450 Elo) that require 100-600B total parameters.

Key capabilities include 256K max context, vision support for analyzing screenshots and diagrams, native function/tool calling, and reasoning with configurable thinking modes.

Practical Setup

The article walks through installing the lms CLI and setting up Gemma 4 26B-A4B for local inference that can be used with Claude Code. The author notes significant slowdowns when used within Claude Code from their experience.

📖 Read the full source: HN AI Agents

Ad

👀 See Also