Running Google Gemma 4 26B-A4B Locally with LM Studio 0.4.0 Headless CLI

What LM Studio 0.4.0 Adds for Local AI
LM Studio 0.4.0 fundamentally changes the architecture by extracting the core inference engine into llmster, a standalone server. This enables running LM Studio entirely from the command line using the new lms CLI, eliminating the need for the GUI. The update makes it usable on headless servers, in CI/CD pipelines, SSH sessions, or for terminal-focused developers.
Key Features in 0.4.0
- llmster daemon: A background service that manages model loading and inference without the desktop app
- lms CLI: Full command-line interface for downloading, loading, chatting, and serving models
- Parallel request processing: Continuous batching instead of sequential queuing, allowing multiple requests to the same model to run concurrently
- Stateful REST API: A new /v1/chat endpoint that maintains conversation history across requests
- MCP integration: Local Model Context Protocol support with permission-key gating
Why Gemma 4 26B-A4B for Local Use
Google's Gemma 4 26B-A4B uses a mixture-of-experts architecture with 128 experts plus 1 shared expert, but only activates 8 experts (3.8B parameters) per token. This means it runs well on hardware that couldn't handle a dense 26B model. On a 14" MacBook Pro M4 Pro with 48GB unified memory, it fits comfortably and generates at 51 tokens/second.
The model scores 82.6% on MMLU Pro and 88.3% on AIME 2026, close to the dense 31B variant (85.2% and 89.2%) while running dramatically faster. It achieves an Elo score of ~1441, competing with models like Qwen 3.5 397B-A17B (~1450 Elo) that require 100-600B total parameters.
Key capabilities include 256K max context, vision support for analyzing screenshots and diagrams, native function/tool calling, and reasoning with configurable thinking modes.
Practical Setup
The article walks through installing the lms CLI and setting up Gemma 4 26B-A4B for local inference that can be used with Claude Code. The author notes significant slowdowns when used within Claude Code from their experience.
📖 Read the full source: HN AI Agents
👀 See Also

Running OpenClaw and Codex CLI Natively on Android via AnyClaw APK
A developer has packaged OpenClaw and Codex CLI into an Android APK called AnyClaw, enabling the gateway and Control UI to run locally on ARM64 Android 7.0+ devices without root. The project required building dependencies from source and patching multiple components to handle Android-specific constraints.

Heartbeat-gateway: Event-driven replacement for cron polling in OpenClaw
Heartbeat-gateway is an open-source Python tool that replaces cron-based polling with webhook-driven events for OpenClaw, reducing API costs from ~$86/month to ~$4.50/month and improving latency from up to 30 minutes to under 2 seconds.

Orkestra: Cost-Aware LLM Routing Layer for OpenClaw Reduces API Costs by 60-80%
Orkestra is a modular routing layer that sits in front of LLM calls in OpenClaw, using semantic classification to route prompts to budget, balanced, or premium model tiers. The approach reduced API costs by 60-80% without prompt rewriting or complex rules.

Benchmarking 88 Small GGUF Models on a 16GB Mac Mini M4
An automated pipeline tested 88 GGUF models on a Mac Mini M4 with 16GB RAM, identifying 9 as unusable and 4 LFM2-8B-A1B MoE models on the Pareto frontier for speed and quality.