Gemma4 26B-A4B: 145 Tokens/s on RTX 4090 with Web Search

Gemma4 26B-A4B Performance and Features

The source reports that the gemma-4-26B-A4B model generates approximately 145 tokens per second on an RTX 4090 GPU. At that speed, the model is fast enough for responsive local applications.
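To put the reported figure in perspective, here is a minimal sketch that converts a throughput number into an estimated generation time. The helper name and the reply lengths are hypothetical, chosen only for illustration. The 145 tokens/s value is the approximate figure from the source. The estimate covers token generation only and ignores prompt processing and any web search latency.

```python
def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Estimate the time to generate num_tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Approximate throughput reported in the source for an RTX 4090.
REPORTED_TPS = 145.0

# Hypothetical reply lengths, not taken from the source.
for reply_tokens in (100, 500, 2000):
    seconds = generation_time_seconds(reply_tokens, REPORTED_TPS)
    print(f"{reply_tokens} tokens: ~{seconds:.1f} s")
```

Under these assumptions, a 500-token reply would take roughly 3.4 seconds to generate.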

Key Features from Source

Model: gemma-4-26B-A4B
Performance: ~145 t/s (tokens per second) on RTX 4090
Integration: Web search MCP (Model Context Protocol) support
Multimodal: Image support included
Platforms: Setup documented for Mac and iPhone usage

The source mentions that a few simple tricks and a short system prompt improve the experience, though the excerpt does not give details. The author documents the complete setup in a blog post referenced in the source, covering configuration, system prompts, optimization techniques, and usage across multiple devices.

📖 Read the full source: r/LocalLLaMA

