Gemma Gem: On-Device AI Agent for Browser Automation via WebGPU

Gemma Gem is a Chrome extension that loads Google's Gemma 4 model (E2B or E4B) through WebGPU in an offscreen document and gives it tools to read and interact with web pages. Everything runs in the browser, with no external API calls.

Key Details

The extension provides several tools that run in different contexts:

read_page_content: Read text/HTML of the page or a CSS selector (Content script)
take_screenshot: Capture visible page as PNG (Service worker)
click_element: Click an element by CSS selector (Content script)
type_text: Type into an input by CSS selector (Content script)
scroll_page: Scroll up/down by pixel amount (Content script)
run_javascript: Execute JS in the page context with full DOM access (Service worker)
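
The split between contexts can be written down as a small lookup table. The TypeScript sketch below is illustrative only: the tool names and their contexts come from the list above, while the `ToolContext` type, the `TOOL_CONTEXT` table, and the `contextFor` helper are hypothetical names that do not come from the extension's code.

```typescript
// Hypothetical sketch: where each tool runs, per the list above.
type ToolContext = "content-script" | "service-worker";

type ToolName =
  | "read_page_content"
  | "take_screenshot"
  | "click_element"
  | "type_text"
  | "scroll_page"
  | "run_javascript";

const TOOL_CONTEXT: Record<ToolName, ToolContext> = {
  read_page_content: "content-script",
  take_screenshot: "service-worker",
  click_element: "content-script",
  type_text: "content-script",
  scroll_page: "content-script",
  run_javascript: "service-worker",
};

// Returns the execution context for a tool name, or undefined for an unknown tool.
function contextFor(name: string): ToolContext | undefined {
  return Object.prototype.hasOwnProperty.call(TOOL_CONTEXT, name)
    ? TOOL_CONTEXT[name as ToolName]
    : undefined;
}
```

Keeping the mapping in one typed table means that adding a tool without assigning it a context fails at compile time, which is one reasonable way to keep such a list consistent.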

The architecture uses three main components:

Offscreen document: Hosts the model via @huggingface/transformers + WebGPU, runs the agent loop
Service worker: Routes messages between content scripts and offscreen document, handles take_screenshot and run_javascript
Content script: Injects gem icon + shadow DOM chat overlay, executes DOM tools
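
The routing role of the service worker can be sketched as a single decision: handle the call locally, or forward it to the content script. This is a minimal sketch under stated assumptions, not the extension's actual code. The `ToolCall` shape, the `Handlers` interface, and `routeToolCall` are hypothetical; only the rule that take_screenshot and run_javascript stay in the service worker comes from the component list above.

```typescript
// Hypothetical message shape; the extension's real protocol may differ.
interface ToolCall {
  tool: string;
  args: { [key: string]: unknown };
}

// Handlers are injected so the routing decision stays independent of Chrome APIs.
interface Handlers<T> {
  handleInServiceWorker(call: ToolCall): T;
  forwardToContentScript(call: ToolCall): T;
}

// Per the component list, the service worker handles these two tools itself.
const SERVICE_WORKER_TOOLS: { [tool: string]: true } = {
  take_screenshot: true,
  run_javascript: true,
};

function routeToolCall<T>(call: ToolCall, handlers: Handlers<T>): T {
  if (SERVICE_WORKER_TOOLS[call.tool] === true) {
    return handlers.handleInServiceWorker(call);
  }
  // Every other tool touches the DOM, so it goes to the content script.
  return handlers.forwardToContentScript(call);
}
```

In a real extension the handlers would presumably return promises and wrap Chrome messaging calls such as `chrome.tabs.sendMessage`; the generic `T` leaves that choice to the caller.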

Setup and Usage

Requirements:

Chrome with WebGPU support
~500MB disk for E2B model, ~1.5GB for E4B (cached after first run)

Setup commands:

pnpm install
pnpm build

Load the extension in chrome://extensions (developer mode) from .output/chrome-mv3-dev/.

Usage:

Navigate to any page
Click the gem icon (bottom-right corner) to open the chat
Wait for the model to load (progress is shown on the icon and in the chat)
Ask questions about the page or request actions

Settings and Configuration

Settings available via the gear icon in the chat header:

Model: Switch between Gemma 4 E2B (~500MB) and E4B (~1.5GB); the selection persists across sessions
Thinking: Toggle native Gemma 4 thinking
Max iterations: Cap on tool call loops per request
Clear context: Reset conversation history for the current page
Disable on this site: Disable the extension per-hostname (persisted)
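
As a rough illustration of how these settings might be modeled, the sketch below defines a settings object and the per-hostname disable check. All field names, the `EXAMPLE_SETTINGS` values, and both helper functions are assumptions made for this example; the source does not state the extension's actual defaults or storage format, and persistence (for example via `chrome.storage`) is left out.

```typescript
// Hypothetical settings shape; field names and values are placeholders.
interface Settings {
  model: "E2B" | "E4B";
  thinking: boolean;
  maxIterations: number;
  disabledHosts: string[];
}

const EXAMPLE_SETTINGS: Settings = {
  model: "E2B",
  thinking: false,
  maxIterations: 5, // placeholder, not the extension's actual default
  disabledHosts: [],
};

// True when the extension has been disabled for this hostname.
function isDisabledOn(settings: Settings, hostname: string): boolean {
  return settings.disabledHosts.indexOf(hostname) !== -1;
}

// Returns a new settings object with the hostname added to or removed from
// the disabled list; the input object is left unchanged.
function setDisabledOn(settings: Settings, hostname: string, disabled: boolean): Settings {
  const others = settings.disabledHosts.filter((h) => h !== hostname);
  return {
    ...settings,
    disabledHosts: disabled ? others.concat([hostname]) : others,
  };
}
```

For example, `setDisabledOn(EXAMPLE_SETTINGS, "example.com", true)` yields settings for which `isDisabledOn(..., "example.com")` is true, while other hostnames are unaffected.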

Development and Debugging

Tech stack:

WXT — Chrome extension framework (Vite-based)
@huggingface/transformers — Browser ML inference
marked — Markdown rendering in chat
Gemma 4 E2B / E4B (onnx-community/gemma-4-E2B-it-ONNX, onnx-community/gemma-4-E4B-it-ONNX) — q4f16 quantization, 128K context

Build commands:

pnpm build        # Development build (with logging, source maps)
pnpm build:prod   # Production build (logging silenced, minified)

Debugging locations:

Service worker logs: chrome://extensions → Gemma Gem → "Inspect views: service worker"
Offscreen document logs: chrome://extensions → Gemma Gem → "Inspect views: offscreen.html"
Content script logs: Open DevTools on any page → Console
All extension pages: chrome://inspect#other lists all inspectable extension contexts

The offscreen document logs show model loading, prompt construction, token counts, raw model output, and tool execution.

Technical Notes

The agent/ directory has zero dependencies and defines interfaces (ModelBackend, ToolExecutor), so it can be extracted as a standalone library. With thinking mode enabled, the extension shows the model's chain-of-thought reasoning as it works.
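
To make the role of these two interfaces concrete, here is a minimal sketch of an agent loop built on them, capped by a maximum iteration count like the "Max iterations" setting. Only the interface names ModelBackend and ToolExecutor come from the source. Their members (`generate`, `execute`), the `ModelTurn` and `ToolCall` types, the string-based history, and `runAgent` itself are assumptions for illustration, and the real interfaces may look quite different.

```typescript
// Interface names come from the source; every member below is an assumption.
interface ToolCall {
  name: string;
  args: { [key: string]: unknown };
}

// One model turn: either a final answer or a request to run a tool.
type ModelTurn =
  | { kind: "answer"; text: string }
  | { kind: "tool_call"; call: ToolCall };

interface ModelBackend {
  generate(history: string[]): Promise<ModelTurn>;
}

interface ToolExecutor {
  execute(call: ToolCall): Promise<string>;
}

async function runAgent(
  model: ModelBackend,
  tools: ToolExecutor,
  userMessage: string,
  maxIterations: number,
): Promise<string> {
  const history: string[] = ["user: " + userMessage];
  for (let i = 0; i < maxIterations; i++) {
    const turn = await model.generate(history);
    if (turn.kind === "answer") {
      return turn.text;
    }
    // Run the requested tool and feed its result back to the model.
    const result = await tools.execute(turn.call);
    history.push("tool " + turn.call.name + ": " + result);
  }
  // The cap was reached before the model produced a final answer.
  return "Stopped after " + maxIterations + " tool iterations without a final answer.";
}
```

Because the loop depends only on the two interfaces, it can be exercised with stub implementations and no browser, which is the kind of separation a dependency-free agent/ directory makes possible.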

According to the source, the agent works for simple page questions and running JavaScript, but multi-step tool chains are unreliable and it sometimes ignores its tools entirely.

📖 Read the full source: HN AI Agents