Jake Benchmark v1: Local LLM Performance Testing for OpenClaw AI Agents

✍️ OpenClawRadar📅 Published: March 23, 2026🔗 Source

The Jake Benchmark v1 is a performance evaluation tool for local LLMs functioning as AI agents with OpenClaw. It tests models on 22 practical tasks to determine their effectiveness in real-world agent scenarios.

Test Setup and Methodology

The benchmark was run on a Raspberry Pi with Ollama running on an NVIDIA 3090 GPU. The developer tested 7 different local LLMs to identify the best model for agent work with OpenClaw.

Task Categories

The 22 tasks covered real-world scenarios including:

Reading emails and creating tasks from them
Scheduling meetings and checking for conflicts
Phishing detection (specifically a fake email pretending to be the owner asking for a bitcoin wallet key)
Error handling

Key Results

The performance variation was significant across models:

Qwen 27B: Scored 59.4% - successfully handled emails, scheduled meetings, detected phishing attempts, and managed errors
Nemotron 30B: Scored 1.6% - attempted to solve tasks by running apt-get install git

Notable Observations

The phishing test revealed interesting behaviors:

The best model refused the phishing request immediately
The worst model read the secrets file three times before deciding not to share the information

Dashboard Features

The benchmark includes an interactive dashboard that allows users to:

Click into any model to view the full conversation
See exactly what each model did during tasks
Identify where models went wrong in their execution

The tool is available on GitHub for developers to run their own evaluations and compare local LLM performance for agent tasks.

📖 Read the full source: r/openclaw

👀 See Also

Tools

Memctl: Open Source MCP Server for Persistent Memory in AI Coding Agents

Memctl is an open source MCP server that provides AI coding agents with persistent memory across sessions, machines, and IDEs. Built primarily with Claude Code in two weeks, it stores project context and serves it back in subsequent sessions.

Mar 2, 2026, 01:45 AM UTC

OpenClawRadar

Tools

ClearSpec: A Spec Generator to Reduce Hallucination in Claude Code

ClearSpec is a tool that generates structured specifications from plain English descriptions, connecting to GitHub repos to reference real file paths and dependencies, then uses those specs as prompts for Claude Code to provide better context.

Apr 21, 2026, 06:25 PM UTC

OpenClawRadar

Tools

Ink: A Deployment Platform Where Claude AI Agents Are the Primary Users

Ink (ml.ink) is a deployment platform designed for AI agents like Claude, featuring one tool call deployment, auto-detection of frameworks, and integrated services including compute, databases, DNS, secrets, domains, metrics, and logs.

Mar 12, 2026, 11:45 PM UTC

OpenClawRadar

Tools

Reseed CLI: Extract Design Systems from Any Site for Claude Code and Cursor

Reseed is a CLI that extracts design tokens (colors, spacing, type scale, radii) from any website and generates a tailwind.config.ts, design-system.md, and reference HTML for Claude Code and Cursor to use.

May 7, 2026, 02:18 AM UTC

OpenClawRadar