Jake Benchmark v1: Local LLM Performance Testing for OpenClaw AI Agents

✍️ OpenClawRadar📅 Published: March 23, 2026🔗 Source
Jake Benchmark v1: Local LLM Performance Testing for OpenClaw AI Agents
Ad

The Jake Benchmark v1 is a performance evaluation tool for local LLMs functioning as AI agents with OpenClaw. It tests models on 22 practical tasks to determine their effectiveness in real-world agent scenarios.

Test Setup and Methodology

The benchmark was run on a Raspberry Pi with Ollama running on an NVIDIA 3090 GPU. The developer tested 7 different local LLMs to identify the best model for agent work with OpenClaw.

Task Categories

The 22 tasks covered real-world scenarios including:

  • Reading emails and creating tasks from them
  • Scheduling meetings and checking for conflicts
  • Phishing detection (specifically a fake email pretending to be the owner asking for a bitcoin wallet key)
  • Error handling

Key Results

The performance variation was significant across models:

  • Qwen 27B: Scored 59.4% - successfully handled emails, scheduled meetings, detected phishing attempts, and managed errors
  • Nemotron 30B: Scored 1.6% - attempted to solve tasks by running apt-get install git
Ad

Notable Observations

The phishing test revealed interesting behaviors:

  • The best model refused the phishing request immediately
  • The worst model read the secrets file three times before deciding not to share the information

Dashboard Features

The benchmark includes an interactive dashboard that allows users to:

  • Click into any model to view the full conversation
  • See exactly what each model did during tasks
  • Identify where models went wrong in their execution

The tool is available on GitHub for developers to run their own evaluations and compare local LLM performance for agent tasks.

📖 Read the full source: r/openclaw

Ad

👀 See Also