Jake Benchmark v1: Local LLM Performance Testing for OpenClaw AI Agents

The Jake Benchmark v1 is a performance evaluation tool for local LLMs functioning as AI agents with OpenClaw. It tests models on 22 practical tasks to determine their effectiveness in real-world agent scenarios.
Test Setup and Methodology
The benchmark was run on a Raspberry Pi with Ollama running on an NVIDIA 3090 GPU. The developer tested 7 different local LLMs to identify the best model for agent work with OpenClaw.
Task Categories
The 22 tasks covered real-world scenarios including:
- Reading emails and creating tasks from them
- Scheduling meetings and checking for conflicts
- Phishing detection (specifically a fake email pretending to be the owner asking for a bitcoin wallet key)
- Error handling
Key Results
The performance variation was significant across models:
- Qwen 27B: Scored 59.4% - successfully handled emails, scheduled meetings, detected phishing attempts, and managed errors
- Nemotron 30B: Scored 1.6% - attempted to solve tasks by running
apt-get install git
Notable Observations
The phishing test revealed interesting behaviors:
- The best model refused the phishing request immediately
- The worst model read the secrets file three times before deciding not to share the information
Dashboard Features
The benchmark includes an interactive dashboard that allows users to:
- Click into any model to view the full conversation
- See exactly what each model did during tasks
- Identify where models went wrong in their execution
The tool is available on GitHub for developers to run their own evaluations and compare local LLM performance for agent tasks.
📖 Read the full source: r/openclaw
👀 See Also

Building a Coding Agent for 8k Context: Planner/Executor Split, Token Budgeting, and Parallel Execution
A detailed breakdown of building a CLI coding agent designed around 8k token limits, using a planner/executor architecture, strict token budgeting, and parallel task execution.

OpenRouter Model Pricing and Intelligence-per-Dollar Analysis
A Reddit user compiled OpenRouter API pricing for 16 AI models and calculated intelligence-per-dollar values, identifying MiMo-V2-Flash as best value at $0.09/M tokens and GPT-5.4 as most intelligent at $2.50/M tokens.

WeAreHere Browser Extension and MCP Tools Scan Website Privacy Practices
Two open-source tools—barebrowse and wearehere—scan websites for trackers, fingerprinting, and data broker connections. The wearehere browser extension shows real-time privacy scores (0-100) as you browse, while MCP servers enable AI assistants to assess any site on command.

Gemma 4 26B vs Qwen 3.5 27B: Local Business Workflow Benchmark on RTX 4090
A developer tested Gemma 4 26B and Qwen 3.5 27B on an RTX 4090 workstation for 18 real business operator tasks. Gemma won 13-5, showing faster speed and better discipline for daily execution work, while Qwen excelled at broader strategic thinking.