Open-Source Benchmark Runner for Testing OpenClaw Agents on Real Workflows

✍️ OpenClawRadar | 📅 Published: May 14, 2026

A Reddit user has released an open-source tool called personal_agent_eval (repo: github.com/javiersgjavi/personal_agent_eval) for benchmarking OpenClaw agents on realistic, messy workflows rather than public toy datasets.

Workflow

Define test cases as YAML files containing:

  • Input messages
  • Expected artifacts
  • Evaluation criteria
  • Deterministic checks
  • Run profiles and judge profiles
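As a rough illustration of the fields listed above, here is a minimal Python sketch of what one case might look like once loaded from YAML. The class name, field names, and sample values are all hypothetical; the actual schema used by personal_agent_eval is not shown in the source and may differ.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkCase:
    """Hypothetical in-memory shape of one YAML test case (not the repo's schema)."""
    case_id: str
    input_messages: list[str]
    expected_artifacts: list[str]
    evaluation_criteria: list[str]
    deterministic_checks: list[str]
    run_profile: str = "default"
    judge_profile: str = "default"

    def missing_fields(self) -> list[str]:
        """Return the names of required list fields that are empty."""
        required = {
            "input_messages": self.input_messages,
            "expected_artifacts": self.expected_artifacts,
            "evaluation_criteria": self.evaluation_criteria,
        }
        return [name for name, value in required.items() if not value]


# Illustrative values only; a real case would come from your own private suite.
case = BenchmarkCase(
    case_id="weekly-summary",
    input_messages=["Summarize this week's notes into summary.md"],
    expected_artifacts=["summary.md"],
    evaluation_criteria=["Covers every project mentioned in the notes"],
    deterministic_checks=["summary.md exists and is non-empty"],
)

incomplete = BenchmarkCase("draft", [], [], ["Tone is neutral"], [])

print(case.missing_fields())
print(incomplete.missing_fields())
```

Validating a case before a run is a cheap way to catch half-written definitions, which matters when each run costs real model calls.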

The runner executes cases against an actual OpenClaw instance, stores outputs, evaluates runs, and generates reports and charts.
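A deterministic check is one that needs no judge model, for example confirming that the agent actually produced the files a case expects. The sketch below shows that idea on a temporary directory standing in for a stored run. The function name and file names are assumptions made for illustration and are not taken from the repo.

```python
import tempfile
from pathlib import Path


def check_expected_artifacts(output_dir: Path, expected: list[str]) -> dict[str, bool]:
    """Map each expected artifact path to whether it exists and is non-empty."""
    results = {}
    for relative_path in expected:
        artifact = output_dir / relative_path
        results[relative_path] = artifact.is_file() and artifact.stat().st_size > 0
    return results


# Simulate a stored run in which the agent wrote one of two expected files.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "summary.md").write_text("# Weekly summary\n", encoding="utf-8")
    report = check_expected_artifacts(run_dir, ["summary.md", "invoice.csv"])

# The case passes its deterministic checks only if every artifact is present.
passed = all(report.values())
print(report, passed)
```

Checks like this give the same answer on every run, so they complement the judge-based criteria, which can vary between evaluations.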

Key Feature: Real Workspace Import

You can import your actual OpenClaw workspace — including memory, skills, files, prompts, and context — instead of a stripped-down imitation. The agent runs in a real OpenClaw instance, testing the exact agent you use daily.

Private Evaluation Sets

The author deliberately keeps their own evaluation sets private, on the grounds that published benchmarks go stale. However, the repo includes example cases, configs, evaluation profiles, deterministic checks, and chart generation so you can build your own private suite.

SKILL.md for Agent Assistance

A SKILL.md file in the repo is designed to give an agent enough context to help you define new benchmark cases, run profiles, evaluation criteria, and deterministic checks — reducing manual editing.

Sample Results (Author’s Private Run)

The author shared a single-run comparison (the metric is unclear, likely a weighted average on a 0-10 scale):

Claude Opus 4.6 - 9.44
GLM 5.1 - 9.31
GPT-5.5 - 9.31
Claude Sonnet 4.6 - 9.25
DeepSeek V4 Flash - 8.61
Gemma 4 31B - 8.39
DeepSeek V4 Pro - 8.28
Kimi K2.6 - 7.97
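Since the metric behind these numbers is not documented, the following is only a sketch of how a weighted average on a 0-10 scale could be computed from per-criterion scores. The criterion names, scores, and weights are made up for illustration and do not come from the author's run.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, normalized by the total weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical criteria, each scored 0-10 by a judge.
example_scores = {"correctness": 9.0, "tool_use": 8.0}
example_weights = {"correctness": 0.75, "tool_use": 0.25}

overall = weighted_score(example_scores, example_weights)
print(round(overall, 2))  # 8.75
```

Under a scheme like this, a gap of a few hundredths between two models on a single run can easily come from one criterion on one case, so the ranking is best read as approximate.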

The failure modes are more interesting than the scores. Some models reason well but are clumsy with tools, and cheaper models degrade on long or stateful tasks. Some failures stem from model behavior; others are OpenClaw or tooling edge cases exposed by the benchmark.

Who It’s For

OpenClaw users who run agents for real work and want to compare models on their own private tasks rather than arguing from vibes or generic leaderboards.

📖 Read the full source: r/openclaw
