Open-Source Benchmark Runner for Testing OpenClaw Agents on Real Workflows

A Reddit user has released personal_agent_eval (repo: github.com/javiersgjavi/personal_agent_eval), an open-source tool for benchmarking OpenClaw agents on realistic, messy workflows rather than public toy datasets.
Workflow
Define test cases as YAML files containing:
- Input messages
- Expected artifacts
- Evaluation criteria
- Deterministic checks
- Run profiles and judge profiles
The runner executes cases against an actual OpenClaw instance, stores outputs, evaluates runs, and generates reports and charts.
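To make the idea of deterministic checks concrete, here is a minimal Python sketch. It does not reproduce the repo's YAML schema or its API: the `BenchmarkCase` fields, the `run_deterministic_checks` function, and the check names are all hypothetical, chosen only to mirror the case contents listed above. The point is that this kind of check inspects stored outputs with plain code, with no model or judge involved.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class BenchmarkCase:
    """Hypothetical in-memory form of one test case (field names are assumptions)."""
    case_id: str
    input_messages: list[str]
    # Artifact paths, relative to the directory where a run's outputs are stored.
    expected_artifacts: list[str]
    # Maps an artifact name to phrases that must appear in it.
    required_phrases: dict[str, list[str]] = field(default_factory=dict)


def run_deterministic_checks(case: BenchmarkCase, output_dir: Path) -> dict[str, bool]:
    """Return one pass/fail result per check, computed without any model call."""
    results: dict[str, bool] = {}
    for name in case.expected_artifacts:
        results[f"exists:{name}"] = (output_dir / name).is_file()
    for name, phrases in case.required_phrases.items():
        path = output_dir / name
        # A missing artifact is treated as empty text, so its phrase checks fail.
        text = path.read_text(encoding="utf-8") if path.is_file() else ""
        for phrase in phrases:
            results[f"contains:{name}:{phrase}"] = phrase in text
    return results


if __name__ == "__main__":
    case = BenchmarkCase(
        case_id="invoice-summary",
        input_messages=["Summarize the attached invoice into summary.md"],
        expected_artifacts=["summary.md"],
        required_phrases={"summary.md": ["Invoice total"]},
    )
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # Stand-in for an artifact the agent would have written during a run.
        (out / "summary.md").write_text("Invoice total: 120 EUR\n", encoding="utf-8")
        print(run_deterministic_checks(case, out))
```

Checks like these are cheap and repeatable, which is why they pair well with the more subjective evaluation criteria scored by a judge profile.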
Key Feature: Real Workspace Import
You can import your actual OpenClaw workspace, including memory, skills, files, prompts, and context, instead of a stripped-down imitation. The agent runs in a real OpenClaw instance, so the benchmark tests the exact agent you use daily.
Private Evaluation Sets
The author deliberately does not publish their private evaluation sets, so that they do not go stale the way public benchmarks do. However, the repo includes example cases, configs, evaluation profiles, deterministic checks, and chart generation so you can build your own private suite.
SKILL.md for Agent Assistance
A SKILL.md file in the repo is designed to give an agent enough context to help you define new benchmark cases, run profiles, evaluation criteria, and deterministic checks, which reduces manual editing.
Sample Results (Author’s Private Run)
The author shared a single-run comparison (metric unclear, likely weighted average 0-10):
- Claude Opus 4.6: 9.44
- GLM 5.1: 9.31
- GPT-5.5: 9.31
- Claude Sonnet 4.6: 9.25
- DeepSeek V4 Flash: 8.61
- Gemma 4 31B: 8.39
- DeepSeek V4 Pro: 8.28
- Kimi K2.6: 7.97
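Since the metric is not stated, the following is only an illustration of what a weighted average on a 0-10 scale would look like if that guess is right. The dimension names, weights, and scores are invented for the example and are not the author's.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, on the same scale as the inputs."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


if __name__ == "__main__":
    # Hypothetical dimensions and weights; every value here is made up.
    scores = {"task_completion": 9.0, "tool_use": 8.0, "instruction_following": 10.0}
    weights = {"task_completion": 5, "tool_use": 3, "instruction_following": 2}
    # (9.0*5 + 8.0*3 + 10.0*2) / 10 = 89 / 10
    print(round(weighted_score(scores, weights), 2))  # 8.9
```

Under a scheme like this, a gap of a few hundredths between two models can come from a single dimension on a single case, which is one more reason to read a single-run ranking cautiously.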
More interesting than scores: failure modes. Some models reason well but are clumsy with tools; cheaper models degrade on long or stateful tasks; some failures are model behavior, others are OpenClaw/tooling edge cases exposed by the benchmark.
Who It’s For
OpenClaw users who run agents for real work and want to compare models on their own private tasks rather than arguing from vibes or generic leaderboards.
📖 Read the full source: r/openclaw