Open-Source Benchmark Runner for Testing OpenClaw Agents on Real Workflows

OpenClawRadar · Published: May 14, 2026

A Reddit user has released an open-source tool called personal_agent_eval (repo: github.com/javiersgjavi/personal_agent_eval) for benchmarking OpenClaw agents on realistic, messy workflows rather than public toy datasets.

Workflow

Define test cases as YAML files containing:

Input messages
Expected artifacts
Evaluation criteria
Deterministic checks
Run profiles and judge profiles

The runner executes cases against an actual OpenClaw instance, stores outputs, evaluates runs, and generates reports and charts.
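To make the case-then-check loop concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names (`input_messages`, `expected_artifacts`, `checks`), the `run_agent` stub, and the two check functions are hypothetical and are not taken from personal_agent_eval, whose actual YAML schema and runner may look quite different. A plain dict stands in for a parsed YAML case, and judge-based evaluation is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical case; a plain dict stands in for a parsed YAML file.
CASE = {
    "id": "weekly-report",
    "input_messages": ["Summarize this week's notes into report.md"],
    "expected_artifacts": ["report.md"],
    "checks": [
        {"type": "artifact_exists", "path": "report.md"},
        {"type": "artifact_contains", "path": "report.md", "text": "Summary"},
    ],
}


@dataclass
class RunOutput:
    """What a run leaves behind: artifact path -> file content."""
    artifacts: dict[str, str]


def run_agent(messages: list[str]) -> RunOutput:
    """Stub standing in for a real agent run; it returns canned output."""
    return RunOutput(artifacts={"report.md": "# Summary\n- shipped the eval suite\n"})


def artifact_exists(out: RunOutput, check: dict) -> bool:
    return check["path"] in out.artifacts


def artifact_contains(out: RunOutput, check: dict) -> bool:
    return check["text"] in out.artifacts.get(check["path"], "")


# Registry mapping a check's "type" field to the function that implements it.
CHECKS: dict[str, Callable[[RunOutput, dict], bool]] = {
    "artifact_exists": artifact_exists,
    "artifact_contains": artifact_contains,
}


def evaluate(case: dict) -> dict:
    """Run one case and apply its deterministic checks in order."""
    out = run_agent(case["input_messages"])
    results = [(c["type"], CHECKS[c["type"]](out, c)) for c in case["checks"]]
    return {
        "case": case["id"],
        "checks": results,
        "passed": all(ok for _, ok in results),
    }


if __name__ == "__main__":
    print(evaluate(CASE))
```

Deterministic checks like these give the same verdict on every run for the same output, which makes them a useful baseline alongside any judge-based scoring.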

Key Feature: Real Workspace Import

You can import your actual OpenClaw workspace — including memory, skills, files, prompts, and context — instead of a stripped-down imitation. The agent runs in a real OpenClaw instance, testing the exact agent you use daily.

Private Evaluation Sets

The author deliberately keeps their private evaluation sets unpublished to avoid the staleness that affects public benchmarks. The repo does, however, include example cases, configs, evaluation profiles, deterministic checks, and chart generation so you can build your own private suite.

SKILL.md for Agent Assistance

A SKILL.md file in the repo is designed to give an agent enough context to help you define new benchmark cases, run profiles, evaluation criteria, and deterministic checks — reducing manual editing.

Sample Results (Author’s Private Run)

The author shared a single-run comparison (the metric is not stated; it is likely a weighted average on a 0-10 scale):

Claude Opus 4.6 - 9.44
GLM 5.1 - 9.31
GPT-5.5 - 9.31
Claude Sonnet 4.6 - 9.25
DeepSeek V4 Flash - 8.61
Gemma 4 31B - 8.39
DeepSeek V4 Pro - 8.28
Kimi K2.6 - 7.97
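Since the metric is not stated, the following is only a sketch of how a weighted 0-10 average could be computed, not the author's formula. The criterion names, weights, and per-criterion scores are all hypothetical.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores, each on a 0-10 scale."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for criteria: {sorted(missing)}")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical criteria and weights, chosen only for illustration.
weights = {"task_completion": 5, "tool_use": 3, "communication": 2}
scores = {"task_completion": 10.0, "tool_use": 9.0, "communication": 8.0}

print(weighted_score(scores, weights))  # (50 + 27 + 16) / 10 = 9.3
```

Under a scheme like this, the weights decide which criterion moves the final number most.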

More interesting than the scores are the failure modes. Some models reason well but are clumsy with tools; cheaper models degrade on long or stateful tasks; and some failures stem from model behavior, while others are OpenClaw or tooling edge cases that the benchmark exposes.

Who It’s For

OpenClaw users who run agents for real work and want to compare models on their own private tasks rather than arguing from vibes or generic leaderboards.

📖 Read the full source: r/openclaw
