Qwen3.5:27B Beats Rivals in OpenClaw Agent Benchmark

Benchmark Setup and Results

A user tested seven local models on 22 real agent tasks, using OpenClaw on a Raspberry Pi 5 with the models served by Ollama on an RTX 3090. The tasks covered reading emails, scheduling meetings, creating tasks, detecting phishing, handling errors, and browser automation.

The winner by a massive margin was qwen3.5:27b-q4_K_M at 59.4%. The runner-up (qwen3.5:35b) scored only 23.2%. All other models scored below 5%.
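To put the gap in perspective, the reported scores can be compared directly. The following is a minimal Python sketch that uses only the three figures quoted in this summary; the other models' exact scores are not given here, so they are left out, and the label for the last-place model is a placeholder.

```python
# Scores in percent, as reported in this summary.
scores = {
    "qwen3.5:27b-q4_K_M": 59.4,
    "qwen3.5:35b": 23.2,
    "last-place 30B model": 1.6,  # placeholder label; the model is not named here
}

# Pick the top scorer, then express its lead over each other model as a ratio.
winner = max(scores, key=scores.get)
for name, score in scores.items():
    if name != winner:
        ratio = scores[winner] / score
        print(f"{winner} scored {ratio:.1f}x higher than {name}")
```

On these figures the 27B model's lead over the 35B model works out to roughly 2.6x, and its lead over the last-place model to roughly 37x.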

Key Findings

The quantized 27B model beat the larger 35B version by roughly 2.6x (59.4% vs. 23.2%)
A 30B model scored dead last at 1.6%
Medium thinking worked best - too much thinking actually hurt performance
Zero models could complete browser automation tasks
The main differentiator between winners and losers was whether the model could find and use command line tools
Most models couldn't even find basic tools like the email function
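Since tool discovery was the main differentiator, it can help to confirm that the tools actually exist in the environment before blaming the model. The sketch below is a hypothetical pre-flight check, not OpenClaw's own mechanism: the tool names are placeholders (the post does not list the real ones), and it only tests whether an executable is on PATH using Python's standard `shutil.which`.

```python
import shutil

def find_tools(names):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Hypothetical tool names; the benchmark's real CLI tools are not listed in the post.
required = ["python3", "email-cli-placeholder"]
found = find_tools(required)

# Anything that resolved to None is unavailable to the agent in this environment.
missing = [name for name, path in found.items() if path is None]
print("missing tools:", missing)
```

If a tool shows up as missing here, a failed task says more about the setup than about the model; if every tool is present and the model still cannot locate it, the failure is the kind this benchmark describes.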

This benchmark provides concrete data on how different local LLMs perform as AI agents in practical scenarios. The significant performance gap between the top model and others suggests tool-finding capability is a critical bottleneck for local LLM agents.

📖 Read the full source: r/LocalLLaMA
