OpenClaw Benchmark Shows Qwen3.5:27B Outperforms Other Local LLMs for Agent Tasks

✍️ OpenClawRadar | 📅 Published: March 28, 2026 | 🔗 Source

Benchmark Setup and Results

A user tested 7 local models on 22 real agent tasks using OpenClaw on a Raspberry Pi 5 with an RTX 3090 running Ollama. The tasks included reading emails, scheduling meetings, creating tasks, detecting phishing, handling errors, and browser automation.

The winner by a massive margin was qwen3.5:27b-q4_K_M at 59.4%. The runner-up (qwen3.5:35b) scored only 23.2%. All other models scored below 5%.

Key Findings

  • The quantized 27B model outscored the larger 35B version by roughly 2.6x (59.4% vs. 23.2%)
  • A 30B model scored dead last at 1.6%
  • Medium thinking worked best; too much thinking actually hurt performance
  • Zero models could complete browser automation tasks
  • The main differentiator between winners and losers was whether the model could find and use command line tools
  • Most models couldn't even find basic tools like the email function
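
The gap between the top two models can be checked directly from the reported scores. The following is a minimal sketch that uses only the two named results from the benchmark; the helper name `score_ratio` is illustrative and not part of OpenClaw or the original post.

```python
# Scores as reported in the article (percent). Only the two named
# models are included; the other models were not named individually.
scores = {
    "qwen3.5:27b-q4_K_M": 59.4,
    "qwen3.5:35b": 23.2,
}


def score_ratio(winner: float, runner_up: float) -> float:
    """Return how many times higher the winner's score is (hypothetical helper)."""
    return winner / runner_up


ratio = score_ratio(scores["qwen3.5:27b-q4_K_M"], scores["qwen3.5:35b"])
gap = scores["qwen3.5:27b-q4_K_M"] - scores["qwen3.5:35b"]

print(f"ratio: {ratio:.2f}x, gap: {gap:.1f} percentage points")
# prints "ratio: 2.56x, gap: 36.2 percentage points"
```

The ratio comes out slightly above 2.5x, and the absolute gap is 36.2 percentage points.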

This benchmark provides concrete data on how different local LLMs perform as AI agents in practical scenarios. The significant performance gap between the top model and others suggests tool-finding capability is a critical bottleneck for local LLM agents.

📖 Read the full source: r/LocalLLaMA
