TinyFish Web Agent Outperforms Competitors in Web Task Benchmarking

The TinyFish Web Agent achieved an 81.9% success rate on hard tasks in the Online-Mind2Web benchmark, which consists of 300 tasks across 136 live websites. That figure contrasts sharply with major competitors such as OpenAI Operator, which managed only a 43.2% success rate on similar tasks.
The Online-Mind2Web benchmark is a rigorous measure of a web agent's capabilities, with tasks ranging from easy ones, like browsing credit card offers on Marriott, to complex ones, such as booking event tickets with dynamic pricing. Tasks involve multiple steps on live websites, including handling form validation and pop-ups, which makes the benchmark a more realistic test than less reliable ones like WebVoyager.
TinyFish distinguishes itself by limiting compounding errors across multi-step tasks. Its success rate drops only 15.6 points from easy to hard tasks, compared with much larger drops for other systems, which points to robustness in real-world scenarios. TinyFish has also published all 300 task runs, including the 40 failures, which offers transparency into its performance characteristics and failure cases, such as infrastructure-level anti-bot blocks encountered on sites like apartments.com.
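The reported figures imply a few numbers the article does not state directly. The sketch below derives them, assuming the 15.6-point drop is the difference between the easy and hard success rates and that the 40 failures are counted across all 300 tasks. The `per_task_success` function is a purely illustrative model of compounding errors (independent steps, equal per-step success probability), not something described in the benchmark or by TinyFish; all function names are hypothetical.

```python
# Figures quoted in the article.
HARD_RATE = 81.9          # % success on hard tasks
EASY_TO_HARD_DROP = 15.6  # percentage points lost from easy to hard
TOTAL_TASKS = 300
FAILURES = 40


def implied_easy_rate(hard_rate: float, drop: float) -> float:
    """Easy-task success rate, assuming drop = easy rate - hard rate."""
    return round(hard_rate + drop, 1)


def overall_rate(total: int, failures: int) -> float:
    """Overall success rate in percent, assuming failures span all tasks."""
    return round(100 * (total - failures) / total, 1)


def per_task_success(step_success: float, steps: int) -> float:
    """Illustrative assumption: each step succeeds independently with the
    same probability, so small per-step error rates compound with length."""
    return step_success ** steps


if __name__ == "__main__":
    print("implied easy-task rate:", implied_easy_rate(HARD_RATE, EASY_TO_HARD_DROP))
    print("implied overall rate:", overall_rate(TOTAL_TASKS, FAILURES))
    print("10-step task at 98% per step:", per_task_success(0.98, 10))
```

Under these assumptions the easy-task rate comes out at 97.5% and the overall rate at roughly 86.7%. The compounding model shows why longer tasks are harder: even a small per-step failure rate multiplies across steps.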
Developers looking for a robust web automation tool may find TinyFish's open-source cookbook repository useful, as it provides insight into the agent's architecture and execution methodology.
📖 Read the full source: HN AI Agents