PhAIL Benchmark Tests VLA Models on Real Warehouse Robot Tasks

OpenClawRadar | Published: April 1, 2026

PhAIL is a physical AI benchmark that measures how well vision-language-action (VLA) models perform on commercial robotics tasks. The creator built it because they couldn't find honest performance numbers for these models in practical applications.

Benchmark Details

The benchmark tests four VLA models on bin-to-bin order picking, one of the most common warehouse operations:

OpenPI/pi0.5
GR00T
ACT
SmolVLA

All tests run on the same hardware: a Franka FR3 robot with a Robotiq 2F-85 gripper (the DROID setup). Each model handles identical objects across hundreds of blind runs, in which the operator doesn't know which model is running.
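The blind-run idea can be illustrated with a minimal sketch. This is not PhAIL's actual tooling: the function name `make_blind_schedule`, the run ID format, and the number of runs per model are all assumptions made for illustration. The point is only that the operator sees opaque run IDs while the mapping to models stays hidden until scoring.

```python
import random

# Model names as listed in the article.
MODELS = ["OpenPI/pi0.5", "GR00T", "ACT", "SmolVLA"]


def make_blind_schedule(models, runs_per_model, seed):
    """Build a shuffled run schedule (hypothetical helper, not from PhAIL).

    Returns (operator_view, key):
      operator_view: opaque run IDs, in execution order, with no model names
      key: dict mapping each run ID to the model that will actually run
    """
    rng = random.Random(seed)
    # Give every model the same number of runs, then shuffle the order.
    assignments = [m for m in models for _ in range(runs_per_model)]
    rng.shuffle(assignments)
    # The key is kept away from the operator until the runs are scored.
    key = {f"run-{i:04d}": model for i, model in enumerate(assignments)}
    operator_view = list(key.keys())
    return operator_view, key


# runs_per_model=5 is an arbitrary illustrative value.
operator_view, key = make_blind_schedule(MODELS, runs_per_model=5, seed=0)
print(operator_view[:3])
```

Fixing the seed keeps the schedule reproducible for later audit, while the shuffle prevents the operator from inferring the model from run order.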

Performance Results

The best model reached roughly one-fifth of the throughput of a human teleoperating the same robot, and about one-twentieth of a human working by hand:

Best model performance: 64 units per hour (UPH)
Human teleoperating the same robot: 330 UPH
Human performing the task by hand: 1,300+ UPH
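To put these figures in perspective, the short sketch below converts them into slowdown factors and seconds per unit. The UPH values come from the results above (with "1,300+" treated as a lower bound of 1,300); the helper names `slowdown` and `seconds_per_unit` are illustrative, not part of the benchmark.

```python
# Units per hour (UPH) as reported in the article.
# "1,300+" is treated as a lower bound of 1300 here (an assumption).
UPH = {
    "best_model": 64,
    "human_teleop": 330,
    "human_by_hand": 1300,
}


def slowdown(candidate_uph, baseline_uph):
    """How many times slower the candidate is than the baseline."""
    return baseline_uph / candidate_uph


def seconds_per_unit(uph):
    """Average time spent on one unit at a given hourly throughput."""
    return 3600 / uph


model = UPH["best_model"]
for name in ("human_teleop", "human_by_hand"):
    factor = slowdown(model, UPH[name])
    print(f"best model vs {name}: {factor:.1f}x slower")

for name, uph in UPH.items():
    print(f"{name}: {seconds_per_unit(uph):.1f} s per unit")
```

At 64 UPH the best model averages about 56 seconds per pick, compared with roughly 11 seconds for a teleoperating human and under 3 seconds by hand.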

Open Data and Methodology

Everything from the benchmark is publicly available:

Every run with synced video and telemetry data
The fine-tuning dataset used for training
Training scripts
An open leaderboard accepting new submissions

The creator is available to answer questions about methodology, the specific models tested, or observations from the benchmark runs.

📖 Read the full source: HN AI Agents
