PhAIL Benchmark Tests VLA Models on Real Warehouse Robot Tasks

PhAIL is a physical AI benchmark that measures how well vision-language-action (VLA) models perform on commercial robotics tasks. The creator built it because they couldn't find honest performance numbers for these models in practical applications.
Benchmark Details
The benchmark tests four VLA models on bin-to-bin order picking, one of the most common warehouse operations:
- OpenPI/pi0.5
- GR00T
- ACT
- SmolVLA
All tests use the same hardware: a Franka FR3 robot with a Robotiq 2F-85 gripper (the DROID setup). The objects are identical across hundreds of blind runs, in which the operator doesn't know which model is running.
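The source does not describe how the blind runs are scheduled, so the following is only a minimal sketch of one way such blinding could work. It assumes each model gets the same number of runs and that the operator sees only opaque run IDs. The function name `blind_schedule`, the `run-0001` ID format, and the seed are hypothetical and not part of PhAIL.

```python
import random

# Model names as listed in the article.
MODELS = ["OpenPI/pi0.5", "GR00T", "ACT", "SmolVLA"]


def blind_schedule(runs_per_model: int, seed: int):
    """Return (operator_view, key) for a shuffled, blinded run order.

    Hypothetical helper: not taken from the PhAIL codebase.
    """
    rng = random.Random(seed)  # seeded so the schedule can be reproduced later
    order = [model for model in MODELS for _ in range(runs_per_model)]
    rng.shuffle(order)

    # The key maps opaque run IDs to models and stays hidden from the operator.
    key = {f"run-{i:04d}": model for i, model in enumerate(order, start=1)}

    # The operator only ever sees the run IDs, never the model names.
    operator_view = list(key)
    return operator_view, key


if __name__ == "__main__":
    operator_view, key = blind_schedule(runs_per_model=3, seed=0)
    print(operator_view[:4])
```

Keeping the ID-to-model key separate from the operator's list is what makes the run blind: results can be recorded per run ID and matched to models only after scoring.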
Performance Results
The best model reached roughly a fifth of the throughput of a human teleoperating the same robot, and about a twentieth of a human working by hand:
- Best model performance: 64 units per hour (UPH)
- Human teleoperating the same robot: 330 UPH
- Human performing the task by hand: 1,300+ UPH
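To make the gap concrete, the short sketch below turns the three reported throughput figures into speed ratios and per-unit cycle times. The dictionary keys and function names are illustrative. Because the manual figure is reported as "1,300+", it is treated here as a lower bound.

```python
# Throughput figures quoted above, in units per hour (UPH).
# "1,300+" is a lower bound, so ratios computed from it are lower bounds too.
THROUGHPUT_UPH = {
    "best_vla_model": 64,
    "human_teleoperation": 330,
    "human_by_hand": 1300,
}


def speedup_over_model(baseline: str) -> float:
    """How many times faster the baseline is than the best VLA model."""
    return THROUGHPUT_UPH[baseline] / THROUGHPUT_UPH["best_vla_model"]


def seconds_per_unit(name: str) -> float:
    """Average time spent on one unit, in seconds."""
    return 3600 / THROUGHPUT_UPH[name]


if __name__ == "__main__":
    for name in THROUGHPUT_UPH:
        print(
            f"{name}: {seconds_per_unit(name):.1f} s/unit, "
            f"{speedup_over_model(name):.2f}x the best model"
        )
```

At 64 UPH the best model averages about 56 seconds per pick. Teleoperation is roughly 5.2 times faster, and manual picking is at least 20 times faster.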
Open Data and Methodology
Everything from the benchmark is publicly available:
- Every run with synced video and telemetry data
- The fine-tuning dataset used for training
- Training scripts
- An open leaderboard accepting new submissions
The creator is available to answer questions about methodology, the specific models tested, or observations from the benchmark runs.
📖 Read the full source: HN AI Agents