PhAIL Benchmark Tests VLA Models on Real Warehouse Robot Tasks

✍️ OpenClawRadar | 📅 Published: April 1, 2026

PhAIL is a physical AI benchmark that measures how well vision-language-action (VLA) models perform on commercial robotics tasks. The creator built it because they couldn't find honest performance numbers for these models in practical applications.

Benchmark Details

The benchmark tests four VLA models on bin-to-bin order picking, one of the most common warehouse operations:

  • OpenPI/pi0.5
  • GR00T
  • ACT
  • SmolVLA

Every model runs on the same hardware, a Franka FR3 robot with a Robotiq 2F-85 gripper (the DROID setup), and handles the same objects. The results come from hundreds of blind runs, in which the operator doesn't know which model is running.
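The article does not say how the blinding was implemented. As a minimal sketch of one way to do it, the Python below shuffles model assignments and shows the operator only opaque run IDs. The function name, the run ID format, and the runs-per-model count are all hypothetical and not taken from PhAIL.

```python
import random

# Model names as listed in the article.
MODELS = ["OpenPI/pi0.5", "GR00T", "ACT", "SmolVLA"]


def make_blind_schedule(models, runs_per_model, seed=None):
    """Return (operator_view, key) for a blinded run schedule.

    operator_view: opaque run IDs in execution order (all the operator sees).
    key: run ID -> model name, kept hidden until the runs are scored.
    This is an illustrative assumption, not PhAIL's actual procedure.
    """
    rng = random.Random(seed)
    # Give every model the same number of runs, then randomize the order.
    assignments = [m for m in models for _ in range(runs_per_model)]
    rng.shuffle(assignments)
    key = {f"run-{i:04d}": model for i, model in enumerate(assignments, start=1)}
    operator_view = list(key)
    return operator_view, key


if __name__ == "__main__":
    operator_view, key = make_blind_schedule(MODELS, runs_per_model=3, seed=42)
    print(operator_view[:5])  # the operator sees IDs only, never model names
```

Keeping the ID-to-model key away from the operator until scoring is what makes the runs blind; the shuffle also prevents one model from always running first or last.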

Performance Results

The best model reached roughly a fifth of the throughput of a human teleoperating the same robot, and about a twentieth of a human working by hand:

  • Best model performance: 64 units per hour (UPH)
  • Human teleoperating the same robot: 330 UPH
  • Human performing the task by hand: 1,300+ UPH
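To put these figures side by side, the short Python sketch below converts them into ratios and average seconds per unit. The three UPH values come from the list above; the helper names are illustrative, and 1,300 is treated as a lower bound because the article reports "1,300+".

```python
# Throughput figures reported in the article, in units per hour (UPH).
BEST_MODEL_UPH = 64
TELEOP_UPH = 330
BY_HAND_UPH = 1300  # reported as "1,300+", so this is a lower bound


def fraction_of(baseline_uph, model_uph=BEST_MODEL_UPH):
    """Return model throughput as a fraction of a baseline throughput."""
    return model_uph / baseline_uph


def seconds_per_unit(uph):
    """Convert units per hour into average seconds per unit."""
    return 3600 / uph


if __name__ == "__main__":
    print(f"Best model vs teleoperation: {fraction_of(TELEOP_UPH):.1%}")
    print(f"Best model vs by hand:       {fraction_of(BY_HAND_UPH):.1%} (at most)")
    for label, uph in [("best model", BEST_MODEL_UPH),
                       ("teleoperation", TELEOP_UPH),
                       ("by hand", BY_HAND_UPH)]:
        print(f"{label}: {seconds_per_unit(uph):.1f} s per unit")
```

At 64 UPH the best model averages a little over 56 seconds per unit, against about 11 seconds for teleoperation and under 3 seconds by hand.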

Open Data and Methodology

Everything from the benchmark is publicly available:

  • Every run with synced video and telemetry data
  • The fine-tuning dataset used for training
  • Training scripts
  • An open leaderboard accepting new submissions

The creator is available to answer questions about methodology, the specific models tested, or observations from the benchmark runs.

📖 Read the full source: HN AI Agents
