Iván Hernández Dalas: PhAIL ranks top robotics foundation models on real hardware


Positronic Robotics evaluated four VLA models on bin-to-bin order picking. | Credit: Positronic Robotics

Positronic Robotics, which said it helps developers make robots work with artificial intelligence, has launched its “Physical AI Leaderboard,” or PhAIL. It is an ongoing benchmark that evaluates robotics foundation models on commercial tasks.

Founded in September 2025, Positronic said it has developed open-source infrastructure to standardize and scale physical AI by bridging the gap between research foundation models and real-world robotic production. The Springfield, Mo.-based company’s system uses a unified Python toolkit for the entire robotics lifecycle and the PhAIL benchmark.

PhAIL evaluates models on physical robotic setups performing commercially relevant operations. Positronic Robotics has started with bin-to-bin order picking — one of the most common tasks in logistics and industrial automation. In this task, items are transferred one at a time from an inbound container to an outbound container.

The current evaluation rig uses a Franka Research 3 robotic arm paired with a Robotiq 2F-85 gripper in DROID-style configuration, a widely used and reproducible research platform.

PhAIL measures throughput and reliability

Physical AI has advanced rapidly in recent years, with foundation models capable of handling increasingly diverse manipulation tasks. But most benchmarks still rely on simulation or controlled laboratory conditions, and many public evaluations emphasize curated demonstration videos rather than sustained operation. For industrial deployment, two variables dominate: throughput and reliability.

PhAIL measures both directly. Each run is executed on real hardware, not in simulation. Model checkpoints are selected randomly and evaluated in blinded conditions. Every run is logged and published with synchronized video, robot telemetry, station metadata, and scoring artifacts.
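As a rough illustration of what random, blinded checkpoint selection can look like, here is a minimal Python sketch. It is not PhAIL's actual procedure: the function name `blinded_schedule`, the opaque `model-NN` labels, the seed, and the checkpoint names are all assumptions made for the example.

```python
import random


def blinded_schedule(checkpoints, runs_per_checkpoint, seed):
    """Return (schedule, key) for a blinded evaluation session.

    schedule: shuffled run order that uses opaque labels only.
    key: label-to-checkpoint mapping, kept away from the operator.
    Hypothetical helper, not part of any published PhAIL tooling.
    """
    rng = random.Random(seed)
    names = list(checkpoints)
    rng.shuffle(names)  # randomize which checkpoint gets which label
    key = {f"model-{i:02d}": name for i, name in enumerate(names)}
    schedule = [label for label in key for _ in range(runs_per_checkpoint)]
    rng.shuffle(schedule)  # randomize the order in which runs are executed
    return schedule, key


# Placeholder checkpoint names, chosen only for the example.
schedule, key = blinded_schedule(["ckpt_a", "ckpt_b", "ckpt_c", "ckpt_d"], 3, seed=0)
print(schedule)
```

The operator sees only the labels in `schedule`. The `key` is consulted after scoring, so knowing which model is running cannot influence an assist or a scoring decision.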

From these runs, PhAIL computes units per hour (UPH), and mean time between failures or assists (MTBF/A) – the same metrics an operations manager would use to evaluate a deployment, rather than an academic “success rate.” The protocol is fully documented in the PhAIL white paper.
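To make the two metrics concrete, the sketch below computes them from a pair of run records. The exact formulas are defined in the PhAIL white paper, so treat this as one plausible reading: UPH as total units moved per operating hour, and MTBF/A as operating time divided by the number of failures plus assists. The `Run` record layout and the sample numbers are invented for illustration and are not PhAIL results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One evaluation run (hypothetical layout, not PhAIL's schema)."""
    duration_s: float  # operating time in seconds
    units_moved: int   # items transferred from inbound to outbound bin
    failures: int      # failures recorded during the run
    assists: int       # human interventions recorded during the run


def units_per_hour(runs):
    """Total units moved divided by total operating hours."""
    hours = sum(r.duration_s for r in runs) / 3600.0
    if hours <= 0:
        raise ValueError("no operating time recorded")
    return sum(r.units_moved for r in runs) / hours


def mtbf_a_minutes(runs):
    """Operating minutes per failure-or-assist event; inf if there were none."""
    events = sum(r.failures + r.assists for r in runs)
    minutes = sum(r.duration_s for r in runs) / 60.0
    return float("inf") if events == 0 else minutes / events


# Made-up sample data: two 30-minute runs.
runs = [
    Run(duration_s=1800, units_moved=60, failures=1, assists=2),
    Run(duration_s=1800, units_moved=40, failures=0, assists=1),
]
print(units_per_hour(runs))  # 100 units over 1 hour -> 100.0
print(mtbf_a_minutes(runs))  # 60 minutes over 4 events -> 15.0
```

Pooling time and events across runs, rather than averaging per-run rates, keeps short runs from skewing the result.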

The Physical AI Leaderboard itself is hardware-agnostic. Positronic Robotics said it plans to add robotic embodiments in Q2 2026 to reflect the diversity of real-world deployments. Bin-to-bin picking is only the starting point, it said. The benchmark’s goal is to measure how well AI models perform on repetitive, economically important operations that occur thousands of times per day in real facilities.

“We all dream about a robot that folds our laundry – but that’s a task that happens once a day. In factories and logistics, the same operation runs hundreds of times per shift, and most of those still aren’t solved,” said Sergey Arkhangelskiy, founder of Positronic Robotics. “Physical AI needs to prove itself there first, and PhAIL is how we measure whether it can.”

Positronic Robotics evaluates models

In the inaugural evaluations, four models were fine-tuned and tested: OpenPI 0.5 from Physical Intelligence, GR00T from NVIDIA, SmolVLA from Hugging Face/LeRobot, and ACT from LeRobot. Teleoperated and human baselines were tested as well. The results show a measurable gap between current foundation models and human-level performance in both throughput and reliability on commercial picking tasks.

Positronic Robotics described these initial results as calibration: a transparent baseline that allows progress to be measured consistently over time. As new models are released, they can be evaluated under the same protocol, creating a continuous, comparable record of performance, it said.

The company asserted that PhAIL targets three structural issues in the physical AI ecosystem:

  • Lack of objective measurement of commercial readiness. Most public metrics do not reflect factory-floor constraints.
  • Unclear return-on-investment (ROI) signals for operators. Success rates do not translate directly into deployment decisions.
  • A broken feedback loop for model builders. Without standardized, auditable benchmarks, it is difficult to iterate toward real-world reliability.

By publishing synchronized video, logs, firmware versions, hardware configuration, and scoring artifacts for every run, PhAIL emphasizes auditability and reproducibility, said Positronic Robotics.

The company launched PhAIL as a governed consortium rather than as a proprietary product. Nebius, which provides an AI cloud foundation for the robotics lifecycle, has joined as a founding consortium partner. Toloka participates as a data partner supporting evaluation processes. Positronic Robotics noted that the benchmark is intended as a shared industry yardstick, not as a competitive marketing vehicle.

“Scaling physical AI requires a clear, shared standard for production readiness,” said Evan Helda, head of physical AI at Nebius. “With no established blueprint for deploying these systems at scale, the PhAIL Leaderboard delivers an important benchmark grounded in real-world performance data—bringing greater transparency to what’s ready for deployment.”

“Nebius is committed to accelerating physical AI development across the ecosystem,” he added. “Through our participation in the PhAIL consortium, we’re proud to help advance the next phase of commercial robotics alongside industry partners.”

The PhAIL dataset and fine-tuning scripts are publicly available. Model builders can fine-tune their systems and submit checkpoints for evaluation. Hardware vendors can validate model performance across embodiments. Operators can review published artifacts directly.


Catch the latest in physical AI at the Robotics Summit & Expo

Registration is now open for the Robotics Summit & Expo, the world’s leading technical event for commercial robotics developers. The event is produced by The Robot Report and WTWH Media.

The show will have more than 50 sessions in tracks on artificial intelligence, design and development, enabling technologies, healthcare, and logistics. The Engineering Theater on the show floor will also feature presentations by industry experts.

More than 70 speakers are confirmed from companies such as AWS, Brain Corp, Fictiv, Harmonic Drive, maxon, PickNik Robotics, RealSense, the Robotics and AI Institute, Robust AI, Tesla, Toyota Research Institute, and more.

The Robotics Summit will also feature a number of networking opportunities. They include a Mix & Mingle Networking Reception after the first day of the show and the ticketed RBR50 Awards Dinner.

The Robotics Summit & Expo is co-located with DeviceTalks Boston, which focuses on medical devices.



The post PhAIL ranks top robotics foundation models on real hardware appeared first on The Robot Report.


