Boston Dynamics and TRI use large behavior models to train Atlas humanoid
To be useful, humanoid robots will need to be competent at many tasks, according to Boston Dynamics. They must be able to manipulate a diverse range of objects, from small and delicate to large and heavy. At the same time, they will need to coordinate their entire bodies to reconfigure themselves and their environments, avoid obstacles, and maintain balance while responding to surprises.
Boston Dynamics said it believes that building AI generalist robots is the most viable path to creating these competencies and achieving automation at scale with humanoids. The company yesterday shared some of its progress on developing large behavior models (LBMs) for its Atlas humanoid.
This work is part of a collaboration between the AI research teams at Toyota Research Institute (TRI) and Boston Dynamics. The companies said they have been building “end-to-end language-conditioned policies that enable Atlas to accomplish long-horizon manipulation tasks.”
These policies take full advantage of the capabilities of the humanoid form factor, claimed Boston Dynamics. This includes taking steps, precisely positioning its feet, crouching, shifting its center of mass, and avoiding self-collisions, all of which it said are critical to solving realistic mobile manipulation tasks.
“This work provides a glimpse into how we’re thinking about building general-purpose robots that will transform how we live and work,” said Scott Kuindersma, vice president of robotics research at Boston Dynamics. “Training a single neural network to perform many long-horizon manipulation tasks will lead to better generalization, and highly capable robots like Atlas present the fewest barriers to data collection for tasks requiring whole-body precision, dexterity, and strength.”
Boston Dynamics lays building blocks for creating policies
Boston Dynamics said its process for building policies includes four basic steps:
- Collect embodied behavior data using teleoperation on both the real robot hardware and in simulation.
- Process, annotate, and curate data to incorporate into a machine learning (ML) pipeline.
- Train a neural network policy using all of the data across all tasks.
- Evaluate the policy using a test suite of tasks.
The company said the results of Step 4 guide its decision-making about what additional data to collect and what network architecture or inference strategies could lead to improved performance.
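The four-step process can be sketched as a simple improvement loop. This is an illustrative sketch only; every function below is a stand-in stub, not a Boston Dynamics API, and the 0.9 success threshold is an assumed placeholder.

```python
import random

def collect_teleop_data(task):          # step 1: teleoperated demonstrations (real + sim)
    return {"task": task, "frames": [random.random() for _ in range(5)]}

def curate(episode):                    # step 2: process, annotate, and filter data
    episode["curated"] = True
    return episode

def train_policy(dataset):              # step 3: one policy trained on all tasks
    return {"num_episodes": len(dataset)}

def evaluate(policy, task):             # step 4: scripted test-suite evaluation
    return random.random()

def improvement_loop(tasks, rounds=3):
    dataset, policy = [], None
    for _ in range(rounds):
        episodes = [collect_teleop_data(t) for t in tasks]
        dataset.extend(curate(ep) for ep in episodes)
        policy = train_policy(dataset)
        scores = {t: evaluate(policy, t) for t in tasks}
        # evaluation results guide which tasks need more data next round
        tasks = [t for t, s in scores.items() if s < 0.9] or tasks
    return policy

policy = improvement_loop(["fold Spot leg", "clear bin"])
```

The point of the loop is that evaluation, not intuition, decides where the next batch of data collection effort goes.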
In implementing this process, Boston Dynamics said it followed three core principles:
Maximizing task coverage
Humanoid robots could tackle a tremendous breadth of manipulation tasks, predicted Boston Dynamics. However, collecting data beyond stationary manipulation tasks while preserving high-quality, responsive motion is challenging.
The company built a teleoperation system that combines Atlas’ model predictive controller (MPC) with a custom virtual reality (VR) interface to cover tasks ranging from finger-level dexterity to whole-body reaching and locomotion.

Boston Dynamics’ policy maps inputs consisting of images, proprioception, and language prompts to actions that control the full Atlas robot at 30Hz. It uses a diffusion transformer together with a flow matching loss to train its model. | Source: Boston Dynamics
Training generalist policies
“The field is steadily accumulating evidence that policies trained on a large corpus of diverse task data can generalize and recover better than specialist policies that are trained to solve one or a small number of tasks,” said Boston Dynamics.
The Waltham, Mass.-based company uses multi-task, language-conditioned policies to accomplish diverse tasks on multiple embodiments. These policies incorporate pretraining data from Atlas, the upper body-only Atlas Manipulation Test Stand (MTS), and TRI Ramen data.
Boston Dynamics added that building general policies enables it to simplify deployment, share policy improvements across tasks and embodiments, and move closer to unlocking emergent behaviors.
Building infrastructure to support fast iteration and rigorous science
“Being able to quickly iterate on design choices is critical, but actually measuring with confidence when one policy is better or worse than another is the key ingredient to making steady progress,” Boston Dynamics asserted.
Through the combination of simulation, hardware tests, and ML infrastructure built for production scale, the company said it has efficiently explored the data and policy design space while continuously improving on-robot performance.
“One of the main value propositions of humanoids is that they can achieve a huge variety of tasks directly in existing environments, but the previous approaches to programming these tasks simply could not scale to meet this challenge,” said Russ Tedrake, senior vice president of LBMs at TRI. “Large behavior models address this opportunity in a fundamentally new way – skills are added quickly via demonstrations from humans, and as the LBMs get stronger, they require less and less demonstrations to achieve more and more robust behaviors.”
The long road to end-to-end manipulation
The “Spot Workshop” task demonstrated coordinated locomotion—stepping, setting a wide stance, and squatting, said Boston Dynamics. It also showed dexterous manipulation, including part picking, regrasping, articulating, placing, and sliding. The demo consisted of three subtasks:
- Grasping quadruped Spot legs from the cart, folding them, and placing them on a shelf.
- Grasping face plates from the cart, then pulling out a bin on the bottom shelf, and putting the face plates in the bin.
- Once the cart is fully cleared, turning to the blue bin behind the robot, clearing it of all remaining Spot parts, and placing handfuls of them in the blue tilt truck.
Boston Dynamics said a key feature was for its policies to react intelligently when things went wrong, such as a part falling on the ground or the bin lid closing. The initial versions of its policies didn’t have these capabilities.
By showing the robot examples of recovering from such disturbances and retraining its network, the company said it can quickly deploy new reactive policies with no algorithmic or engineering changes. This is because the policies learn to estimate the state of the world from the robot's sensors and react accordingly, purely from the experiences observed in training.
“As a result, programming new manipulation behaviors no longer requires an advanced degree and years of experience, which creates a compelling opportunity to scale up behavior development for Atlas,” said Boston Dynamics.
Boston Dynamics adds manipulation capabilities
Boston Dynamics said it has studied dozens of tasks for both benchmarking and pushing the boundaries of manipulation. With a single language-conditioned policy on Atlas MTS, the company said Atlas can perform simple pick and place tasks as well as more complex ones such as tying a rope, flipping a barstool, unfurling and spreading a tablecloth, and manipulating a 22 lb. (9.9 kg) car tire.
These tasks would be extremely difficult to perform with traditional robot programming techniques due to deformable geometry and complex manipulation sequences, Boston Dynamics said. But with LBMs, the training process is the same whether Atlas is stacking rigid blocks or folding a T-shirt. “If you can demonstrate it, the robot can learn it,” it said.
Boston Dynamics noted that its policies can speed up execution at inference time without requiring any training-time changes. Since the policies predict a trajectory of future actions along with the times at which those actions should be taken, the company can adjust this timing to control execution speed.
Generally, the company said it can speed up policies by 1.5x to 2x without significantly affecting policy performance on both the MTS and full Atlas platforms. While the task dynamics can sometimes preclude this kind of inference-time speedup, Boston Dynamics said it suggests that, in some cases, the robot can exceed the speed limits of human teleoperation.
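The retiming idea described above can be sketched in a few lines: the policy's output is treated as (timestamp, action) pairs, and execution speed changes by compressing the schedule, with no retraining. The `retime` helper and the action labels are illustrative assumptions, not the company's implementation.

```python
def retime(action_chunk, speedup):
    """Scale the schedule of a predicted action chunk by `speedup`."""
    return [(t / speedup, action) for t, action in action_chunk]

# A chunk of 48 actions predicted at 30 Hz spans 1.6 s at 1x speed.
chunk = [(i / 30.0, f"action_{i}") for i in range(48)]

# Same actions, executed in half the time (2x speedup).
fast = retime(chunk, speedup=2.0)
```

Since only the timestamps change, the robot can in principle run a demonstrated motion faster than the human who teleoperated it, as long as the task dynamics allow it.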
Teleoperation enables high-quality data collection
Atlas contains 78 degrees of freedom (DoF) that provide a wide range of motion and a high degree of dexterity. The Atlas MTS has 29 DoF to explore pure manipulation tasks. The grippers each have 7 DoF that enable the robot to use a wide range of grasping strategies, such as power grasps or pinch grasps.
Boston Dynamics relies on a pair of HDR stereo cameras mounted in the head to provide both situational awareness for teleoperation and visual input for its policies.
Controlling the robot in a fluid, dynamic, and dexterous manner is crucial, said the company, which has invested heavily in its teleoperation system to address these needs. It is built on Boston Dynamics’ MPC system, which it previously used to demonstrate Atlas conducting parkour, dance, and both practical and impractical manipulation.
This control system allows the company to perform precise manipulation while maintaining balance and avoiding self-collisions, enabling it to push the boundaries of what it can do with the Atlas hardware.
The remote operator wears a VR headset to be fully immersed in the robot’s workspace and have access to the same information as the policy. Spatial awareness is bolstered by a stereoscopic view rendered using Atlas’ head-mounted cameras reprojected to the user’s viewpoint, said Boston Dynamics.
Custom VR software provides teleoperators with a rich interface to command the robot, providing them with real-time feeds of the robot’s state, control targets, sensor readings, tactile feedback, and system state via augmented reality, controller haptics, and heads-up display elements. Boston Dynamics said this enables teleoperators to make full use of the robot hardware, synchronizing their body and senses with the robot.
Boston Dynamics upgrades VR setup for manipulation
The initial version of the VR teleoperation application used the headset, base stations, controllers, and one tracker for the chest to control Atlas while standing still. This system employed a one-to-one mapping between the user and the robot (i.e., moving your hand 1 cm would cause the robot to also move by 1 cm), which yields an intuitive control experience, especially for bi-manual tasks.
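The one-to-one mapping can be illustrated with a minimal sketch in which operator hand-motion deltas are applied directly to the robot's hand targets. Poses are simplified to 3-D positions here; the function name and scale parameter are illustrative assumptions.

```python
def one_to_one_target(robot_hand, operator_prev, operator_now, scale=1.0):
    """scale=1.0 means moving the hand 1 cm moves the robot hand 1 cm."""
    delta = [(n - p) * scale for n, p in zip(operator_now, operator_prev)]
    return [r + d for r, d in zip(robot_hand, delta)]

# Operator moves 1 cm along x; the robot hand target moves the same 1 cm.
target = one_to_one_target(
    robot_hand=[0.5, 0.0, 1.0],
    operator_prev=[0.0, 0.0, 0.0],
    operator_now=[0.01, 0.0, 0.0],
)
```

A unity scale keeps the mapping intuitive for bimanual tasks because the operator's proprioception matches what the robot does.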
With this version, the operator was already able to perform a wide range of tasks, such as crouching down low to reach an object on the ground and also standing tall to reach a high shelf. However, one limitation of this system is that it didn’t allow the operator to dynamically reposition the feet and take steps, which significantly limited the tasks it could perform.
To support mobile manipulation, Boston Dynamics incorporated two additional trackers for 1-to-1 tracking on the feet and extended the teleoperation control such that Atlas’ stance mode, support polygon, and stepping intent matched those of the operator. In addition to supporting locomotion, the company said this setup allowed it to take full advantage of Atlas’ workspace.
For instance, when opening a blue tote on the ground and picking items from inside, the human must be able to configure the robot with a wide stance and bent knees to reach the objects in the bin without colliding with the bin.
Boston Dynamics’ neural network policies use the same control interface to the robot as the teleoperation system, which made it easy to reuse model architectures it had developed for policies that didn’t involve locomotion. Now, it can simply augment the action representation.
TRI LBMs enable Boston Dynamics’ policy
TRI’s LBMs received a 2024 RBR50 Robotics Innovation Award. Boston Dynamics said it builds on them to scale diffusion policy-like architectures, using a 450 million-parameter diffusion transformer architecture with a flow-matching objective.
The policy is conditioned on proprioception and images, and accepts a language prompt that specifies the robot’s objective. Image data comes in at 30 Hz, and the network uses a history of observations to predict an action chunk of length 48 (corresponding to 1.6 seconds), of which generally 24 actions (0.8 seconds when running at 1x speed) are executed each time policy inference is run.
The policy’s observation space for Atlas consists of the images from the robot’s head-mounted cameras along with proprioception. The action space includes the joint positions for the left and right grippers, neck yaw, torso pose, left and right hand pose, and the left and right foot poses.
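The chunked execution scheme described above is a form of receding-horizon control: each inference predicts 48 actions (1.6 s at 30 Hz), the first 24 (0.8 s) are executed, and then the policy re-plans. The `infer` stub below is an illustrative stand-in for the policy.

```python
CHUNK, EXECUTE, HZ = 48, 24, 30   # chunk = 1.6 s, executed slice = 0.8 s

def infer(step):
    # stand-in for the policy: predicts 48 future actions starting at `step`
    return list(range(step, step + CHUNK))

def run(total_steps):
    executed, step = [], 0
    while step < total_steps:
        chunk = infer(step)
        executed.extend(chunk[:EXECUTE])   # execute 0.8 s worth of actions
        step += EXECUTE                    # then run inference again
    return executed

actions = run(total_steps=96)   # 96 steps = 3.2 s of control = 4 inferences
```

Predicting further ahead than it executes gives the policy overlap between successive chunks, which helps keep the motion smooth across re-planning boundaries.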
Atlas MTS is identical to the upper body of Atlas, both mechanically and in software. The observation and action spaces are the same as for Atlas, simply with the torso and lower-body components omitted. This shared hardware and software across Atlas and Atlas MTS allows Boston Dynamics to pool data from both embodiments for training.
These policies were trained on data that the team continuously collected and iterated upon, where high-quality demonstrations were a critical part of getting successful policies. Boston Dynamics heavily relied upon its quality assurance tooling, which allowed it to review, filter, and provide feedback on the data collected.
Boston Dynamics quickly iterates with simulation
Boston Dynamics said simulation is a critical tool that allows it to quickly iterate on the teleoperation system and to write unit and integration tests that ensure the company can move forward without breakages. It also enables the company to perform informative training and evaluations that would otherwise be slower, more expensive, and difficult to perform repeatably on hardware.
Because Boston Dynamics’ simulation stack is a faithful representation of the hardware and on-robot software stack, the company is able to share its data pipeline, visualization tools, training code, VR software, and interfaces across both simulation and hardware platforms.
In addition to using simulation to benchmark its policy and architecture choices, Boston Dynamics also uses it as a significant co-training data source for its multi-task and multi-embodiment policies that it deploys on the hardware.
What are the next steps for Atlas?
So far, Boston Dynamics has shown that it can train multi-task language-conditioned policies that can control Atlas to accomplish long-horizon tasks that involve both locomotion and dexterous whole-body manipulation. The company said its data-driven approach is general and can be used for practically any downstream task that can be demonstrated via teleoperation.
While Boston Dynamics said it is encouraged by the results so far, it acknowledged that there is still much work to be done. With its established baseline of tasks and performance, the company said it plans to focus on scaling its “data flywheel” to increase throughput, quality, task diversity, and difficulty while also exploring new algorithmic ideas.
The company wrote in a blog post that it is continuing research in several directions, including performance-related robotics topics such as gripper force control with tactile feedback and fast dynamic manipulation. It is also looking at incorporating diverse data sources, including cross-embodiment and egocentric human data.
Finally, Boston Dynamics said it is interested in reinforcement learning (RL) improvement of vision-language-action models (VLAs), as well as in deploying vision-language model (VLM) and VLA architectures to enable more complex long-horizon tasks and open-ended reasoning.
Learn about the latest in AI at RoboBusiness
This year’s RoboBusiness, which will be on Oct. 15 and 16 in Santa Clara, Calif., will feature the Physical AI Forum. This track will feature talks about a range of topics, including conversations around safety and AI, simulation-to-reality reinforcement training, data curation, deploying AI-powered robots, and more.
Attendees can hear from experts from Dexterity, ABB Robotics, UC Berkeley, Roboto, GrayMatter Robotics, Diligent Robotics, and Dexman AI. In addition, the show will start with a keynote from Deepu Talla, the vice president of robotics and edge AI at NVIDIA, on how physical AI is ushering in a new era of robotics.
RoboBusiness is the premier event for developers and suppliers of commercial robots. The event is produced by WTWH Media, which also produces The Robot Report, Automated Warehouse, and the Robotics Summit & Expo.
This year’s conference will include more than 60 speakers, a track on humanoids, a startup workshop, the annual Pitchfire competition, and numerous networking opportunities. Over 100 exhibitors on the show floor will showcase their latest enabling technologies, products, and services to help solve your robotics development challenges.
Registration is now open for RoboBusiness 2025.
The post Boston Dynamics and TRI use large behavior models to train Atlas humanoid appeared first on The Robot Report.