RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning

1University of California, Berkeley, 2Toyota Research Institute, 3Physical Intelligence
* Equal Contribution
CoRL 2024 (Oral)
[Teaser figure]


We study cross-embodiment transfer for vision-based policies without known camera poses or robot configurations. Given robot images (left), RoVi-Aug uses state-of-the-art diffusion models to augment the data, generating synthetic images with different robots and viewpoints (right). Policies trained on the augmented data can be deployed zero-shot on a different robot with significantly different camera angles, and multi-robot, multi-task learning enables transfer between robots and skills.


RoVi-Aug Pipeline

[Pipeline figure]
Overview of the RoVi-Aug pipeline. Given an input robot image, we first segment out the robot using a finetuned SAM model, then use a ControlNet to transform it into another robot. After pasting the synthetic robot back onto the original background, we use ZeroNVS to generate novel views.
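To make the pipeline concrete, below is a minimal Python sketch of one augmentation pass. The three model wrappers (segment_robot, generate_target_robot, generate_novel_view) are hypothetical placeholders standing in for the finetuned SAM, the robot-to-robot ControlNet, and ZeroNVS; the actual interfaces live in the paper's code, not on this page.

import numpy as np

def rovi_aug(image: np.ndarray,
             segment_robot,          # hypothetical wrapper: finetuned SAM
             generate_target_robot,  # hypothetical wrapper: robot-to-robot ControlNet
             generate_novel_view):   # hypothetical wrapper: ZeroNVS
    """One RoVi-Aug pass over a single (H, W, 3) robot image."""
    # 1. Robot segmentation: boolean (H, W) mask, True on source-robot pixels.
    mask = segment_robot(image)

    # 2. Robot-to-robot generation: the target robot rendered in the same
    #    pose, with the same (H, W, 3) shape as the input.
    target_robot = generate_target_robot(image, mask)

    # 3. Composite: paste the synthetic robot back onto the original background.
    composite = np.where(mask[..., None], target_robot, image)

    # 4. Viewpoint augmentation: re-render the composite from a sampled camera.
    return generate_novel_view(composite)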

RoVi-Aug Examples

[Augmentation examples figure]
Example images of applying robot augmentation (middle) and viewpoint augmentation (right) to input robot images (left).


Physical Experiments

We design experiments to answer the following research questions: (1) Can robot augmentation (Ro-Aug) effectively bridge the visual gap between robots? (2) Can viewpoint augmentation (Vi-Aug) improve policy robustness to camera pose changes? (3) Can policies trained with RoVi-Aug be deployed zero-shot on a different robot under camera pose changes? (4) Does RoVi-Aug enable multi-robot, multi-task training and better facilitate transfer between robots and skills?

[Figure 3: evaluation tasks]
Tasks used for evaluation. For each task, on the left is an example training view and robot, and on the right is the different test-time embodiment.

To answer the first three questions, we study policy transfer between a Franka and a UR5 robot on 5 tasks (Fig. 3): (1) Open a drawer (Open Drawer), (2) Pick up a toy tiger from the table and put it into a bowl (Place Tiger), (3) Stack cups (Stack Cups), (4) Sweep a cloth from right to left (Sweep Cloth), and (5) Transport a toy tiger between two bowls (Transport Tiger).


Results for the Same Camera Pose

We first investigate zero-shot transfer between the Franka and the UR5 given the same camera pose. Performance results and example rollout videos are shown below.

[Results figure]
Zero-shot physical experiments evaluating robot augmentation. We evaluate Ro-Aug on 5 tasks in 2 settings with 10 trials each: training a policy on Franka demonstration data and evaluating it on a UR5, and vice versa. The camera poses are the same across source and target robots.

Place Tiger

Source Demonstration (Franka robot)

Target Result (UR5 robot)

Open Drawer

Source Demonstration (Franka robot)

Target Result (UR5 robot)

Stack Cups

Source Demonstration (Franka robot)

Target Result (UR5 robot)

Results for Different Camera Poses

We then conduct experiments on zero-shot transfer when the camera poses differ between the source and target robots.

[Results figure]
The translation and rotation values show the difference in camera poses between the source and target robots. Mirage uses a policy trained only on the source robot, with a test-time cross-painting procedure and depth reprojection to account for camera pose changes. Ro-Aug applies only robot augmentation, while RoVi-Aug applies both robot and viewpoint augmentation. For both Ro-Aug and RoVi-Aug, the policy is trained on the augmented data and deployed on the target robot zero-shot.
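For intuition about the magnitude of these camera shifts, here is a minimal sketch of sampling a bounded pose perturbation. The parameterization (random axis-angle rotation plus random translation direction) is our illustrative assumption; RoVi-Aug itself renders novel views with ZeroNVS rather than sampling poses this way.

import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)

def perturb_camera_pose(cam_to_world: np.ndarray,
                        max_trans_m: float = 0.10,
                        max_rot_deg: float = 20.0) -> np.ndarray:
    """Apply a bounded random rigid transform to a 4x4 camera-to-world pose."""
    delta = np.eye(4)
    # Rotation: uniformly random axis, angle up to max_rot_deg.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle_rad = np.deg2rad(rng.uniform(0.0, max_rot_deg))
    delta[:3, :3] = R.from_rotvec(angle_rad * axis).as_matrix()
    # Translation: random direction, magnitude up to max_trans_m.
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    delta[:3, 3] = rng.uniform(0.0, max_trans_m) * direction
    return cam_to_world @ delta

# The two evaluation regimes reported below:
pose = np.eye(4)
mild   = perturb_camera_pose(pose, max_trans_m=0.10, max_rot_deg=20.0)
severe = perturb_camera_pose(pose, max_trans_m=0.25, max_rot_deg=35.0)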

Transfer from Franka to UR5 with changed camera poses

Transfer from Franka to UR5 for the open-drawer and place-tiger tasks. There is a camera pose change of 10 cm in translation and 20 degrees in rotation.

Transfer from Franka to UR5 for the open-drawer and place-tiger tasks. There is a camera pose change of 25 cm in translation and 35 degrees in rotation.

Transfer from Franka to UR5 for the open-drawer and place-tiger tasks. We move the camera during the trajectory, and the policy trained with RoVi-Aug remains robust.

Transfer from UR5 to Franka with changed camera poses

Transfer from UR5 to Franka for the sweep-cloth and transport-tiger tasks. There is a camera pose change of 10 cm in translation and 20 degrees in rotation.

Transfer from UR5 to Franka for the sweep-cloth and transport-tiger tasks. There is a camera pose change of 25 cm in translation and 35 degrees in rotation.

Transfer from UR5 to Franka for the sweep-cloth and transport-tiger tasks. We move the camera during the trajectory, and the policy trained with RoVi-Aug remains robust.

Results for Cross-skill-cross-robot and Finetuning Experiments

To evaluate the robot-skill cross-product, we combine the Place Tiger demonstration data from the Franka and the Transport Tiger demonstration data from the UR5, along with their robot-augmented UR5 and Franka versions, and train a multi-robot, multi-task diffusion policy. Table 5 shows that the policy can successfully execute both tasks on both robots. Additionally, we evaluate whether RoVi-Aug improves finetuning sample efficiency. As Table 6 shows, after training Octo on the robot-augmented OXE data, the policy has already seen synthetic versions of the target robots performing the tasks, which accelerates downstream finetuning on similar tasks.
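As a rough sketch of the data mixing for this experiment (the dataset sizes, tensor shapes, and loader settings below are illustrative assumptions, not the paper's actual configuration), the four demonstration sources can be concatenated and sampled uniformly:

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-in datasets of (image, action) pairs; real loaders would read the
# recorded and robot-augmented trajectories from disk.
def demos(n):
    return TensorDataset(torch.zeros(n, 3, 224, 224), torch.zeros(n, 7))

mixed = ConcatDataset([
    demos(100),  # Franka, Place Tiger (original demos)
    demos(100),  # UR5, Transport Tiger (original demos)
    demos(100),  # Ro-Aug: Franka Place Tiger demos repainted as UR5
    demos(100),  # Ro-Aug: UR5 Transport Tiger demos repainted as Franka
])
loader = DataLoader(mixed, batch_size=64, shuffle=True)
# The multi-robot, multi-task diffusion policy is then trained on `loader`.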

[Tables 5 and 6]
Robot-skill cross-product and Octo OXE finetuning experiment results.

BibTeX

@inproceedings{chen2024roviaug,
  title     = {RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning},
  author    = {Lawrence Yunliang Chen and Chenfeng Xu and Karthik Dharmarajan and Muhammad Zubair Irshad and Richard Cheng and Kurt Keutzer and Masayoshi Tomizuka and Quan Vuong and Ken Goldberg},
  booktitle = {Conference on Robot Learning (CoRL)},
  address   = {Munich, Germany},
  year      = {2024},
}