RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning

1University of California, Berkeley, 2Toyota Research Institute, 3Physical Intelligence
* Equal Contribution
CoRL 2024 (Oral)
[Teaser figure]


We study cross-embodiment transfer for vision-based policies without known camera poses or robot configurations. Given robot images (left), RoVi-Aug uses state-of-the-art diffusion models to augment the data, generating synthetic images with different robots and viewpoints (right). Policies trained on the augmented data can be deployed zero-shot on a different robot with significantly different camera angles, and multi-robot, multi-task learning enables transfer between robots and skills.


RoVi-Aug Pipeline

[Pipeline figure]
Overview of the RoVi-Aug pipeline. Given an input robot image, we first segment out the robot using a finetuned SAM model, then use a ControlNet to transform it into another robot. After pasting the synthetic robot back onto the original background, we use ZeroNVS to generate novel views.
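To make the pipeline concrete, below is a minimal Python sketch of one augmentation pass. The three model wrappers (segment_robot, generate_target_robot, generate_novel_view) are hypothetical placeholders standing in for the finetuned SAM, the robot-to-robot ControlNet, and ZeroNVS; the actual interfaces live in the paper's code, not on this page.

import numpy as np

def rovi_aug(image: np.ndarray,
             segment_robot,          # hypothetical wrapper: finetuned SAM
             generate_target_robot,  # hypothetical wrapper: robot-to-robot ControlNet
             generate_novel_view):   # hypothetical wrapper: ZeroNVS
    """One RoVi-Aug pass over a single (H, W, 3) robot image."""
    # 1. Robot segmentation: boolean (H, W) mask, True on source-robot pixels.
    mask = segment_robot(image)

    # 2. Robot-to-robot generation: the target robot rendered in the same
    #    pose, with the same (H, W, 3) shape as the input.
    target_robot = generate_target_robot(image, mask)

    # 3. Composite: paste the synthetic robot back onto the original background.
    composite = np.where(mask[..., None], target_robot, image)

    # 4. Viewpoint augmentation: re-render the composite from a sampled camera.
    return generate_novel_view(composite)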

RoVi-Aug Examples

[Augmentation examples figure]
Example images of applying robot augmentation (middle) and viewpoint augmentation (right) to input robot images (left).


Physical Experiments

We design experiments to answer the following research questions: (1) Can robot augmentation (Ro-Aug) effectively bridge the visual gap between robots? (2) Can viewpoint augmentation (Vi-Aug) improve policy robustness to camera pose changes? (3) Can policies trained with RoVi-Aug be deployed zero-shot on a different robot under camera pose changes? (4) Does RoVi-Aug enable multi-robot, multi-task training and better facilitate transfer between robots and skills?

[Figure 3: evaluation tasks]
Tasks used for evaluation. For each task, on the left is an example training view and robot, and on the right is the different test-time embodiment.

To answer the first three questions, we study policy transfer between a Franka and a UR5 robot on 5 tasks (Fig. 3): (1) Open a drawer (Open Drawer), (2) Pick up a toy tiger from the table and put it into a bowl (Place Tiger), (3) Stack cups (Stack Cups), (4) Sweep a cloth from right to left (Sweep Cloth), and (5) Transport a toy tiger between two bowls (Transport Tiger).


Results for the Same Camera Pose

We first investigate zero-shot transfer between the Franka and the UR5 given the same camera pose. Performance results and example rollout videos are shown below.

[Results figure]
Zero-shot physical experiments evaluating robot augmentation. We evaluate Ro-Aug on 5 tasks in 2 settings with 10 trials each: training a policy on Franka demonstration data and evaluating it on a UR5, and vice versa. The camera poses are the same across source and target robots.

Place Tiger

Source Demonstration (Franka robot)

Target Result (UR5 robot)

Open Drawer

Source Demonstration (Franka robot)

Target Result (UR5 robot)

Stack Cups

Source Demonstration (Franka robot)

Target Result (UR5 robot)

Results for Different Camera Poses

We then conduct experiments on zero-shot transfer when the camera poses differ between the source and target robots.

[Results figure]
The translation and rotation values show the difference in camera poses between the source and target robots. Mirage uses a policy trained only on the source robot, with a test-time cross-painting procedure and depth reprojection to account for camera pose changes. Ro-Aug applies only robot augmentation, while RoVi-Aug applies both robot and viewpoint augmentation. For both Ro-Aug and RoVi-Aug, the policy is trained on the augmented data and deployed on the target robot zero-shot.
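For intuition about the magnitude of these camera shifts, here is a minimal sketch of sampling a bounded pose perturbation. The parameterization (random axis-angle rotation plus random translation direction) is our illustrative assumption; RoVi-Aug itself renders novel views with ZeroNVS rather than sampling poses this way.

import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)

def perturb_camera_pose(cam_to_world: np.ndarray,
                        max_trans_m: float = 0.10,
                        max_rot_deg: float = 20.0) -> np.ndarray:
    """Apply a bounded random rigid transform to a 4x4 camera-to-world pose."""
    delta = np.eye(4)
    # Rotation: uniformly random axis, angle up to max_rot_deg.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle_rad = np.deg2rad(rng.uniform(0.0, max_rot_deg))
    delta[:3, :3] = R.from_rotvec(angle_rad * axis).as_matrix()
    # Translation: random direction, magnitude up to max_trans_m.
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    delta[:3, 3] = rng.uniform(0.0, max_trans_m) * direction
    return cam_to_world @ delta

# The two evaluation regimes reported below:
pose = np.eye(4)
mild   = perturb_camera_pose(pose, max_trans_m=0.10, max_rot_deg=20.0)
severe = perturb_camera_pose(pose, max_trans_m=0.25, max_rot_deg=35.0)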

Transfer from Franka to UR5 with changed camera poses

Transfer from Franka to UR5 for the open-drawer and place-tiger tasks. There is a camera pose change of 10 cm in translation and 20 degrees in rotation.

Transfer from Franka to UR5 for the open-drawer and place-tiger tasks. There is a camera pose change of 25 cm in translation and 35 degrees in rotation.

Transfer from Franka to UR5 for the open-drawer and place-tiger tasks. We move the camera during the trajectory, and the policy trained with RoVi-Aug remains robust.

Transfer from UR5 to Franka with changed camera poses

Transfer from UR5 to Franka for the sweep-cloth and transport-tiger tasks. There is a camera pose change of 10 cm in translation and 20 degrees in rotation.

Transfer from UR5 to Franka for the sweep-cloth and transport-tiger tasks. There is a camera pose change of 25 cm in translation and 35 degrees in rotation.

Transfer from UR5 to Franka for the sweep-cloth and transport-tiger tasks. We move the camera during the trajectory, and the policy trained with RoVi-Aug remains robust.

Results for Cross-skill-cross-robot and Finetuning Experiments

To evaluate the robot-skill cross-product, we combine the Place Tiger demonstration data from the Franka and the Transport Tiger demonstration data from the UR5, along with their robot-augmented UR5 and Franka versions, and train a multi-robot, multi-task diffusion policy. Table 5 shows that the policy can successfully execute both tasks on both robots. Additionally, we evaluate whether RoVi-Aug improves finetuning sample efficiency. As Table 6 shows, after training Octo on the robot-augmented OXE data, the policy has already seen synthetic versions of the target robots performing the tasks, which accelerates downstream finetuning on similar tasks.
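As a rough sketch of the data mixing for this experiment (the dataset sizes, tensor shapes, and loader settings below are illustrative assumptions, not the paper's actual configuration), the four demonstration sources can be concatenated and sampled uniformly:

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-in datasets of (image, action) pairs; real loaders would read the
# recorded and robot-augmented trajectories from disk.
def demos(n):
    return TensorDataset(torch.zeros(n, 3, 224, 224), torch.zeros(n, 7))

mixed = ConcatDataset([
    demos(100),  # Franka, Place Tiger (original demos)
    demos(100),  # UR5, Transport Tiger (original demos)
    demos(100),  # Ro-Aug: Franka Place Tiger demos repainted as UR5
    demos(100),  # Ro-Aug: UR5 Transport Tiger demos repainted as Franka
])
loader = DataLoader(mixed, batch_size=64, shuffle=True)
# The multi-robot, multi-task diffusion policy is then trained on `loader`.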

[Tables 5 and 6]
Robot-skill cross-product and Octo OXE finetuning experiment results.

BibTeX

@inproceedings{chen2024roviaug,
  title     = {RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning},
  author    = {Lawrence Yunliang Chen and Chenfeng Xu and Karthik Dharmarajan and Muhammad Zubair Irshad and Richard Cheng and Kurt Keutzer and Masayoshi Tomizuka and Quan Vuong and Ken Goldberg},
  booktitle = {Conference on Robot Learning (CoRL)},
  address   = {Munich, Germany},
  year      = {2024},
}