Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

Abstract

Large-scale multi-task robotic manipulation systems often rely on text to specify the task. In this work, we explore whether a robot can learn by observing humans. To do so, the robot must understand a person’s intent and perform the inferred task despite differences in embodiment and environment. We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions. Our model is trained on a large dataset of paired prompt videos and robot trajectories to learn unified representations of human and robot actions. Vid2Robot uses cross-attention transformer layers between video features and the current robot state to produce actions that perform the task shown in the video. We use auxiliary contrastive losses to align the prompt and robot video representations, yielding better policies. We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos. We also demonstrate cross-object motion transfer, where the policy applies a motion observed on one object in the prompt video to a different object in the robot’s own environment.
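The core fusion mechanism the abstract describes, in which the current robot state queries the prompt-video features via cross-attention, can be sketched in simplified NumPy form. All names, shapes, and weight matrices below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(state_tokens, video_tokens, Wq, Wk, Wv):
    """One cross-attention layer: robot-state tokens act as queries,
    prompt-video frame features act as keys and values."""
    Q = state_tokens @ Wq                      # (S, d) queries from robot state
    K = video_tokens @ Wk                      # (T, d) keys from video frames
    V = video_tokens @ Wv                      # (T, d) values from video frames
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (S, T) scaled dot products
    weights = softmax(scores, axis=-1)         # attention over video frames
    return weights @ V                         # (S, d) fused representation

# Illustrative usage with random features.
rng = np.random.default_rng(0)
d = 16
state = rng.normal(size=(1, d))    # current robot state as a single token
video = rng.normal(size=(32, d))   # hypothetical 32 prompt-video frame features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(state, video, Wq, Wk, Wv)
print(fused.shape)  # (1, 16)
```

In the full model, the fused representation would then feed an action head that predicts the robot command; here we only show the state-to-video attention step.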

Publication
Robotics: Science and Systems (RSS)

Toronto Intelligent Systems Lab Co-authors

Maria Attarian
PhD Student

My research interests include robotic manipulation, concept and action grounding, and learning from third-person demonstrations for robotic applications.

Igor Gilitschenski
Assistant Professor