PLEX: Making the Most of the Available Data for Robotic Manipulation Pretraining
Garrett Thomas, Ching-An Cheng, Ricky Loynd, Felipe Vieira Frujeri, Vibhav Vineet, Mihai Jalobeanu, Andrey Kolobov

TL;DR
PLEX is a transformer-based architecture that learns robotic manipulation from a small amount of task-agnostic visuomotor trajectories and a much larger amount of task-conditioned manipulation videos; the videos teach it to plan in the latent feature space induced by the trajectories.
Contribution
The paper introduces PLEX, a novel architecture that combines small amounts of visuomotor trajectories with large-scale video data for effective robotic manipulation learning.
Findings
PLEX achieves state-of-the-art performance in Robosuite environments.
Relative positional encoding improves learning in low-data regimes.
PLEX generalizes well to unseen tasks in Meta-World.
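As a rough illustration of the relative positional encoding mentioned in the findings, the sketch below adds a bias to each attention score that depends only on the offset between query and key positions, rather than on their absolute positions. This is a generic additive-bias variant written in NumPy, not necessarily the scheme PLEX uses; the function name `relative_attention` and the `rel_bias` parameter are assumptions made for this example.

```python
import numpy as np

def relative_attention(q, k, v, rel_bias):
    """Single-head self-attention with one bias per relative offset.

    Hypothetical helper for illustration only (not PLEX's implementation).
    q, k, v:  arrays of shape (T, d).
    rel_bias: array of shape (2*T - 1,); entry [j - i + T - 1] is the bias
              applied when query position i attends to key position j.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                            # (T, T) content scores
    offsets = np.arange(T)[None, :] - np.arange(T)[:, None]  # offsets[i, j] = j - i
    scores = scores + rel_bias[offsets + T - 1]              # look up bias by offset
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage: random inputs for a sequence of 4 steps with 8 features each.
rng = np.random.default_rng(0)
T, d = 4, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
rel_bias = rng.normal(size=2 * T - 1)   # would be learned in a real model
out, weights = relative_attention(q, k, v, rel_bias)
```

Because the bias is indexed by offset alone, the same parameters apply wherever a pattern occurs in the sequence, which is one plausible reason such encodings help when training data is scarce.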
Abstract
A rich representation is key to general robotic manipulation, but existing approaches to representation learning require large amounts of multimodal demonstrations. In this work we propose PLEX, a transformer-based architecture that learns from a small amount of task-agnostic visuomotor trajectories and a much larger amount of task-conditioned object manipulation videos -- a type of data available in quantity. PLEX uses visuomotor trajectories to induce a latent feature space and to learn task-agnostic manipulation routines, while diverse video-only demonstrations teach PLEX how to plan in the induced latent feature space for a wide variety of tasks. Experiments showcase PLEX's generalization on Meta-World and SOTA performance in challenging Robosuite environments. In particular, using relative positional encoding in PLEX's transformers greatly helps in low-data regimes of learning from…
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications
