Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Mohit Shridhar, Lucas Manuelli, Dieter Fox

TL;DR
Perceiver-Actor is a multi-task transformer model that leverages 3D voxel observations and language goals to efficiently learn diverse robotic manipulation tasks from limited data.
Contribution
This paper introduces PerAct, a transformer-based framework that encodes 3D voxel observations and language instructions for multi-task robotic manipulation, and that outperforms image-to-action agents and 3D ConvNets when trained on only a few demonstrations.
Findings
Outperforms image-to-action agents and 3D ConvNets on various tasks
Learns 18 RLBench tasks and 7 real-world tasks with few demonstrations
Effectively encodes language goals and 3D observations for manipulation
Abstract
Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few…
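To make the "next best voxel" idea in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a colored point cloud (as could be obtained from RGB-D), a cubic workspace with hypothetical bounds, and a placeholder array of per-voxel scores standing in for the Transformer's output. The function names `voxelize` and `best_voxel_to_translation` are illustrative only. The sketch covers only the translation part of an action; the other components of a 6-DoF action are omitted.

```python
import numpy as np


def voxelize(points, colors, bounds_min, bounds_max, grid_size):
    """Bin a colored point cloud into a (G, G, G, 4) grid: mean RGB + occupancy."""
    points = np.asarray(points, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    bounds_min = np.asarray(bounds_min, dtype=np.float64)
    bounds_max = np.asarray(bounds_max, dtype=np.float64)
    voxel_size = (bounds_max - bounds_min) / grid_size

    # Integer voxel index of every point; drop points outside the workspace.
    idx = np.floor((points - bounds_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx, colors = idx[inside], colors[inside]

    # Accumulate colors per voxel using flattened indices.
    shape = (grid_size, grid_size, grid_size)
    flat = np.ravel_multi_index(idx.T, shape)
    rgb_sum = np.zeros((grid_size ** 3, 3))
    np.add.at(rgb_sum, flat, colors)
    counts = np.bincount(flat, minlength=grid_size ** 3)

    occupied = counts > 0
    rgb_sum[occupied] /= counts[occupied, None]
    features = np.concatenate([rgb_sum, occupied[:, None].astype(float)], axis=1)
    return features.reshape(shape + (4,))


def best_voxel_to_translation(scores, bounds_min, bounds_max):
    """Pick the highest-scoring voxel and return the 3D position of its center."""
    bounds_min = np.asarray(bounds_min, dtype=np.float64)
    bounds_max = np.asarray(bounds_max, dtype=np.float64)
    grid_size = scores.shape[0]
    voxel_size = (bounds_max - bounds_min) / grid_size
    ijk = np.array(np.unravel_index(np.argmax(scores), scores.shape))
    return bounds_min + (ijk + 0.5) * voxel_size


# Usage with made-up data: two points in a unit-cube workspace, 10^3 voxels.
pts = [[0.05, 0.05, 0.05], [0.95, 0.95, 0.95]]
rgb = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
grid = voxelize(pts, rgb, [0, 0, 0], [1, 1, 1], grid_size=10)

# Placeholder scores; in a learned agent these would come from the model.
scores = np.zeros((10, 10, 10))
scores[2, 3, 4] = 1.0
target = best_voxel_to_translation(scores, [0, 0, 0], [1, 1, 1])
```

Because the action is chosen as an index into the same grid that holds the observation, predicting where to move becomes a classification over voxels rather than a regression over continuous coordinates, which is one way to read the "structural prior" the abstract mentions.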
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Absolute Position Encodings · Dropout · Dense Connections
