PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training
Rogerio Bonatti, Sai Vemprala, Shuang Ma, Felipe Frujeri, Shuhang Chen, Ashish Kapoor

TL;DR
This paper introduces PACT, a transformer-based pre-training method for robots that learns general representations from data, enabling efficient multi-task adaptation for navigation, localization, and mapping.
Contribution
The work presents a novel self-supervised pre-training approach for robots using a causal transformer, facilitating multi-task learning with improved efficiency and performance.
Findings
Pretrained PACT models outperform training from scratch on multiple tasks.
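Finetuning small task-specific networks on top of the pretrained model yields significantly better performance than training a single model from scratch.
Sharing one representation across tasks lowers overall model capacity and speeds up real-time deployment.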
Abstract
Robotics has long been a field riddled with complex systems architectures whose modules and connections, whether traditional or learning-based, require significant human expertise and prior knowledge. Inspired by large pre-trained language models, this work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot. We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion. Through autoregressive prediction of states and actions over time, our model implicitly encodes dynamics and behaviors for a particular robot. Our experimental evaluation focuses on the domain of mobile agents, where we show that this robot-specific representation can function as a single starting…
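The autoregressive setup described in the abstract can be illustrated with a minimal sketch. It assumes that states and actions are interleaved into a single token sequence (s0, a0, s1, a1, ...) and that each position is trained to predict the token that follows it while attending only to earlier positions. The helper names (`interleave`, `causal_mask`, `next_token_pairs`) and the string placeholders standing in for embedded observations and actions are hypothetical, chosen for illustration only; this is not the authors' implementation.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, value); kind is "state" or "action"


def interleave(states: List[str], actions: List[str]) -> List[Token]:
    """Build the sequence (s0, a0, s1, a1, ...) from per-step states and actions."""
    if len(states) != len(actions):
        raise ValueError("expected exactly one action per state")
    tokens: List[Token] = []
    for s, a in zip(states, actions):
        tokens.append(("state", s))
        tokens.append(("action", a))
    return tokens


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def next_token_pairs(tokens: List[Token]) -> List[Tuple[Token, Token]]:
    """Pair each token with the token that follows it (the prediction target)."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]


# Placeholder strings stand in for embedded sensor readings and commands.
tokens = interleave(["s0", "s1", "s2"], ["a0", "a1", "a2"])
mask = causal_mask(len(tokens))
pairs = next_token_pairs(tokens)

for source, target in pairs:
    print(f"{source[1]} -> predict {target[1]} ({target[0]})")
```

Under this assumed ordering, predicting an action from a state prefix resembles learning a behavior policy, while predicting the next state from a state-action prefix resembles learning dynamics, which matches the abstract's claim that the model implicitly encodes both.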
Taxonomy
Topics: Reinforcement Learning in Robotics · Anomaly Detection Techniques and Applications · Modular Robots and Swarm Intelligence
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Label Smoothing
