M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place
Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox

TL;DR
M2T2 is a transformer-based model that integrates multiple low-level manipulation actions for object-centric pick-and-place tasks, demonstrating robust zero-shot sim2real transfer and superior performance in complex, cluttered scenes.
Contribution
The paper introduces M2T2, a unified transformer model capable of multiple low-level actions, bridging the gap between high-level decision-making and low-level manipulation for arbitrary objects.
Findings
Achieves zero-shot sim2real transfer on real robots.
Outperforms a baseline system built from state-of-the-art task-specific models by about 19% in overall performance.
Excels in challenging scenes requiring object re-orientation.
Abstract
With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models are able to interpret complex tasks using language commands, but they often have difficulties generalizing to out-of-distribution objects due to the inability of low-level action primitives. In contrast, existing task-specific models excel in low-level manipulation of unknown objects, but only work for a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model which reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K…
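The abstract describes M2T2 as reasoning about contact points and predicting gripper poses from a point cloud. As a rough illustration of that idea only, the sketch below turns per-point contact scores and predicted directions into 4x4 gripper poses. It is not the authors' implementation: the function names, the approach/baseline parameterization, and the 0.5 threshold are all assumptions made for the example.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return [x / n for x in v]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def pose_from_contact(point, approach, baseline):
    """Build a 4x4 gripper pose (nested lists) at a contact point.

    Hypothetical parameterization: `approach` becomes the z axis,
    `baseline` (finger-to-finger direction) becomes the x axis after
    being made orthogonal to z, and y completes a right-handed frame.
    """
    z = normalize(approach)
    # Gram-Schmidt step: remove the component of baseline along z.
    d = sum(b * zz for b, zz in zip(baseline, z))
    x = normalize([b - d * zz for b, zz in zip(baseline, z)])
    y = cross(z, x)
    return [
        [x[0], y[0], z[0], point[0]],
        [x[1], y[1], z[1], point[1]],
        [x[2], y[2], z[2], point[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def select_poses(points, contact_probs, approaches, baselines, threshold=0.5):
    """Keep poses only at points whose contact score passes the threshold."""
    poses = []
    for p, prob, a, b in zip(points, contact_probs, approaches, baselines):
        if prob >= threshold:
            poses.append(pose_from_contact(p, a, b))
    return poses
```

In this sketch, a separate set of contact scores per action mode (for example, one for picking and one for placing) would simply mean calling `select_poses` once per mode; how M2T2 actually produces and decodes those predictions is described in the paper itself.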
Taxonomy
Topics: Robot Manipulation and Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
