Skill Transformer: A Monolithic Policy for Mobile Manipulation
Xiaoyu Huang, Dhruv Batra, Akshara Rai, Andrew Szot

TL;DR
Skill Transformer is a unified transformer-based approach that learns to predict both high-level skills and low-level actions for long-horizon mobile manipulation tasks, improving robustness and success rates.
Contribution
It introduces a monolithic transformer policy that predicts both the high-level skill and the low-level action, retaining modularity through a skill predictor module while avoiding the hand-off errors common in modular approaches.
Findings
Achieves a 2.5x higher success rate than baselines on hard rearrangement tasks in an embodied rearrangement benchmark.
Performs robust task planning and low-level control in new scenarios.
Effectively integrates skill modularity with end-to-end learning.
Abstract
We present Skill Transformer, an approach for solving long-horizon robotic tasks by combining conditional sequence modeling and skill modularity. Conditioned on egocentric and proprioceptive observations of a robot, Skill Transformer is trained end-to-end to predict both a high-level skill (e.g., navigation, picking, placing), and a whole-body low-level action (e.g., base and arm motion), using a transformer architecture and demonstration trajectories that solve the full task. It retains the composability and modularity of the overall task through a skill predictor module while reasoning about low-level actions and avoiding hand-off errors, common in modular approaches. We test Skill Transformer on an embodied rearrangement benchmark and find it performs robust task planning and low-level control in new scenarios, achieving a 2.5x higher success rate than baselines in hard rearrangement…
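The two-level prediction described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: plain linear maps and an averaged observation history stand in for the transformer, the weights are random rather than trained, and feeding the predicted skill into the action head is an assumption made for illustration. All names here (`TwoStagePolicy`, `SKILLS`, `OBS_DIM`, `ACTION_DIM`) are hypothetical.

```python
import math
import random

# Hypothetical skill set, taken from the examples in the abstract.
SKILLS = ["navigate", "pick", "place"]
OBS_DIM = 4      # assumed size of a fused egocentric + proprioceptive feature vector
ACTION_DIM = 3   # assumed size of a whole-body action (base and arm commands)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class TwoStagePolicy:
    """Toy stand-in: linear maps replace the transformer sequence models."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.skill_w = random_matrix(len(SKILLS), OBS_DIM, rng)
        # Assumption: the action head sees the context plus a one-hot skill.
        self.action_w = random_matrix(ACTION_DIM, OBS_DIM + len(SKILLS), rng)

    def summarize(self, history):
        """Average the observation history (a crude substitute for attention)."""
        n = len(history)
        return [sum(obs[i] for obs in history) / n for i in range(OBS_DIM)]

    def act(self, history):
        context = self.summarize(history)
        # Stage 1: predict a distribution over high-level skills.
        skill_probs = softmax(linear(self.skill_w, context))
        skill_id = max(range(len(SKILLS)), key=lambda i: skill_probs[i])
        one_hot = [1.0 if i == skill_id else 0.0 for i in range(len(SKILLS))]
        # Stage 2: predict a bounded low-level action given context and skill.
        raw = linear(self.action_w, context + one_hot)
        action = [math.tanh(a) for a in raw]
        return SKILLS[skill_id], skill_probs, action


policy = TwoStagePolicy(seed=0)
history = [[0.1, 0.0, 0.3, -0.2], [0.2, 0.1, 0.3, -0.1]]
skill, skill_probs, action = policy.act(history)
print(skill, [round(a, 3) for a in action])
```

Because both stages live in one policy object and share the same context, there is no separate controller to hand off to when the predicted skill changes, which is the property the abstract attributes to the monolithic design.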
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections
