Structural Action Transformer for 3D Dexterous Manipulation
Xiaohan Lei, Min Wang, Bohong Weng, Wengang Zhou, Houqiang Li

TL;DR
This paper introduces the Structural Action Transformer (SAT), a novel 3D manipulation policy that uses a structural-centric approach to enable effective cross-embodiment skill transfer for high-DoF robotic hands.
Contribution
The paper proposes a structural-centric action representation and an Embodied Joint Codebook to improve cross-embodiment transfer and sample efficiency in 3D dexterous manipulation.
Findings
Outperforms baseline methods in simulation and real-world tasks
Demonstrates superior sample efficiency in learning from heterogeneous datasets
Enables effective transfer of skills across different robotic hand embodiments
Abstract
Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we…
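The structural reframing described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the helper names `to_joint_tokens` and `pad_batch`, the zero-padding scheme, and the boolean validity mask are all assumptions made for illustration. The sketch only shows how a (time steps × joints) action chunk could be turned into one token per joint, so that embodiments with different joint counts become sequences of different lengths.

```python
from typing import List, Tuple

# A chunk is a list of T time steps, each listing J joint values: shape (T, J).
Chunk = List[List[float]]
Tokens = List[List[float]]  # J joint-wise trajectories, each of length T


def to_joint_tokens(chunk: Chunk) -> Tokens:
    """Transpose a (T, J) action chunk into J trajectories of length T."""
    if not chunk:
        return []
    num_joints = len(chunk[0])
    if any(len(step) != num_joints for step in chunk):
        raise ValueError("every time step must list the same number of joints")
    return [[step[j] for step in chunk] for j in range(num_joints)]


def pad_batch(token_sets: List[Tokens], horizon: int) -> Tuple[List[Tokens], List[List[bool]]]:
    """Pad token sets to a common joint count (assumed scheme, for batching only).

    Returns the padded tokens and a mask that is True for real joints
    and False for padding, so attention could ignore the padded entries.
    """
    max_joints = max(len(tokens) for tokens in token_sets)
    padded, mask = [], []
    for tokens in token_sets:
        missing = max_joints - len(tokens)
        padded.append(tokens + [[0.0] * horizon for _ in range(missing)])
        mask.append([True] * len(tokens) + [False] * missing)
    return padded, mask


# Hypothetical embodiments: a 2-joint gripper and a 3-joint hand, horizon T = 4.
horizon = 4
gripper_chunk = [[float(t), float(2 * t)] for t in range(horizon)]
hand_chunk = [[float(t), float(2 * t), float(3 * t)] for t in range(horizon)]

tokens, mask = pad_batch(
    [to_joint_tokens(gripper_chunk), to_joint_tokens(hand_chunk)], horizon
)
```

In this sketch the sequence length equals the joint count rather than the time horizon, which is the sense in which the representation is structural rather than temporal. Because the tokens are treated as unordered, a model would need some extra signal to tell joints apart, and the abstract indicates that the paper adds structural priors for that purpose.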
Taxonomy
Topics: Robot Manipulation and Learning · Human Motion and Animation · Human Pose and Action Recognition
