RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation
Xuetao Li, Wenke Huang, Nengyuan Pan, Kaiyan Zhao, Songhua Yang, Yiming Wang, Mengde Li, Mang Ye, Jifeng Xuan, Miao Li

TL;DR
RGMP is a novel end-to-end framework that combines geometric reasoning with data-efficient visuomotor control, enabling humanoid robots to generalize skills with minimal training data and high success rates.
Contribution
The paper introduces RGMP, which integrates geometric priors into multimodal policy learning and significantly improves generalization and data efficiency in humanoid robot manipulation tasks.
Findings
Achieves 87% task success in generalization tests.
Demonstrates 5x greater data efficiency than state-of-the-art models.
Enables dexterous motion synthesis from sparse demonstrations.
Abstract
Humanoid robots exhibit significant potential in executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that necessitate extensive training datasets to achieve robust multimodal decision-making capabilities and generalizable visuomotor control. These methods raise concerns due to the neglect of geometric reasoning in unseen scenarios and the inefficient modeling of robot-target relationships within the training data, resulting in significant waste of training resources. To address these limitations, we present the Recurrent Geometric-prior Multimodal Policy (RGMP), an end-to-end framework that unifies geometric-semantic skill reasoning with data-efficient visuomotor control. For perception capabilities, we propose the Geometric-prior Skill Selector, which infuses geometric inductive biases into a vision language model, producing…
Taxonomy
Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Human Pose and Action Recognition
