GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation

Yangtao Chen; Zixuan Chen; Junhui Yin; Jing Huo; Pinzhuo Tian; Jieqi Shi; Yang Gao

arXiv:2409.20154·cs.RO·March 18, 2025

TL;DR

GravMAD is a framework that combines imitation learning with foundation models to improve language-instructed 3D robot manipulation, especially on unseen tasks.

Contribution

It introduces a sub-goal-driven, language-conditioned action diffusion method with GravMaps for flexible 3D guidance and a new Sub-goal Keypose Discovery process for better task understanding.

Findings

01. 28.63% improvement on novel tasks

02. 13.36% gain on seen tasks

03. Effective generalization to real-world tasks

Abstract

Robots' ability to follow language instructions and execute diverse 3D manipulation tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in understanding novel tasks, thereby mitigating this issue. However, these methods lack a task-specific learning process, which is essential for an accurate understanding of 3D environments, often leading to execution failures. In this paper, we introduce GravMAD, a sub-goal-driven, language-conditioned action diffusion framework that combines the strengths of imitation learning and foundation models. Our approach breaks tasks into sub-goals based on language instructions, allowing auxiliary guidance during both training and inference. During training, we introduce Sub-goal Keypose…

Taxonomy

Topics: 3D Shape Modeling and Analysis · Robot Manipulation and Learning · Human Motion and Animation