HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation
Chongyang Xu, Shen Cheng, Haipeng Li, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

TL;DR
HeRO introduces a hierarchical semantic diffusion policy that combines geometry and semantics for improved pose-aware robotic manipulation, achieving state-of-the-art results across multiple tasks.
Contribution
The paper presents HeRO, a novel diffusion-based policy that integrates hierarchical semantic fields with geometric features for enhanced pose-aware manipulation.
Findings
Improves success rate by 12.3% on the Place Dual Shoes task
Gains an average of 6.5% in success rate across six pose-aware tasks
Establishes new state-of-the-art performance in pose-aware manipulation
Abstract
Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on global and local fields using…
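As a rough illustration of the fuse-then-partition idea described in the abstract, the sketch below combines two per-point feature sets and splits the result into one global field and several local fields. It is not the authors' implementation. Random arrays stand in for lifted DINOv2 and Stable Diffusion features, fusion is assumed to be L2-normalization followed by concatenation, and plain k-means is swapped in as a hypothetical partitioning step because the abstract does not say how the fields are partitioned. All function and variable names are illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_features(feat_a, feat_b):
    # Assumed fusion rule: normalize each source, then concatenate per point.
    return np.concatenate([l2_normalize(feat_a), l2_normalize(feat_b)], axis=-1)

def partition_points(features, k, iters=10, seed=0):
    # Plain k-means over per-point features (a stand-in, not the paper's method).
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

# Placeholder inputs: a point cloud plus two per-point feature sets.
rng = np.random.default_rng(0)
num_points, num_parts = 256, 4
xyz = rng.normal(size=(num_points, 3))
feat_dino = rng.normal(size=(num_points, 16))  # stand-in for lifted DINOv2 features
feat_sd = rng.normal(size=(num_points, 8))     # stand-in for lifted Stable Diffusion features

fused = fuse_features(feat_dino, feat_sd)
labels = partition_points(fused, k=num_parts)

# Global field: every point with its fused feature. Local fields: one subset per part.
global_field = np.concatenate([xyz, fused], axis=-1)
local_fields = [global_field[labels == j] for j in range(num_parts)]
```

Normalizing each source before concatenation keeps either feature set from dominating distance computations purely because of scale. Whether HeRO fuses features this way is an assumption here.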
Taxonomy
Topics: Robot Manipulation and Learning · 3D Shape Modeling and Analysis · Human Pose and Action Recognition
