HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

Chongyang Xu; Shen Cheng; Haipeng Li; Haoqiang Fan; Ziliang Feng; Shuaicheng Liu

arXiv:2602.18817 · cs.CV · February 24, 2026

TL;DR

HeRO introduces a hierarchical semantic diffusion policy that combines geometry and semantics for improved pose-aware robotic manipulation, achieving state-of-the-art results across multiple tasks.

Contribution

The paper presents HeRO, a novel diffusion-based policy that integrates hierarchical semantic fields with geometric features for enhanced pose-aware manipulation.

Findings

01 Achieves 12.3% success improvement on Place Dual Shoes

02 Gains an average of 6.5% across six pose-aware tasks

03 Establishes new state-of-the-art performance in pose-aware manipulation

Abstract

Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on global and local fields using…
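
As a rough illustration of the pipeline the abstract outlines (fuse two feature sources per 3D point, then split the result into one global field and several local fields), here is a minimal sketch in plain Python. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the choice of L2-normalized concatenation as the fusion step, the use of integer part labels to partition points, and mean pooling as a stand-in for the conditioning step, which the truncated abstract does not describe. The toy inputs stand in for DINOv2 and Stable Diffusion features that have already been lifted onto points.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(feats_a, feats_b):
    """Assumed fusion: concatenate L2-normalized per-point features."""
    if len(feats_a) != len(feats_b):
        raise ValueError("both sources must provide one feature per point")
    return [l2_normalize(a) + l2_normalize(b) for a, b in zip(feats_a, feats_b)]


def build_fields(points, fused, part_ids):
    """Return (global_field, local_fields).

    global_field: one row per point, xyz followed by the fused feature.
    local_fields: hypothetical partition of those rows by part label.
    """
    global_field = [list(p) + f for p, f in zip(points, fused)]
    local_fields = defaultdict(list)
    for row, pid in zip(global_field, part_ids):
        local_fields[pid].append(row)
    return global_field, dict(local_fields)


def mean_pool(rows):
    """Average a list of equal-length rows column by column."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def condition_tokens(global_field, local_fields):
    """Placeholder conditioning: one pooled global token, then one per part."""
    tokens = [mean_pool(global_field)]
    tokens += [mean_pool(local_fields[k]) for k in sorted(local_fields)]
    return tokens


# Toy data: four points, 2-D features per source, two hypothetical parts.
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
dino_like = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]]
sd_like = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 5.0]]
part_ids = [0, 0, 1, 1]

fused = fuse_features(dino_like, sd_like)
global_field, local_fields = build_fields(points, fused, part_ids)
tokens = condition_tokens(global_field, local_fields)
print(len(tokens), len(tokens[0]))
```

In this sketch a denoiser would receive the global token alongside the per-part tokens, so it can tell, for example, a shoe's toe region from its heel region. How HeRO actually forms and consumes its fields is specified in the full paper, not in this excerpt.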

Taxonomy

Topics: Robot Manipulation and Learning · 3D Shape Modeling and Analysis · Human Pose and Action Recognition