Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

Yu Qin; Shimeng Fan; Fan Yang; Zixuan Xue; Zijie Mai; Wenrui Chen; Kailun Yang; Zhiyong Li

arXiv:2601.13565 · cs.CV · January 21, 2026

TL;DR

FiCoP introduces a patch-level correspondence framework with structural priors and dual-view fusion to enhance open-vocabulary 6D object pose estimation in complex environments.

Contribution

The paper proposes a fine-grained matching approach that combines patch correlation with cross-perspective perception, improving robustness over global matching strategies.

Findings

01

Achieves 8.0% higher Average Recall on the REAL275 dataset.

02

Outperforms the state of the art by 6.1% on the Toyota-Light dataset.

03

Demonstrates robustness in complex, unconstrained environments.

Abstract

Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, matching anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially constrained patch-level correspondence. Our core innovation lies in leveraging a patch-to-patch correlation matrix as a structural prior to narrow the matching scope, filtering out irrelevant clutter so that it does not degrade pose estimation. Firstly, we introduce an object-centric…
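
The idea of using a patch-to-patch correlation matrix to narrow the matching scope can be illustrated with a minimal NumPy sketch. This is not FiCoP's implementation: the top-k masking rule, the cosine-similarity choice, and the function names (`patch_correlation`, `topk_prior_mask`, `constrained_match`) are all assumptions made purely for illustration. The sketch shows how restricting each anchor patch to its most correlated query patches keeps a high-scoring background distractor from winning the match.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each feature vector to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def patch_correlation(anchor_feats, query_feats):
    # Hypothetical helper: cosine-similarity matrix of shape
    # (num_anchor_patches, num_query_patches).
    return l2_normalize(anchor_feats) @ l2_normalize(query_feats).T

def topk_prior_mask(corr, k):
    # Assumed prior: for every anchor patch, keep only its k most
    # correlated query patches as admissible match candidates.
    idx = np.argpartition(-corr, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(corr, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def constrained_match(scores, mask):
    # Pick the best query patch per anchor patch, searching only
    # inside the region the prior mask allows.
    return np.where(mask, scores, -np.inf).argmax(axis=1)
```

In use, `scores` would come from whatever finer-grained matcher is applied on top of the prior. Calling `scores.argmax(axis=1)` corresponds to unconstrained global matching, while `constrained_match(scores, mask)` ignores any query patch outside the prior, however high its score.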

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization