MK-Pose: Category-Level Object Pose Estimation via Multimodal-Based Keypoint Learning

Yifan Yang; Peili Song; Enfan Lan; Dong Liu; Jingtai Liu

arXiv:2507.06662 · cs.CV · July 10, 2025

TL;DR

MK-Pose is a multimodal framework for category-level object pose estimation that combines RGB images, point clouds, and category-level text descriptions, improving accuracy and robustness to occlusion and instance variation.

Contribution

The paper presents a multimodal keypoint learning framework with attention and graph modules that estimates pose without relying on shape priors.

Findings

01 Outperforms state-of-the-art methods on IoU and average precision

02 Demonstrates effective cross-dataset generalization

03 Does not rely on shape priors
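
The IoU named in the first finding measures the overlap between a predicted and a ground-truth 3D bounding box. As a rough illustration only, the sketch below computes IoU for axis-aligned boxes; this is a simplifying assumption, since pose benchmarks generally score oriented boxes, and the function name `aabb_iou_3d` and the box format are hypothetical, not taken from the paper.

```python
import numpy as np

def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (min_xyz, max_xyz).

    Assumption: boxes are axis-aligned. Oriented boxes need a more
    involved intersection computation that is not shown here.
    """
    min_a, max_a = np.asarray(box_a[0], float), np.asarray(box_a[1], float)
    min_b, max_b = np.asarray(box_b[0], float), np.asarray(box_b[1], float)

    # Per-axis overlap length, clipped at zero when the boxes are disjoint.
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter = overlap.prod()

    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    union = vol_a + vol_b - inter
    return float(inter / union) if union > 0 else 0.0
```

For two unit cubes offset by half a side along one axis, the intersection volume is 0.5 and the union is 1.5, so the IoU is 1/3.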

Abstract

Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching, and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further…
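
The abstract does not spell out how soft heatmap matching is computed. One common reading, offered here as a minimal sketch under stated assumptions rather than the authors' implementation, is that each keypoint query is compared against per-point features, the similarities are normalized with a softmax into a heatmap over the point cloud, and the keypoint is taken as the heatmap-weighted average of the 3D points. The function name, the dot-product similarity, and the `temperature` parameter are all hypothetical.

```python
import numpy as np

def soft_heatmap_keypoints(points, point_feats, queries, temperature=0.1):
    """Soft keypoint localization over a point cloud (illustrative sketch).

    points:      (N, 3) 3D coordinates
    point_feats: (N, C) per-point features (assumed to come from some backbone)
    queries:     (K, C) keypoint queries (assumed to be learned or generated)
    Returns keypoints (K, 3) and heatmaps (K, N).
    """
    # Dot-product similarity between every query and every point feature.
    logits = queries @ point_feats.T / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    heatmaps = np.exp(logits)
    heatmaps /= heatmaps.sum(axis=1, keepdims=True)
    # Each keypoint is the expected 3D position under its heatmap.
    keypoints = heatmaps @ points
    return keypoints, heatmaps
```

Because each heatmap row sums to one, every keypoint is a convex combination of the input points and so stays within their bounding box. A lower `temperature` sharpens the heatmap toward a hard argmax.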

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Image and Object Detection Techniques

Methods: Heatmap