X-Pose: Detecting Any Keypoints
Jie Yang, Ailing Zeng, Ruimao Zhang, Lei Zhang

TL;DR
X-Pose is an end-to-end framework that uses multi-modal prompts (visual, textual, or both) to detect any keypoints across diverse objects and scenarios. It is trained on UniKPT, a large unified dataset, and reports gains of 27.7 AP, 6.44 PCK, and 7.0 AP over existing methods.
Contribution
The paper presents X-Pose, a novel multi-modal prompt-based keypoint detection framework and the UniKPT dataset, enabling accurate detection of diverse keypoints in complex real-world images.
Findings
X-Pose improves on state-of-the-art non-promptable, visual prompt-based, and text prompt-based methods by 27.7 AP, 6.44 PCK, and 7.0 AP, respectively.
The UniKPT dataset unifies 13 datasets with 338 keypoints across 1,237 categories.
X-Pose demonstrates strong generalization across styles, categories, and poses.
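
For readers unfamiliar with the PCK (Percentage of Correct Keypoints) metric cited in the findings, the following is a minimal sketch under a commonly used definition: a predicted keypoint counts as correct when it lies within `alpha` times an object size (here the bounding-box size) of the ground truth. The function name `pck`, the `alpha` default, and the sample coordinates are illustrative assumptions; the summary does not state the exact threshold or normalization used in the paper's evaluation.

```python
import math

def pck(pred, gt, bbox_size, alpha=0.2, visible=None):
    """Fraction of visible keypoints predicted within alpha * bbox_size of ground truth.

    pred, gt: lists of (x, y) tuples in the same keypoint order.
    bbox_size: normalizing length (assumed here to be the longer bbox side).
    visible: optional list of booleans; invisible keypoints are skipped.
    """
    if visible is None:
        visible = [True] * len(gt)
    threshold = alpha * bbox_size
    hits = 0
    total = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue
        total += 1
        # Euclidean distance between prediction and ground truth
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / total if total else 0.0

# Hypothetical example: three keypoints on an object with bbox size 100
gt = [(10.0, 10.0), (50.0, 40.0), (80.0, 90.0)]
pred = [(12.0, 11.0), (70.0, 40.0), (81.0, 88.0)]
score = pck(pred, gt, bbox_size=100.0, alpha=0.1)
```

In this toy case the second keypoint is off by 20 pixels, which exceeds the 10-pixel threshold, so two of the three keypoints count as correct.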
Abstract
This work aims to address an advanced keypoint detection problem: how to accurately detect any keypoints in complex real-world scenarios, which involve massive, messy, and open-ended objects as well as their associated keypoint definitions. Current high-performance keypoint detectors often fail to tackle this problem due to their two-stage schemes, under-explored prompt designs, and limited training data. To bridge the gap, we propose X-Pose, a novel end-to-end framework with multi-modal (i.e., visual, textual, or their combinations) prompts to detect multi-object keypoints for any articulated (e.g., human and animal), rigid, and soft objects within a given image. Moreover, we introduce a large-scale dataset called UniKPT, which unifies 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances. Training with UniKPT, X-Pose effectively aligns…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: ALIGN · Contrastive Learning · Focus
