CapeX: Category-Agnostic Pose Estimation from Textual Point Explanation
Matan Rusanovsky, Or Hirschorn, Shai Avidan

TL;DR
This paper introduces a text-based, category-agnostic pose estimation method that uses textual keypoint descriptions and a pose-graph, outperforming previous image-based approaches and advancing the state-of-the-art on a diverse benchmark.
Contribution
It proposes a novel text-based approach for category-agnostic pose estimation, eliminating the need for support images and improving generalization and occlusion handling.
Findings
Achieves a 1.07% performance boost on the MP-100 benchmark.
Introduces text description annotations to enrich dataset utility.
Establishes a new state-of-the-art in 1-shot CAPE.
Abstract
Conventional 2D pose estimation models are constrained by their design to specific object categories. This limits their applicability to predefined objects. To overcome these limitations, category-agnostic pose estimation (CAPE) emerged as a solution. CAPE aims to facilitate keypoint localization for diverse object categories using a unified model, which can generalize from minimal annotated support images. Recent CAPE works have produced object poses based on arbitrary keypoint definitions annotated on a user-provided support image. Our work departs from conventional CAPE methods, which require a support image, by adopting a text-based approach instead of the support image. Specifically, we use a pose-graph, where nodes represent keypoints that are described with text. This representation takes advantage of the abstraction of text descriptions and the structure imposed by the graph.…
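The pose-graph described in the abstract (nodes are keypoints carrying text descriptions, edges impose structure) can be pictured with a minimal data-structure sketch. This is an illustration only, not the authors' implementation: the class name `PoseGraph`, its fields, and the chair keypoints are all hypothetical, and the text encoder that would embed each description is left out.

```python
from dataclasses import dataclass, field


@dataclass
class PoseGraph:
    """Hypothetical container: text-described keypoints (nodes) plus links (edges)."""
    descriptions: list[str]  # one text description per keypoint
    edges: list[tuple[int, int]] = field(default_factory=list)  # undirected index pairs

    def adjacency(self) -> list[list[int]]:
        """Return a symmetric 0/1 adjacency matrix over the keypoints."""
        n = len(self.descriptions)
        adj = [[0] * n for _ in range(n)]
        for a, b in self.edges:
            if not (0 <= a < n and 0 <= b < n):
                raise ValueError(f"edge ({a}, {b}) references a missing keypoint")
            adj[a][b] = 1
            adj[b][a] = 1
        return adj

    def neighbors(self, index: int) -> list[str]:
        """Return the descriptions of keypoints linked to the given keypoint."""
        row = self.adjacency()[index]
        return [self.descriptions[j] for j, linked in enumerate(row) if linked]


# Illustrative keypoints and links only; these are not taken from MP-100.
chair = PoseGraph(
    descriptions=[
        "front left leg",
        "front right leg",
        "seat front left corner",
        "seat front right corner",
    ],
    edges=[(0, 2), (1, 3), (2, 3)],
)
```

In a setup like the one the abstract outlines, the descriptions would presumably be embedded by a text encoder and the adjacency would tell the model which keypoints are structurally related; both of those steps are assumptions here and are not shown.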
Peer Reviews
Decision: ICLR 2025 Poster
This work focuses on the interesting and important task of category-agnostic pose estimation (CAPE). The proposed CapeX uses abstract textual descriptions for keypoint detection to improve human-computer interaction. A graph structure is applied to capture the relationships between keypoints. CapeX achieves state-of-the-art performance on the MP-100 dataset.
Pose Anything already designed a graph structure to capture keypoint correlations, and textual prompts have been explored for pose estimation in recent works such as KDSM. X-Pose has also provided textual annotations on the MP-100 dataset. Therefore, the contribution of this paper is somewhat incremental. Graph construction is critical for a graph-based approach. Also, manual design of node connections may introduce extra empirical knowledge, leading to an unfair comparison. To avoid the a
1. This paper addresses a very important issue: relying on visual information from a support image to locate keypoints in a test image is not reliable. 2. The use of a graph and text representation is intriguing, and I believe it provides a simple yet effective solution to the challenges associated with relying solely on visual information for localization. 3. The paper expands the MP-100 dataset, enabling graph-based CAPE. 4. The experiments in the paper are thorough, and the comparisons are
I don’t have many questions regarding this paper, just a few minor issues. 1. Were all the keypoint texts in the dataset generated by off-the-shelf foundation models? When many keypoints are very close to each other and have highly similar semantics, how are these very similar keypoints distinguished in the text? Could you provide some examples? 2. The authors could provide more analysis regarding the limitations of the paper, explaining why certain failure cases occur.
1. The motivation is clear and the paper reads smoothly. 2. The design of the presented method is reasonable. 3. Experiments are conducted on the standard MP-100 dataset, establishing a new state-of-the-art for CAPE. The presented method outperforms previous CAPE models. In particular, the experimental analysis (e.g., Figures 6 and 7) is very interesting.
1. The novelty of the presented paper is a little bit concerning. (1) Open-vocabulary (textual prompt) keypoint estimation is not new; there are works such as CLAMP (Zhang et al. 2023) and KDSM (Zhang et al. 2024). (2) Graph representation for few-shot/zero-shot keypoint estimation is not new, as there are works like GraphCape (Hirschorn & Avidan, 2023). Could you highlight the difference between these methods? 2. Pose-Graph seems to be the most important contribution. However, experimental a
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
