CapeNext: Rethinking and Refining Dynamic Support Information for Category-Agnostic Pose Estimation

Yu Zhu; Dan Zeng; Shuiwang Li; Qijun Zhao; Qiaomu Shen; Bo Tang

arXiv:2511.13102 · cs.CV · December 16, 2025

Open Access

TL;DR

CapeNext introduces a hierarchical cross-modal interaction framework that refines joint embeddings with class-level and instance-specific cues, significantly improving category-agnostic pose estimation accuracy over existing methods.

Contribution

The paper proposes a framework that addresses the limitations of static joint embeddings by integrating hierarchical cross-modal interaction with dual-stream feature refinement.

Findings

01. Outperforms state-of-the-art CAPE methods on the MP-100 dataset

02. Delivers consistent improvements across different network backbones

03. Reduces cross-category ambiguity and better handles intra-category variation

Abstract

Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint descriptions as a semantic prior for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by removing the dependency on support images, our critical analysis reveals two inherent limitations of static joint embeddings: (1) polysemy-induced cross-category ambiguity during the matching process (e.g., the concept "leg" exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that innovatively integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both…
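
To make the "static joint embedding" limitation concrete, the following is a minimal sketch, not the paper's architecture. It assumes a single cross-attention-style step in which a text embedding for a keypoint attends over image features and is updated with a residual. The function name `refine_joint_embedding`, the toy 3-dimensional vectors, and the feature values are all hypothetical and chosen only for illustration.

```python
import math

def refine_joint_embedding(text_emb, image_feats):
    """Hypothetical single-step refinement: the text embedding acts as a
    query over image features (scaled dot-product attention), and the
    attended context is added back as a residual."""
    d = len(text_emb)
    # Scaled dot-product score between the text query and each image feature.
    scores = [sum(t * f for t, f in zip(text_emb, feat)) / math.sqrt(d)
              for feat in image_feats]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of image features.
    context = [sum(w * feat[i] for w, feat in zip(weights, image_feats))
               for i in range(d)]
    # Residual update: static text embedding plus image-specific context.
    return [t + c for t, c in zip(text_emb, context)]

# One static embedding for the word "leg", shared by every category (toy values).
leg_text = [1.0, 0.0, 0.5]

# Toy image features for two different categories (assumed, not real data).
human_feats = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.1]]
chair_feats = [[0.1, 0.2, 0.9], [0.7, 0.7, 0.0]]

leg_human = refine_joint_embedding(leg_text, human_feats)
leg_chair = refine_joint_embedding(leg_text, chair_feats)
```

With a static prior, both categories would be matched against the identical vector `leg_text`. After the image-conditioned update, `leg_human` and `leg_chair` differ, which is the kind of instance-specific cue the abstract argues a fixed textual description cannot provide.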

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Multimodal Machine Learning Applications