CapeNext: Rethinking and Refining Dynamic Support Information for Category-Agnostic Pose Estimation
Yu Zhu, Dan Zeng, Shuiwang Li, Qijun Zhao, Qiaomu Shen, Bo Tang

TL;DR
CapeNext introduces a hierarchical cross-modal interaction framework that refines joint embeddings with class-level and instance-specific cues, significantly improving category-agnostic pose estimation accuracy over existing methods.
Contribution
The paper proposes a novel framework that addresses the limitations of static joint embeddings by integrating hierarchical cross-modal interaction with dual-stream feature refinement.
Findings
Outperforms state-of-the-art CAPE methods on the MP-100 dataset
Consistent improvements across different network backbones
Effectively reduces cross-category ambiguity and intra-category variation issues
Abstract
Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint descriptions as a semantic prior for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by removing the dependency on support images, our critical analysis reveals two inherent limitations of static joint embeddings: (1) polysemy-induced cross-category ambiguity during the matching process (e.g., the concept "leg" exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that innovatively integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both…
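Illustrative sketch
The polysemy problem described in the abstract can be made concrete with a small NumPy example. This is a minimal sketch under stated assumptions, not the paper's architecture: the function names (`cross_attention`, `refine_joint_embeddings`), the additive class-level shift, the residual attention step, and the random stand-in features are all hypothetical. It only shows how one static "leg" embedding can become category- and instance-dependent once class-level and image-level cues are mixed in.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Scaled dot-product attention: each query row attends over context rows.
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context

def refine_joint_embeddings(joint_emb, class_emb, image_feats):
    """Hypothetical two-step refinement of static joint embeddings.

    joint_emb:   (K, D) fixed text embeddings of K keypoint names
    class_emb:   (D,)   text embedding of the category name (class-level cue)
    image_feats: (N, D) query-image patch features (instance-specific cue)
    """
    # Class-level cue (assumed form): shift each joint embedding by its category.
    class_aware = joint_emb + class_emb[None, :]
    # Instance-level cue (assumed form): residual cross-attention over patches.
    return class_aware + cross_attention(class_aware, image_feats)

# Random stand-ins for real text and image features (assumption).
rng = np.random.default_rng(0)
D = 16
leg = rng.normal(size=(1, D))                     # one static "leg" embedding
human, chair = rng.normal(size=D), rng.normal(size=D)
patches_a = rng.normal(size=(49, D))              # patches of a human image
patches_b = rng.normal(size=(49, D))              # patches of a chair image

leg_human = refine_joint_embeddings(leg, human, patches_a)
leg_chair = refine_joint_embeddings(leg, chair, patches_b)
```

Before refinement, both categories share the identical vector `leg`. After refinement, `leg_human` and `leg_chair` differ, which is the property a dynamic joint embedding needs in order to separate "leg" on a person from "leg" on furniture.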
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Multimodal Machine Learning Applications
