GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

Jiyong Rao; Yu Wang; Shengjie Zhao

arXiv:2605.13151 · cs.CV · May 14, 2026

TL;DR

GenCape introduces a generative framework for category-agnostic pose estimation that infers instance-specific keypoint structures directly from images, improving accuracy and flexibility across diverse categories.

Contribution

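The paper proposes a structure-aware generative model built from two components, an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. Together they model instance-specific keypoint relationships from image-based support inputs, without predefined skeletons or textual descriptions.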
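To make the idea of inferring keypoint relationships without a predefined skeleton more concrete, here is a minimal toy sketch. It is not the authors' i-SVAE or CGT implementation, whose details are not given on this page. Everything in it is an assumption made for illustration: the function names, the dot-product edge scores standing in for a learned encoder, the fixed noise scale `sigma`, and the 50/50 residual update. It only shows the general pattern of alternating between sampling a soft keypoint graph and refining keypoint features over that graph.

```python
import math
import random


def row_softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def infer_soft_adjacency(feats, sigma, rng):
    """Sample a soft K x K adjacency from pairwise feature similarity.

    Hypothetical stand-in for a learned encoder: the mean of each latent
    edge score is a scaled dot product, and Gaussian noise is added in
    the reparameterized form z = mu + sigma * eps.
    """
    dim = len(feats[0])
    adjacency = []
    for f_i in feats:
        scores = []
        for f_j in feats:
            mu = sum(a * b for a, b in zip(f_i, f_j)) / math.sqrt(dim)
            scores.append(mu + sigma * rng.gauss(0.0, 1.0))
        adjacency.append(row_softmax(scores))
    return adjacency


def propagate(feats, adjacency):
    """One round of message passing over the sampled graph.

    Each keypoint feature becomes an equal mix of itself and the
    adjacency-weighted average of all keypoint features (assumed rule).
    """
    dim = len(feats[0])
    out = []
    for i, f_i in enumerate(feats):
        msg = [
            sum(adjacency[i][j] * feats[j][d] for j in range(len(feats)))
            for d in range(dim)
        ]
        out.append([0.5 * a + 0.5 * m for a, m in zip(f_i, msg)])
    return out


def iterative_structure_inference(feats, steps=3, sigma=0.1, seed=0):
    """Alternate between re-estimating the graph and refining features."""
    rng = random.Random(seed)
    adjacency = None
    for _ in range(steps):
        adjacency = infer_soft_adjacency(feats, sigma, rng)
        feats = propagate(feats, adjacency)
    return feats, adjacency


# Three made-up 2-D "keypoint features" from a single support image.
support_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
refined, graph = iterative_structure_inference(support_feats)
```

Here `graph` is a row-normalized 3 x 3 matrix of soft edge weights and `refined` holds the updated features. In a real system the edge scores and noise scale would presumably come from trained networks rather than fixed rules.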

Findings

01. Achieves significant improvements over graph-support baselines in 1- and 5-shot settings.

02. Maintains competitive performance against text-support methods.

03. Demonstrates effective structural inference from support images alone.

Abstract

Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model's capacity to capture instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a Generative-based framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The…

Videos

GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation · SlidesLive