SCAPE: A Simple and Strong Category-Agnostic Pose Estimator
Yujia Liang, Zixuan Ye, Wenze Liu, Hao Lu

TL;DR
SCAPE introduces a simplified, attention-based approach for category-agnostic pose estimation, achieving superior accuracy and efficiency over prior methods by focusing on feature matching within a streamlined architecture.
Contribution
The paper proposes a simple, strong baseline for Category-Agnostic Pose Estimation (CAPE) built on pure self-attention, and introduces two modules that improve attention quality. The resulting model outperforms prior art in both accuracy and speed.
Findings
Outperforms prior methods by 2.2 and 1.3 PCK in the 1-shot and 5-shot settings, respectively
Faster inference and a lighter model
More effective attention through a global keypoint feature perceptor and a keypoint attention refiner
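The PCK figures above refer to the Percentage of Correct Keypoints metric. A minimal sketch of how such a score is commonly computed follows; the function name `pck`, the 0.2 threshold, and the normalization by bounding-box size are assumptions for illustration, since this page does not state the exact evaluation protocol.

```python
import numpy as np

def pck(pred, gt, bbox_size, threshold=0.2, visible=None):
    """Fraction of keypoints whose error is within threshold * bbox_size.

    pred, gt: arrays of shape (K, 2) holding (x, y) keypoint coordinates.
    bbox_size: normalizing length (assumed here: longest bounding-box side).
    threshold: assumed value of 0.2; the page does not state it.
    visible: optional boolean mask of shape (K,) selecting annotated keypoints.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint Euclidean error
    if visible is None:
        visible = np.ones(dist.shape, dtype=bool)
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return float("nan")  # nothing to evaluate
    correct = dist[visible] <= threshold * bbox_size
    return float(correct.mean())

# Toy example with made-up coordinates (not data from the paper).
gt = np.array([[10.0, 10.0], [50.0, 50.0], [90.0, 20.0]])
pred = np.array([[12.0, 11.0], [80.0, 50.0], [90.0, 25.0]])
score = pck(pred, gt, bbox_size=100.0)
print(f"PCK: {score:.3f}")
```

In this toy case two of the three predictions fall within 20 pixels of the ground truth, so the score is 2/3.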
Abstract
Category-Agnostic Pose Estimation (CAPE) aims to localize keypoints on an object of any category given few exemplars in an in-context manner. Prior art involves sophisticated designs, e.g., sundry modules for similarity calculation and a two-stage framework, or requires extra heatmap generation and supervision. We notice that CAPE is essentially a task about feature matching, which can be solved within the attention process. Therefore we first streamline the architecture into a simple baseline consisting of several pure self-attention layers and an MLP regression head; this simplification means that one only needs to consider the attention quality to boost the performance of CAPE. Towards an effective attention process for CAPE, we further introduce two key modules: i) a global keypoint feature perceptor to inject global semantic information into support keypoints, and ii) a keypoint…
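The baseline described in the abstract (a few pure self-attention layers followed by an MLP regression head) can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the single-head attention, random weights, token counts, layer count, residual connections, and sigmoid output are all assumptions chosen for brevity, and the two proposed modules are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    """Single-head self-attention with a residual connection (assumed design)."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # rows sum to 1
    return tokens + attn @ v, attn

def mlp_head(x, w1, b1, w2, b2):
    """Two-layer MLP regressing normalized (x, y) coordinates per token."""
    hidden = np.maximum(x @ w1 + b1, 0.0)              # ReLU
    return 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # sigmoid into [0, 1]

rng = np.random.default_rng(0)
num_kpts, num_patches, dim = 5, 64, 32  # hypothetical sizes

# Stand-ins for support keypoint features and query image patch features.
support_kpts = rng.normal(size=(num_kpts, dim))
query_patches = rng.normal(size=(num_patches, dim))

# Feature matching happens inside attention: both token sets share one sequence.
tokens = np.concatenate([support_kpts, query_patches], axis=0)

for _ in range(3):  # "several" layers; 3 is an arbitrary choice
    wq, wk, wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
    tokens, attn = self_attention(tokens, wq, wk, wv)

w1, b1 = rng.normal(scale=dim ** -0.5, size=(dim, dim)), np.zeros(dim)
w2, b2 = rng.normal(scale=dim ** -0.5, size=(dim, 2)), np.zeros(2)

# Only the updated keypoint tokens are passed to the regression head.
coords = mlp_head(tokens[:num_kpts], w1, b1, w2, b2)
print(coords.shape)
```

Because keypoint and patch tokens attend to each other in one sequence, the only quantity that determines matching quality in this sketch is the attention map itself, which is the point the abstract makes about the simplified design.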
Taxonomy
Topics: Robot Manipulation and Learning · EEG and Brain-Computer Interfaces · Mechanics and Biomechanics Studies
Methods: Softmax · Attention Is All You Need · Heatmap · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
