AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with Pretrained ViT
Fangbo Qin, Taogang Hou, Shan Lin, Kaiyuan Wang, Michael C. Yip, Shan Yu

TL;DR
AnyOKP introduces a one-shot, instance-aware object keypoint extraction method leveraging pretrained ViT, capable of identifying keypoints across multiple object instances and categories with high robustness to domain shifts and viewpoint changes.
Contribution
The paper presents a novel one-shot keypoint extraction approach using pretrained ViT that is generalizable, transferable, and does not require training for new object instances.
Findings
Effective on real robot-captured images across various domains
Demonstrates high cross-category flexibility and instance awareness
Shows robustness to domain shift and viewpoint variation
Abstract
Towards flexible object-centric visual perception, we propose a one-shot instance-aware object keypoint (OKP) extraction approach, AnyOKP, which leverages the powerful representation ability of a pretrained vision transformer (ViT) and can obtain keypoints on multiple object instances of arbitrary category after learning from a support image. An off-the-shelf pretrained ViT is directly deployed for generalizable and transferable feature extraction, which is followed by training-free feature enhancement. The best-prototype pairs (BPPs) are searched for in support and query images based on appearance similarity, to yield instance-unaware candidate keypoints. Then, the entire graph with all candidate keypoints as vertices is divided into sub-graphs according to the feature distributions on the graph edges. Finally, each sub-graph represents an object instance. AnyOKP is evaluated on real…
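The matching and grouping steps in the abstract can be illustrated with a rough sketch. It is not the paper's implementation: the abstract does not define how BPPs are searched for or how edges are scored, so the code below substitutes mutual nearest neighbours under cosine similarity for the BPP search, and connected components over thresholded edge scores for the sub-graph division. The function names, the threshold, and the toy feature vectors are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def mutual_nearest_pairs(support_feats, query_feats):
    """Assumed stand-in for the BPP search: keep (support, query) index pairs
    that are each other's best match by appearance similarity."""
    sim = cosine_similarity_matrix(support_feats, query_feats)
    best_query = sim.argmax(axis=1)    # best query index for each support feature
    best_support = sim.argmax(axis=0)  # best support index for each query feature
    return [(s, int(q)) for s, q in enumerate(best_query) if best_support[q] == s]


def split_into_instances(edge_scores, threshold):
    """Assumed stand-in for the sub-graph division: connected components of the
    candidate-keypoint graph, keeping only edges with score >= threshold."""
    n = edge_scores.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and edge_scores[u, v] >= threshold:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


# Toy data: two support features, three query features (hypothetical values).
support = np.array([[1.0, 0.0], [0.0, 1.0]])
query = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]])
pairs = mutual_nearest_pairs(support, query)

# Toy edge scores for four candidate keypoints: 0-1 and 2-3 are strongly linked.
edges = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
instance_labels = split_into_instances(edges, threshold=0.5)
```

In this toy setup each group of candidate keypoints that shares a label would be read as one object instance; in practice the features would come from the ViT rather than from hand-written vectors.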
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization
Methods: Attention Is All You Need · Softmax · Dense Connections · Linear Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Vision Transformer
