ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models
Bingchen Gong, Diego Gomez, Abdullah Hamdi, Abdelrahman Eldesokey, Ahmed Abdelreheem, Peter Wonka, Maks Ovsjanikov

TL;DR
ZeroKey leverages multi-modal large language models to detect and name keypoints on 3D shapes without any ground truth labels, achieving competitive results in a zero-shot setting.
Contribution
This work introduces a zero-shot method that exploits the pixel-level annotations used to train recent MLLMs to detect and name keypoints on 3D shapes without supervision.
Findings
Achieves competitive performance on standard benchmarks.
Requires no annotated 3D keypoints or supervised training.
Demonstrates the potential of language models in 3D shape understanding.
Abstract
We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to…
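The abstract does not say how a pixel-level prediction is transferred onto the 3D model. One plausible step, sketched here purely as an assumption and not as the authors' pipeline, is to render the shape from a known camera, obtain a 2D pixel from the MLLM, and lift that pixel to the closest visible mesh vertex. The function name `lift_pixel_to_vertex`, the pinhole camera parameters `K`, `R`, `t`, and the `radius` tolerance are all hypothetical names introduced for illustration.

```python
import numpy as np

def lift_pixel_to_vertex(pixel, vertices, K, R, t, radius=2.0):
    """Return the index of the vertex that best matches a 2D pixel.

    Assumed setup (not taken from the paper): a pinhole camera with
    intrinsics K (3x3), rotation R (3x3) and translation t (3,), and
    vertices given as an (N, 3) array in world coordinates.
    """
    # World frame -> camera frame.
    cam = vertices @ R.T + t
    # Keep only vertices in front of the camera.
    front = np.flatnonzero(cam[:, 2] > 1e-9)
    if front.size == 0:
        return None
    # Perspective projection to pixel coordinates.
    proj = cam[front] @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    dist = np.linalg.norm(uv - np.asarray(pixel, dtype=float), axis=1)
    # Among vertices that project within `radius` pixels, prefer the one
    # closest to the camera as a crude stand-in for a visibility test.
    near = np.flatnonzero(dist <= radius)
    if near.size == 0:
        # Nothing inside the tolerance: fall back to the nearest projection.
        return int(front[np.argmin(dist)])
    depths = cam[front[near], 2]
    return int(front[near[np.argmin(depths)]])
```

Picking the smallest depth inside the pixel tolerance only approximates occlusion handling; a real renderer would supply a depth buffer or ray cast instead. Repeating the lift over several views and merging nearby hits would be a natural extension, again as an assumption rather than a description of ZeroKey.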
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Residual Connection · Multi-Head Attention · Layer Normalization · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training
