Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions
Henrik Voigt, Jan Hombeck, Monique Meuschke, Kai Lawonn, Sina Zarrieß

TL;DR
This paper evaluates whether CLIP can ground natural language descriptions of viewpoints on 3D objects. It finds that the pre-trained model performs poorly on most canonical views and that fine-tuning improves its results.
Contribution
It introduces an evaluation framework for 3D viewpoint grounding and demonstrates how fine-tuning enhances CLIP's performance in this task.
Findings
Pre-trained CLIP performs poorly on most canonical views without fine-tuning
Fine-tuning with hard negatives improves viewpoint identification
Limited training data still yields significant performance gains
Abstract
Existing language and vision models achieve impressive performance in image-text understanding. Yet, it is an open question to what extent they can be used for language understanding in 3D environments and whether they implicitly acquire 3D object knowledge, e.g. about different views of an object. In this paper, we investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object and identify canonical views of common objects based on text queries. We present an evaluation framework that uses a circling camera around a 3D object to generate images from different viewpoints and evaluate them in terms of their similarity to natural language descriptions. We find that a pre-trained CLIP model performs poorly on most canonical views and that fine-tuning using hard negative sampling and random contrasting yields good results…
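The evaluation loop described in the abstract (a camera circling the object, with each rendered view scored against a text query) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `circling_camera`, `best_view`, and `toy_embed` are hypothetical names, the camera radius and height are arbitrary, and `toy_embed` is a stand-in for a real image and text encoder such as CLIP, which would have to render and embed actual images.

```python
import math

def circling_camera(n_views, radius=2.0, height=0.5):
    """Return (azimuth_deg, (x, y, z)) camera positions on a circle around the origin."""
    views = []
    for i in range(n_views):
        azimuth = 360.0 * i / n_views
        theta = math.radians(azimuth)
        position = (radius * math.cos(theta), height, radius * math.sin(theta))
        views.append((azimuth, position))
    return views

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_view(text_emb, image_embs, views):
    """Pick the view whose image embedding is most similar to the text embedding."""
    scores = [cosine(text_emb, emb) for emb in image_embs]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return views[idx][0], scores

def toy_embed(azimuth_deg):
    """Assumption: a toy 2D embedding of the viewing direction, NOT a real encoder."""
    t = math.radians(azimuth_deg)
    return (math.cos(t), math.sin(t))

views = circling_camera(8)                      # 8 viewpoints, 45 degrees apart
image_embs = [toy_embed(a) for a, _ in views]   # one embedding per rendered view
text_emb = toy_embed(90.0)                      # pretend the query maps to 90 degrees
azimuth, scores = best_view(text_emb, image_embs, views)
print(azimuth)
```

With a real model, the per-view scores would come from the similarity between the encoded text query and each encoded render, and the highest-scoring azimuth would be compared against the annotated canonical view.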
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
