Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions

Henrik Voigt; Jan Hombeck; Monique Meuschke; Kai Lawonn; Sina Zarrieß

arXiv:2302.10282 · cs.CV · February 22, 2023 · 1 citation

Open Access

TL;DR

This paper evaluates whether CLIP can identify different viewpoints of 3D objects from natural language descriptions. It finds that the pre-trained model struggles on this task and that fine-tuning improves its performance.

Contribution

It introduces an evaluation framework for 3D viewpoint grounding and demonstrates how fine-tuning enhances CLIP's performance in this task.

Findings

01. Without fine-tuning, CLIP performs poorly on most canonical views

02. Fine-tuning with hard negatives improves viewpoint identification

03. Limited training data still yields significant performance gains
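
To make the second finding concrete, the sketch below shows one generic way hard negatives can enter a contrastive objective. This page does not specify the loss the paper uses, so the InfoNCE-style formula, the temperature value, and the helper names `select_hard_negatives` and `contrastive_loss` are all assumptions chosen for illustration, not the authors' method.

```python
import math

def select_hard_negatives(negative_scores, k):
    """Keep the k negatives most similar to the text query (the 'hardest' ones).

    Hypothetical helper: the selection rule is an assumption, not taken from the paper.
    """
    return sorted(negative_scores, reverse=True)[:k]

def contrastive_loss(positive_score, negative_scores, temperature=0.07):
    """InfoNCE-style loss for one text query (assumed form, for illustration only).

    positive_score: similarity between the text and the matching view.
    negative_scores: similarities between the text and non-matching views.
    """
    logits = [positive_score / temperature]
    logits += [score / temperature for score in negative_scores]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability assigned to the positive view.
    return log_sum_exp - logits[0]

# Toy similarities: the matching view scores 0.9, three other views score lower.
hard = select_hard_negatives([0.1, 0.8, 0.3], k=2)
loss_with_hard_negatives = contrastive_loss(0.9, hard)
loss_with_easy_negative = contrastive_loss(0.9, [0.1])
```

In this toy setup the loss is larger when the negatives are close to the positive, which is the usual motivation for sampling hard negatives: they give the model a stronger training signal than views it already separates easily.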

Abstract

Existing language and vision models achieve impressive performance in image-text understanding. Yet, it is an open question to what extent they can be used for language understanding in 3D environments and whether they implicitly acquire 3D object knowledge, e.g. about different views of an object. In this paper, we investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object and identify canonical views of common objects based on text queries. We present an evaluation framework that uses a circling camera around a 3D object to generate images from different viewpoints and evaluate them in terms of their similarity to natural language descriptions. We find that a pre-trained CLIP model performs poorly on most canonical views and that fine-tuning using hard negative sampling and random contrasting yields good results…
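
The evaluation idea described in the abstract (a camera circling the object, each rendered view scored against a text description) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the camera radius, the axis convention, and the function names are hypothetical, and the toy 2D vectors stand in for the image and text embeddings a model such as CLIP would produce. No rendering or model call is performed here.

```python
import math

def circle_camera_positions(n_views, radius=2.0, height=0.0):
    """Return n_views (x, y, z) camera positions evenly spaced on a circle
    around the origin. Radius, height, and y-up axes are assumptions."""
    positions = []
    for i in range(n_views):
        azimuth = 2.0 * math.pi * i / n_views
        positions.append((radius * math.cos(azimuth), height, radius * math.sin(azimuth)))
    return positions

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_views(text_embedding, image_embeddings):
    """Return view indices ordered from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Four viewpoints around the object, one placeholder embedding per rendered view.
cameras = circle_camera_positions(4)
toy_image_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
toy_text_embedding = [0.9, 0.1]  # placeholder for an encoded viewpoint description
ranking = rank_views(toy_text_embedding, toy_image_embeddings)
best_camera = cameras[ranking[0]]
```

In a real pipeline the placeholder vectors would be replaced by embeddings of the rendered images and of the viewpoint description, and the top-ranked camera position would be compared with the annotated canonical view.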

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training