Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM

Navid Rajabi; Jana Kosecka

arXiv:2404.19128 · cs.CV · May 1, 2024

TL;DR

This paper introduces a new set of explainable metrics using GradCAM to evaluate how well vision-language models ground linguistic phrases in images, revealing tradeoffs related to model and dataset size.

Contribution

The paper proposes a novel suite of quantitative, GradCAM-based metrics for assessing the grounding ability of pre-trained vision-language models in a detailed and explainable manner.

Findings

01. GradCAM metrics effectively evaluate grounding in VLMs

02. Model size and dataset size influence grounding performance

03. Tradeoffs exist between model complexity and grounding accuracy

Abstract

Vision and Language Models (VLMs) continue to demonstrate remarkable zero-shot (ZS) performance across various tasks. However, many probing studies have revealed that even the best-performing VLMs struggle to capture aspects of compositional scene understanding, lacking the ability to properly ground and localize linguistic phrases in images. Recent VLM advancements include scaling up both model and dataset sizes, additional training objectives and levels of supervision, and variations in the model architectures. To characterize the grounding ability of VLMs, such as phrase grounding, referring expression comprehension, and relationship understanding, the Pointing Game has been used as an evaluation metric for datasets with bounding box annotations. In this paper, we introduce a novel suite of quantitative metrics that utilize GradCAM activations to rigorously evaluate the grounding…
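The Pointing Game mentioned in the abstract is commonly scored as a hit when the peak of an activation map falls inside the annotated bounding box. The following is a minimal sketch of that baseline only, not of the paper's proposed metrics. It assumes the heatmap (for example, a GradCAM map) has already been resized to the image resolution and that boxes use inclusive pixel coordinates `(x_min, y_min, x_max, y_max)`; the function names `pointing_game_hit` and `pointing_game_accuracy` are hypothetical.

```python
import numpy as np


def pointing_game_hit(heatmap, box):
    """Return True if the heatmap's peak lies inside the box.

    heatmap: 2D array indexed as [row, col], assumed to match the image size.
    box: (x_min, y_min, x_max, y_max) in inclusive pixel coordinates (assumption).
    """
    # Locate the single highest activation (first one if there are ties).
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x_min, y_min, x_max, y_max = box
    return bool(x_min <= col <= x_max and y_min <= row <= y_max)


def pointing_game_accuracy(heatmaps, boxes):
    """Fraction of (heatmap, box) pairs whose peak falls inside the box."""
    hits = [pointing_game_hit(h, b) for h, b in zip(heatmaps, boxes)]
    return sum(hits) / len(hits)


# Toy example: an 8x8 map whose only activation is at row 2, column 5.
toy_heatmap = np.zeros((8, 8))
toy_heatmap[2, 5] = 1.0

inside_box = (4, 1, 6, 3)   # contains column 5, row 2
outside_box = (0, 0, 3, 3)  # column 5 lies to the right of this box

toy_accuracy = pointing_game_accuracy(
    [toy_heatmap, toy_heatmap], [inside_box, outside_box]
)
```

Because only the location of the maximum is checked, this score ignores how the rest of the activation is distributed inside and outside the box, which is the kind of detail a finer-grained metric would need to capture.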

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: BLIP: Bootstrapping Language-Image Pre-training · Contrastive Language-Image Pre-training · ALBEF