Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM
Navid Rajabi, Jana Kosecka

TL;DR
This paper introduces a new set of explainable metrics using GradCAM to evaluate how well vision-language models ground linguistic phrases in images, revealing tradeoffs related to model and dataset size.
Contribution
The paper proposes a novel suite of quantitative, GradCAM-based metrics for assessing the grounding ability of pre-trained vision-language models in a detailed and explainable manner.
Findings
GradCAM metrics effectively evaluate grounding in VLMs
Model size and dataset size influence grounding performance
Tradeoffs exist between model complexity and grounding accuracy
Abstract
Vision and Language Models (VLMs) continue to demonstrate remarkable zero-shot (ZS) performance across various tasks. However, many probing studies have revealed that even the best-performing VLMs struggle to capture aspects of compositional scene understanding, lacking the ability to properly ground and localize linguistic phrases in images. Recent VLM advancements include scaling up both model and dataset sizes, additional training objectives and levels of supervision, and variations in the model architectures. To characterize the grounding ability of VLMs, such as phrase grounding, referring expressions comprehension, and relationship understanding, Pointing Game has been used as an evaluation metric for datasets with bounding box annotations. In this paper, we introduce a novel suite of quantitative metrics that utilize GradCAM activations to rigorously evaluate the grounding…
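For readers unfamiliar with the Pointing Game baseline the abstract mentions, the following is a minimal sketch of how it is commonly computed: a sample counts as a hit when the peak of the activation map falls inside the annotated bounding box, and accuracy is the fraction of hits. The function names, the `(x_min, y_min, x_max, y_max)` box layout, and the inclusive-boundary convention are assumptions made for illustration, not details taken from the paper, and this is not the paper's new metric suite.

```python
import numpy as np


def pointing_game_hit(heatmap, box):
    """Return True if the heatmap's peak lies inside the box.

    heatmap: 2D array of activations (rows = y, columns = x).
    box: (x_min, y_min, x_max, y_max) in pixel coordinates,
         treated as inclusive on all sides (an assumption of this sketch).
    """
    # Locate the (row, col) position of the strongest activation.
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x_min, y_min, x_max, y_max = box
    return bool(x_min <= col <= x_max and y_min <= row <= y_max)


def pointing_game_accuracy(heatmaps, boxes):
    """Fraction of samples whose activation peak falls inside its box."""
    hits = [pointing_game_hit(h, b) for h, b in zip(heatmaps, boxes)]
    return sum(hits) / len(hits) if hits else 0.0


if __name__ == "__main__":
    # Toy activation map with a single peak at row 1, column 2.
    toy = np.zeros((4, 4))
    toy[1, 2] = 1.0
    print(pointing_game_accuracy([toy, toy], [(2, 1, 3, 3), (0, 0, 1, 1)]))
```

Because only the location of the single strongest activation matters, this score ignores how the rest of the activation mass is distributed, which is the kind of coarseness that motivates finer-grained metrics.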
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: BLIP: Bootstrapping Language-Image Pre-training · Contrastive Language-Image Pre-training · ALBEF
