A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

Noriyuki Kojima; Hadar Averbuch-Elor; Yoav Artzi

arXiv:2309.02691·cs.CL·June 3, 2024·1 cites

TL;DR

This paper introduces a framework and three benchmarks for studying the relationship between phrase grounding and task performance in vision and language models. It finds that contemporary models ground phrases and solve tasks inconsistently, and that training directly on phrase grounding annotations addresses this.

Contribution

The paper presents a framework and three benchmarks for jointly analyzing phrase grounding and task performance, documents the inconsistency between the two in contemporary models, and analyzes brute-force training on grounding annotations as a remedy.

Findings

01 Contemporary models show inconsistency between grounding ability and task performance.

02 Brute-force training on grounding annotations improves grounding consistency.

03 Analysis of training dynamics reveals factors influencing grounding and task success.

Abstract

Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, along with three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code is available at https://github.com/lil-lab/phrase_grounding.
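
To make the idea of jointly studying task performance and phrase grounding concrete, here is a minimal sketch of one way such a comparison could be tabulated. It is not the paper's framework or metric: the box format, the intersection-over-union (IoU) criterion, the 0.5 threshold, and the names `iou` and `joint_counts` are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Assumed box format (not from the paper): (x1, y1, x2, y2) corners.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_counts(
    task_correct: List[bool],
    pred_boxes: List[Box],
    gold_boxes: List[Box],
    threshold: float = 0.5,  # assumed cutoff for "grounded correctly"
) -> Dict[str, int]:
    """Cross-tabulate task success against grounding success per example."""
    counts = {"both": 0, "task_only": 0, "grounding_only": 0, "neither": 0}
    for ok, pred, gold in zip(task_correct, pred_boxes, gold_boxes):
        grounded = iou(pred, gold) >= threshold
        if ok and grounded:
            counts["both"] += 1
        elif ok:
            counts["task_only"] += 1
        elif grounded:
            counts["grounding_only"] += 1
        else:
            counts["neither"] += 1
    return counts
```

Under these assumptions, a large `task_only` or `grounding_only` count relative to `both` would be one sign of the kind of inconsistency the abstract describes.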

Code & Models

Repositories

lil-lab/phrase_grounding
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling