A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima, Hadar Averbuch-Elor, Yoav Artzi

TL;DR
This paper introduces a framework and three benchmarks to study the relationship between phrase grounding and task performance in vision-language models. It finds that contemporary models are inconsistent between the two and shows that brute-force training on phrase grounding annotations can address this.
Contribution
It presents a framework and three benchmarks for jointly analyzing phrase grounding and task performance, documents the inconsistency between the two in contemporary models, and examines brute-force training on grounding annotations as a remedy.
Findings
Contemporary models show inconsistency between grounding ability and task performance.
Brute-force training on grounding annotations improves grounding consistency.
The paper analyzes the dynamics that brute-force grounding training creates between grounding and task performance.
Abstract
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code is available at https://github.com/lil-lab/phrase_grounding.
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
