Improved Visual Grounding through Self-Consistent Explanations
Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez

TL;DR
This paper introduces SelfEQ, a weakly-supervised method that enhances visual grounding in vision-language models by ensuring self-consistent explanations through paraphrase-based finetuning, leading to improved localization accuracy.
Contribution
The paper proposes a novel self-consistent explanation strategy using paraphrases and finetuning, significantly improving visual grounding performance without box annotations.
Findings
Improved accuracy on Flickr30k, ReferIt, and RefCOCO+ datasets.
Enhanced localization quality of gradient-based explanation methods.
Expanded vocabulary coverage and better object localization through self-consistency.
Abstract
Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization --"grounding"-- abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model, and SelfEQ, a weakly-supervised strategy on visual explanation maps for paraphrases that encourages self-consistency. Specifically, for an input textual phrase, we attempt to generate a paraphrase and finetune the model so that the phrase and paraphrase map to the same region in the image. We posit that this both expands the vocabulary that the model is able to handle, and improves the quality of the object locations highlighted by gradient-based visual…
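The self-consistency idea in the abstract can be illustrated with a small sketch. It is not the authors' SelfEQ objective, which the abstract does not spell out. It assumes a hypothetical setup in which the model has already produced one explanation heatmap for a phrase and one for its paraphrase, and it penalizes their disagreement with a plain mean squared difference after min-max normalization. The names `normalize` and `consistency_loss`, and the toy 2x2 heatmaps, are invented for illustration.

```python
from typing import List

Heatmap = List[List[float]]


def normalize(heatmap: Heatmap) -> Heatmap:
    """Min-max rescale a heatmap to [0, 1] so two maps are comparable."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no localization signal.
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / span for v in row] for row in heatmap]


def consistency_loss(map_phrase: Heatmap, map_paraphrase: Heatmap) -> float:
    """Mean squared difference between the two normalized maps.

    Assumption: a simple stand-in for a self-consistency term; a low value
    means the phrase and the paraphrase highlight the same image region.
    """
    a, b = normalize(map_phrase), normalize(map_paraphrase)
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy heatmaps (hypothetical values): the right column is the object.
phrase_map = [[0.1, 0.9], [0.2, 0.8]]
same_region = [[0.2, 1.8], [0.4, 1.6]]   # same region, different scale
other_region = [[0.9, 0.1], [0.8, 0.2]]  # attention on the wrong side

loss_same = consistency_loss(phrase_map, same_region)
loss_other = consistency_loss(phrase_map, other_region)
```

In this toy case the loss is near zero when both maps highlight the same region and large when they disagree, which is the signal a finetuning step would try to minimize.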
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling
