Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez

TL;DR
This paper introduces Attention Mask Consistency (AMC), a margin-based loss that improves visual grounding by aligning gradient explanations with human-annotated regions, achieving state-of-the-art results on multiple benchmarks.
Contribution
The paper proposes a novel AMC loss that enhances vision-language models' grounding accuracy by enforcing explanation consistency with region annotations.
Findings
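AMC achieves 86.49% accuracy on the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% over the best previous model trained under the same level of supervision.
On referring expression comprehension, it reaches 80.34% accuracy on the easy split of RefCOCO+.
The authors present the method as effective, simple, and general across models and annotation types.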
Abstract
We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces better visual grounding results than previous methods that rely on using vision-language models to score the outputs of object detectors. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% when compared to the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension where it obtains 80.34%…
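To make the idea of a margin-based consistency objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not state the exact formula, so the hinge form, the use of both a mean term and a max term, the function name `margin_consistency_loss`, and the default margins are all assumptions chosen for illustration. The sketch takes an explanation heatmap and a binary region mask of the same shape and returns a penalty that drops to zero once the heatmap is stronger inside the annotated region than outside it by the given margins.

```python
from typing import Sequence


def margin_consistency_loss(
    heatmap: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    margin_mean: float = 0.5,  # hypothetical margin, not taken from the paper
    margin_max: float = 0.5,   # hypothetical margin, not taken from the paper
) -> float:
    """Hinge-style penalty on an explanation heatmap (assumed formulation).

    The penalty is zero when both the mean and the max heatmap value inside
    the annotated region exceed those outside it by at least the margins.
    """
    # Split heatmap values by whether the mask marks them as inside the region.
    inside = [h for h_row, m_row in zip(heatmap, mask)
              for h, m in zip(h_row, m_row) if m]
    outside = [h for h_row, m_row in zip(heatmap, mask)
               for h, m in zip(h_row, m_row) if not m]
    if not inside or not outside:
        raise ValueError("mask must mark at least one cell inside and one outside")

    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)

    # Penalize when the outside statistic comes within the margin of the inside one.
    loss_mean = max(0.0, mean_out - mean_in + margin_mean)
    loss_max = max(0.0, max(outside) - max(inside) + margin_max)
    return loss_mean + loss_max


# Toy 2x2 example: the heatmap is concentrated in the top row.
heatmap = [[0.9, 0.8], [0.1, 0.0]]
aligned_mask = [[1, 1], [0, 0]]     # annotation agrees with the heatmap
misaligned_mask = [[0, 0], [1, 1]]  # annotation disagrees with the heatmap

aligned_loss = margin_consistency_loss(heatmap, aligned_mask)
misaligned_loss = margin_consistency_loss(heatmap, misaligned_mask)
```

In this toy case the aligned mask incurs no penalty, while the misaligned mask is penalized on both terms. In actual training, the heatmap would be a differentiable gradient-based explanation computed from the vision-language model, and a term of this kind would be added to the standard vision-language objectives.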
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
Methods: RoIPool · Convolution · Region Proposal Network · Softmax · ALIGN · Faster R-CNN
