Deconfounded Visual Grounding
Jianqiang Huang, Yu Qin, Jiaxin Qi, Qianru Sun, Hanwang Zhang

TL;DR
This paper identifies confounding bias in visual grounding caused by language-location correlations and proposes a causal, deconfounded approach called RED that improves grounding accuracy across benchmarks.
Contribution
It introduces a causal framework for visual grounding and a novel deconfounding method, RED, that enhances existing models by removing confounding bias.
Findings
RED significantly improves state-of-the-art grounding methods.
The causal graph clarifies the confounding bias source.
RED is applicable to any grounding method.
Abstract
We focus on the confounding bias between language and location in the visual grounding pipeline, and we find that this bias is the major bottleneck for visual reasoning. For example, the grounding process often reduces to a trivial language-location association with no visual reasoning, e.g., grounding any language query that contains "sheep" to regions near the image center, because most queries about sheep have ground-truth locations at the image center. First, we frame the visual grounding pipeline as a causal graph, which shows the causal relations among the image, query, target location, and underlying confounder. The causal graph shows how to break the grounding bottleneck: deconfounded visual grounding. Second, to tackle the challenge that the confounder is generally unobserved, we propose a confounder-agnostic approach called Referring Expression Deconfounder (RED), to remove the…
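The sheep example in the abstract can be made concrete with a toy sketch. This is not RED and not any method from the paper; it is a hypothetical "image-blind" baseline that shows the shortcut being criticized. The function names (fit_language_prior, predict_center), the toy samples, and the normalized (cx, cy) center format are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_language_prior(samples):
    """Learn an average box center per word, ignoring the image entirely.

    samples: iterable of (query, (cx, cy)), with centers normalized to [0, 1].
    (Hypothetical data format, chosen only for this illustration.)
    """
    centers = defaultdict(list)
    for query, center in samples:
        for word in set(query.lower().split()):
            centers[word].append(center)
    return {
        word: (mean(c[0] for c in cs), mean(c[1] for c in cs))
        for word, cs in centers.items()
    }


def predict_center(prior, query, default=(0.5, 0.5)):
    """Predict a location from the query words alone (no image input)."""
    hits = [prior[w] for w in query.lower().split() if w in prior]
    if not hits:
        return default
    return (mean(h[0] for h in hits), mean(h[1] for h in hits))


# Toy training set (made up): sheep annotations cluster near the image center.
toy_samples = [
    ("sheep standing", (0.50, 0.50)),
    ("white sheep", (0.52, 0.48)),
    ("sheep grazing", (0.48, 0.52)),
    ("kite flying", (0.50, 0.10)),
]

prior = fit_language_prior(toy_samples)
sheep_guess = predict_center(prior, "sheep")
print(sheep_guess)  # close to the image center, whatever the image shows
```

Because predict_center never receives the image, it returns the same location for every picture paired with the same query. A grounding model that scores well mainly by reproducing such a prior is relying on the language-location correlation rather than on visual reasoning.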
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
