Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer Grounding
Junwen Pan, Guanlin Chen, Yi Liu, Jiexiang Wang, Cheng Bian, Pengfei Zhu, Zhicheng Zhang

TL;DR
This paper introduces DaVI, a dual visual-linguistic interaction framework for answer grounding in VQA that produces both the answer and its visual evidence in one end-to-end model, ranking 1st in answer grounding at the 2022 VizWiz Grand Challenge.
Contribution
DaVI is presented as a novel unified end-to-end framework that performs both linguistic answering and visual grounding through two visual-linguistic interaction mechanisms.
Findings
Ranked 1st in 2022 VizWiz Grand Challenge answer grounding
Outperforms previous methods in visual question answering interpretability
Demonstrates effective visual-linguistic integration
Abstract
Answer grounding aims to reveal the visual evidence for visual question answering (VQA), which entails highlighting relevant positions in the image when answering questions about images. Previous attempts typically tackle this problem using pretrained object detectors, but without the flexibility for objects not in the predefined vocabulary. However, these black-box methods solely concentrate on the linguistic generation, ignoring the visual interpretability. In this paper, we propose Dual Visual-Linguistic Interaction (DaVI), a novel unified end-to-end framework with the capability for both linguistic answering and visual grounding. DaVI innovatively introduces two visual-linguistic interaction mechanisms: 1) visual-based linguistic encoder that understands questions incorporated with visual features and produces linguistic-oriented evidence for further answer decoding, and 2)…
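As a rough illustration of what it can mean for a question to be encoded together with visual features, the sketch below lets question tokens attend over image patch features with plain scaled dot-product cross-attention. This is not DaVI's architecture: the function name `cross_attend`, the shapes, the residual connection, the random inputs, and the absence of learned projections are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(question_tokens, image_patches):
    """Hypothetical helper: question tokens (queries) attend over image
    patches (keys and values). No learned weights; illustration only."""
    d = question_tokens.shape[-1]
    scores = question_tokens @ image_patches.T / np.sqrt(d)  # (tokens, patches)
    weights = softmax(scores, axis=-1)                       # each row sums to 1
    fused = question_tokens + weights @ image_patches        # residual fusion
    return fused, weights

# Assumed toy inputs: 5 question tokens and a 7x7 grid of patches, 16-dim each.
rng = np.random.default_rng(0)
question_tokens = rng.normal(size=(5, 16))
image_patches = rng.normal(size=(49, 16))

fused, weights = cross_attend(question_tokens, image_patches)

# Averaging attention over tokens gives a coarse map of which image
# positions the question attended to.
heatmap = weights.mean(axis=0).reshape(7, 7)
```

The attention weights double as a crude spatial map over the image, which conveys the general idea of tying a linguistic representation to image positions; a grounding model would normally predict a region or mask with trained components rather than read it off raw attention.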
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
