Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer Grounding

Junwen Pan; Guanlin Chen; Yi Liu; Jiexiang Wang; Cheng Bian; Pengfei Zhu; Zhicheng Zhang

arXiv:2207.05703 · cs.CV · July 13, 2022

TL;DR

This paper introduces DaVI, a unified end-to-end framework for answer grounding in VQA that both answers the question and highlights the supporting image region through two visual-linguistic interaction mechanisms. It ranked 1st in the answer grounding track of the 2022 VizWiz Grand Challenge.

Contribution

DaVI is a novel unified end-to-end framework that performs both linguistic answering and visual evidence grounding through two visual-linguistic interaction mechanisms.

Findings

01 Ranked 1st in the 2022 VizWiz Grand Challenge for answer grounding

02 Outperforms previous methods in visual question answering interpretability

03 Demonstrates effective visual-linguistic integration

Abstract

Answer grounding aims to reveal the visual evidence for visual question answering (VQA), which entails highlighting relevant positions in the image when answering questions about images. Previous attempts typically tackle this problem using pretrained object detectors, but without the flexibility for objects not in the predefined vocabulary. However, these black-box methods solely concentrate on linguistic generation, ignoring visual interpretability. In this paper, we propose Dual Visual-Linguistic Interaction (DaVI), a novel unified end-to-end framework with the capability for both linguistic answering and visual grounding. DaVI introduces two visual-linguistic interaction mechanisms: 1) a visual-based linguistic encoder that understands questions incorporated with visual features and produces linguistic-oriented evidence for further answer decoding, and 2)…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition