SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

Peixi Xiong; Quanzeng You; Pei Yu; Zicheng Liu; Ying Wu

arXiv:2201.10654 · cs.CV · January 27, 2022 · 5 cites

TL;DR

This paper introduces a structured alignment approach using graph representations to improve visual question answering by capturing deep cross-modal connections, leading to better reasoning and interpretability.

Contribution

It proposes a novel graph-based structured alignment method for VQA that enhances reasoning and interpretability without pretraining.

Findings

01. Outperforms state-of-the-art methods on the GQA dataset.

02. Outperforms non-pretrained state-of-the-art methods on the VQA-v2 dataset.

03. Improves reasoning performance and interpretability.

Abstract

Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as the correlations between visual regions and their semantic labels, or the interactions between question words and object features. These attempts aim to improve the cross-modality representations while ignoring their internal relations. Instead, we propose to apply structured alignments, which work with graph representations of visual and textual content, aiming to capture the deep connections between the visual and textual modalities. Nevertheless, it is nontrivial to represent and integrate graphs for structured alignments. In this work, we attempt to solve…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning