SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Feiqi Cao; Siwen Luo; Felipe Nunez; Zean Wen; Josiah Poon; Caren Han

arXiv:2212.08283 · cs.CV · August 8, 2023 · 1 citation

Open Access

TL;DR

SceneGATE introduces a scene graph-based co-attention network for TextVQA that models semantic relations among objects, OCR tokens, and question words, improving performance on the TextVQA and ST-VQA benchmark datasets.

Contribution

The paper presents a novel scene graph-based co-attention network for TextVQA that explicitly models semantic relations among objects, OCR tokens, and question words, both within and across modalities.

Findings

1. Outperforms existing methods on the TextVQA and ST-VQA datasets.

2. Effectively captures intra- and inter-modal semantic relations.

3. Improves accuracy by leveraging a scene graph and specialized attention modules.

Abstract

Most TextVQA approaches integrate objects, scene texts, and question words with a simple transformer encoder, which fails to capture the semantic relations between different modalities. The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens, and the question words. This is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We created a guided-attention module to capture the intra-modal interplay between language and vision as guidance for inter-modal interactions. To explicitly teach the relations between the two modalities, we proposed and integrated two attention modules: a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We…
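
The idea of relation-aware attention mentioned in the abstract can be illustrated with a minimal sketch. This is a generic formulation, not the paper's exact method: it assumes that a scene-graph relation between two nodes is expressed as an additive bias on the scaled dot-product attention score. The function name `relation_aware_attention`, the three example nodes, and the bias value 2.0 are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_aware_attention(queries, keys, values, relation_bias):
    """Scaled dot-product attention with an additive relation bias.

    relation_bias[i][j] is added to the score between query i and key j.
    Assumption: a larger bias stands for a scene-graph edge between nodes.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [
            sum(qa * ka for qa, ka in zip(q, k)) / math.sqrt(d)
            + relation_bias[i][j]
            for j, k in enumerate(keys)
        ]
        w = softmax(scores)
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical nodes: 0 = object "sign", 1 = OCR token "STOP", 2 = question word.
# All nodes share the same feature vector so only the bias changes the weights.
features = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

# Hypothetical scene graph: a single edge between the object and the OCR token.
relation_bias = [
    [0.0, 2.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

outputs, weights = relation_aware_attention(
    features, features, features, relation_bias
)
for row in weights:
    print([round(w, 3) for w in row])
```

With identical features, the object node attends most strongly to the OCR token it is linked to, while the unlinked question-word node attends uniformly. A positional relation-aware variant could be sketched the same way by deriving the bias from spatial relations between bounding boxes instead of scene-graph edges; that, too, is an assumption rather than the paper's stated design.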

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques