Separate and Locate: Rethink the Text in Text-based Visual Question Answering

Chengyang Fang; Jiangnan Li; Liang Li; Can Ma; Dayong Hu

arXiv:2308.16383·cs.CV·September 1, 2023·1 cites

TL;DR

This paper introduces the SaL method for TextVQA that improves spatial and semantic understanding of OCR texts, leading to significant accuracy gains without pre-training.

Contribution

The paper proposes a novel approach with the Text Semantic Separate and Spatial Circle Position modules to better model spatial and semantic relations in TextVQA.

Findings

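01 Outperforms the baseline by 4.44% accuracy on TextVQA

02 Achieves a 2.68% accuracy improvement over pre-training methods, without using any pre-training itself

03 Enhances spatial reasoning over OCR texts by modeling their spatial relations in the image rather than only their order in a sequence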

Abstract

Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task have no semantic contextual relationship with one another. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable. The 1-D position embedding can only represent the left-to-right sequence relationship between words in a sentence, not the complex spatial position relationship between texts in an image. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual…
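The limitation of reading-order sequences described in the abstract can be illustrated with a small sketch. This is not the SaL method or any code from the paper; the token list, pixel coordinates, the `row_height` bucketing, and all function names are hypothetical and chosen only to show how a 1-D index can disagree with 2-D layout.

```python
from math import hypot

# Hypothetical OCR tokens: (word, x_center, y_center) in pixels.
# Two signs side by side: "EXIT / 12" on the left, "COFFEE / OPEN" on the right.
tokens = [
    ("OPEN", 598, 88),
    ("EXIT", 50, 40),
    ("12", 52, 90),
    ("COFFEE", 600, 42),
]

def reading_order(toks, row_height=50):
    """Sort top-to-bottom, then left-to-right (assumed row bucketing by row_height)."""
    return sorted(toks, key=lambda t: (t[2] // row_height, t[1]))

def index_distance(seq, a, b):
    """Distance between two words as seen by a 1-D position index."""
    words = [t[0] for t in seq]
    return abs(words.index(a) - words.index(b))

def spatial_distance(seq, a, b):
    """Euclidean distance between the two words' centers in the image."""
    pos = {t[0]: (t[1], t[2]) for t in seq}
    (xa, ya), (xb, yb) = pos[a], pos[b]
    return hypot(xa - xb, ya - yb)

seq = reading_order(tokens)
print([t[0] for t in seq])

# In the 1-D sequence, EXIT sits next to COFFEE and further from 12 ...
print(index_distance(seq, "EXIT", "COFFEE"), index_distance(seq, "EXIT", "12"))

# ... but in the image, EXIT is far closer to 12 than to COFFEE.
print(spatial_distance(seq, "EXIT", "COFFEE"), spatial_distance(seq, "EXIT", "12"))
```

In this toy layout the sequence index places "EXIT" next to "COFFEE", even though "EXIT" and "12" belong to the same sign. A purely sequential position embedding carries the first relationship and loses the second, which is the mismatch the abstract points to.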

Code & Models

Repositories

fangbufang/sal · jax · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Focus