Separate and Locate: Rethink the Text in Text-based Visual Question Answering
Chengyang Fang, Jiangnan Li, Liang Li, Can Ma, Dayong Hu

TL;DR
This paper introduces the SaL method for TextVQA that improves spatial and semantic understanding of OCR texts, leading to significant accuracy gains without pre-training.
Contribution
The paper proposes a novel approach with the Text Semantic Separate and Spatial Circle Position modules to better model spatial and semantic relations in TextVQA.
Findings
Outperforms the baseline by 4.44% accuracy on TextVQA
Achieves a 2.68% accuracy improvement over pre-training methods, without any pre-training of its own
Enhances spatial reasoning in OCR text analysis
Abstract
Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task do not have a semantical contextual relationship. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable. The 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, but not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual…
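The limitation of 1-D position described in the abstract can be illustrated with a small sketch. This is not the SaL method or any of its modules. It only shows, for a made-up four-word sign, how flattening OCR tokens into reading order can put spatially distant words next to each other in the sequence. The token list, the pixel coordinates, the `row_height` parameter, and all function names are assumptions chosen for illustration.

```python
from math import hypot

# Hypothetical OCR tokens: (text, x_center, y_center) in pixels.
# Two columns of a sign: "EXIT" above "24" on the left, "OPEN" above "HOURS" on the right.
tokens = [
    ("24", 50, 90),
    ("OPEN", 450, 40),
    ("HOURS", 450, 90),
    ("EXIT", 50, 40),
]

def reading_order(toks, row_height=50):
    """Sort top-to-bottom, then left-to-right, bucketing rows by row_height pixels."""
    return sorted(toks, key=lambda t: (t[2] // row_height, t[1]))

def index_distance(order, a, b):
    """Distance between two tokens as seen by a 1-D position index."""
    names = [t[0] for t in order]
    return abs(names.index(a) - names.index(b))

def spatial_distance(toks, a, b):
    """Euclidean distance between the two token centers in the image."""
    pos = {t[0]: (t[1], t[2]) for t in toks}
    (xa, ya), (xb, yb) = pos[a], pos[b]
    return hypot(xa - xb, ya - yb)

order = reading_order(tokens)
print([t[0] for t in order])
print(index_distance(order, "EXIT", "OPEN"), spatial_distance(tokens, "EXIT", "OPEN"))
print(index_distance(order, "EXIT", "24"), spatial_distance(tokens, "EXIT", "24"))
```

In this toy layout, "EXIT" and "OPEN" are adjacent in the 1-D sequence although they sit far apart in the image, while "24" lies directly below "EXIT" yet is two positions away in the sequence. This is the kind of mismatch between sequence order and 2-D layout that the abstract argues a 1-D position embedding cannot represent.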
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
