Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA
Yongxin Zhu, Zhen Liu, Yukang Liang, Xin Li, Hao Liu, Changcun Bao, Linli Xu

TL;DR
This paper introduces a "Locate Then Generate" framework for scene-text VQA that unifies the visual and linguistic semantics of scene text via bounding boxes, improving accuracy without scene-text pre-training.
Contribution
It proposes a new paradigm that explicitly links visual and linguistic information using bounding boxes, enhancing scene text question answering performance.
Findings
Boosts accuracy by over 6% on TextVQA and ST-VQA datasets.
Unifies visual and linguistic modalities through bounding boxes.
Operates effectively without scene text pre-training.
Abstract
In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images for question answering. Unlike text or visual objects, which can exist independently, scene text naturally links the text and visual modalities by conveying linguistic semantics while simultaneously being a visual object in an image. Unlike conventional STVQA models, which treat the linguistic semantics and visual semantics of scene text as two separate features, we propose a paradigm of "Locate Then Generate" (LTG), which explicitly unifies these two semantics with the spatial bounding box as a bridge connecting them. Specifically, LTG first locates the region in an image that may contain the answer words with an answer location module (ALM) consisting of a region proposal network and a language…
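The idea of using a bounding box as the bridge between the located image region and the words it contains can be illustrated with a minimal sketch. This is not the authors' implementation: `OcrToken`, `overlap_ratio`, `tokens_in_region`, the 0.5 threshold, and the top-to-bottom, left-to-right ordering are all hypothetical choices made for illustration. It assumes an answer region has already been predicted and that an OCR system has supplied tokens with boxes in `(x1, y1, x2, y2)` pixel coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class OcrToken:
    """Hypothetical container for one OCR word and its bounding box."""
    text: str
    box: Box


def overlap_ratio(token_box: Box, region: Box) -> float:
    """Return the fraction of the token box's area that lies inside the region."""
    ix1 = max(token_box[0], region[0])
    iy1 = max(token_box[1], region[1])
    ix2 = min(token_box[2], region[2])
    iy2 = min(token_box[3], region[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (token_box[2] - token_box[0]) * (token_box[3] - token_box[1])
    return inter / area if area > 0 else 0.0


def tokens_in_region(
    tokens: List[OcrToken], region: Box, threshold: float = 0.5
) -> List[OcrToken]:
    """Keep tokens that mostly fall inside the located region.

    The threshold and the rough reading order (top edge, then left edge)
    are illustrative assumptions, not values taken from the paper.
    """
    selected = [t for t in tokens if overlap_ratio(t.box, region) >= threshold]
    return sorted(selected, key=lambda t: (t.box[1], t.box[0]))
```

In this sketch, the words returned by `tokens_in_region` would be the candidates handed to whatever component produces the final answer; the box acts as the shared key between "where to look" and "what is written there".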
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
