Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering

Chengyang Fang; Gangyan Zeng; Yu Zhou; Daiqing Wu; Can Ma; Dayong Hu; Weiping Wang

arXiv:2203.12929 · cs.CV · September 6, 2023 · 1 citation

TL;DR

This paper introduces SC-Net, a semantics-centered model for text-based visual question answering (TextVQA) that resists language bias and accumulated OCR errors by guiding answer prediction with semantic information, and reports improved results on the TextVQA and ST-VQA datasets.

Contribution

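The paper presents a Semantics-Centered Network (SC-Net) built from two modules, an instance-level contrastive semantic prediction module (ICSP) and a semantics-centered transformer module (SCT), to address the lack of semantic guidance in the answer prediction of existing TextVQA models.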

Findings

01. SC-Net outperforms previous models on TextVQA and ST-VQA datasets.

02. The model demonstrates robustness against language biases and OCR errors.

03. Extensive experiments validate the effectiveness of the proposed approach.

Abstract

Texts in scene images convey critical information for scene understanding and reasoning. The abilities of reading and reasoning matter for the model in the text-based visual question answering (TextVQA) process. However, current TextVQA models do not center on the text and suffer from several limitations. The model is easily dominated by language biases and optical character recognition (OCR) errors due to the absence of semantic guidance in the answer prediction process. In this paper, we propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module (ICSP) and a semantics-centered transformer module (SCT). Equipped with the two modules, the semantics-centered model can resist the language biases and the accumulated errors from OCR. Extensive experiments on TextVQA and ST-VQA datasets show the effectiveness of our model.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition