ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering

Nghia Hieu Nguyen; Tho Thanh Quan; Ngan Luu-Thuy Nguyen

arXiv:2410.14132 · cs.CV · October 25, 2024

TL;DR

ViConsFormer is a transformer-based method that composes meaningful phrases from scene texts written in Vietnamese and reports state-of-the-art results on two large-scale Vietnamese text-based visual question answering datasets.

Contribution

The paper introduces a new linguistically motivated approach that leverages scene text meaning, achieving state-of-the-art results in Vietnamese Text-based VQA.

Findings

01. Achieved state-of-the-art performance on Vietnamese Text-based VQA datasets.

02. Effectively exploits scene text meaning using a linguistically grounded method.

03. Demonstrated the importance of semantic understanding in scene text analysis.

Abstract

Text-based VQA is a challenging task that requires machines to use scene texts in given images to yield the most appropriate answer for the given question. The main challenge of text-based VQA is exploiting the meaning and information from scene texts. Recent studies tackled this challenge by considering the spatial information of scene texts in images via embedding 2D coordinates of their bounding boxes. In this study, we follow the definition of meaning from linguistics to introduce a novel method that effectively exploits the information from scene texts written in Vietnamese. Experimental results show that our proposed method obtains state-of-the-art results on two large-scale Vietnamese Text-based VQA datasets. The implementation is available in the hieunghia-pat/ViConsFormer repository.
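The abstract describes earlier work as embedding the 2D bounding-box coordinates of scene texts. The sketch below illustrates that general idea only; it is not the ViConsFormer method and is not taken from the authors' repository. The function names (`normalize_bbox`, `make_projection`, `spatial_token_embedding`), the 4-value box layout, the random projection weights, and the choice to add the box embedding to the text embedding are all assumptions made for illustration.

```python
import random


def normalize_bbox(box, image_width, image_height):
    """Scale pixel coordinates (x1, y1, x2, y2) into the [0, 1] range."""
    x1, y1, x2, y2 = box
    return [x1 / image_width, y1 / image_height,
            x2 / image_width, y2 / image_height]


def linear(vector, weight, bias):
    """Apply y = W v + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def make_projection(in_dim, out_dim, seed=0):
    """Hypothetical stand-in for a learned projection: small random weights, zero bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def spatial_token_embedding(text_embedding, box, image_size, projection):
    """Add a projected bounding-box embedding to an OCR token's text embedding."""
    weight, bias = projection
    coords = normalize_bbox(box, image_size[0], image_size[1])
    box_embedding = linear(coords, weight, bias)
    return [t + b for t, b in zip(text_embedding, box_embedding)]


# Usage: one OCR token with an 8-dimensional (all-zero) text embedding,
# located at pixels (40, 20)-(200, 60) in a 400x200 image.
projection = make_projection(in_dim=4, out_dim=8)
token = spatial_token_embedding([0.0] * 8, (40, 20, 200, 60), (400, 200), projection)
```

In a trained model the projection would be learned jointly with the rest of the network; the random weights here only make the sketch self-contained.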

Code & Models

Repositories

hieunghia-pat/ViConsFormer (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications