Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark

Hieu Minh Nguyen; Tam Le-Thanh Dang; Kiet Van Nguyen

arXiv:2512.22218 · cs.CV · December 30, 2025

Open Access

TL;DR

This paper introduces ViSignVQA, a large-scale Vietnamese signboard dataset for VQA, along with adapted models and a multi-agent framework, demonstrating significant improvements in understanding signboard text in natural scenes.

Contribution

The paper provides the first large-scale Vietnamese signboard VQA dataset, benchmarks adapted VQA models with OCR integration, and proposes a multi-agent VQA framework using GPT-4 for improved accuracy.

Findings

01. OCR integration improves F1-score by up to 209%.

02. Multi-agent framework achieves 75.98% accuracy.

03. Dataset captures diverse Vietnamese signboard characteristics.
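
To make the "up to 209%" figure concrete: a relative improvement of 209% means the improved score is roughly 3.09 times the baseline, not 209 percentage points higher. The Python sketch below illustrates that arithmetic together with a token-level F1, a common metric for free-form VQA answers. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `token_f1` and `relative_improvement`, the whitespace tokenization, and the example scores are all hypothetical, and the paper may compute F1 differently.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Assumption: lowercase + whitespace tokenization; the paper's exact
    normalization is not stated in this summary.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain in percent: 100 * (improved - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical scores chosen only to illustrate the arithmetic.
    print(round(relative_improvement(0.20, 0.618), 1))
    print(round(token_f1("phở bò", "phở bò hà nội"), 3))
```

With the hypothetical scores above, a baseline F1 of 0.20 rising to 0.618 corresponds to a 209% relative gain, which shows why a large relative figure can coexist with a modest absolute score.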

Abstract

Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Hand Gesture Recognition Systems