AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

Nguyen Anh Tuong; Phan Ba Duc; Nguyen Trung Quoc; Tran Dac Thinh; Dang Duy Lan; Nguyen Quoc Thinh; Tung Le

arXiv:2603.09689·cs.CV·March 12, 2026

TL;DR

This paper introduces AutoViVQA, a large-scale Vietnamese VQA dataset, and explores transformer-based models for Vietnamese visual question answering, comparing evaluation metrics and emphasizing multimodal understanding.

Contribution

The paper presents AutoViVQA, a new large-scale dataset for Vietnamese VQA, and systematically evaluates transformer-based models and automatic metrics in this low-resource language setting.

Findings

01. Transformer models improve Vietnamese VQA performance.

02. Automatic metrics vary in correlation with human judgment.

03. Pre-trained multilingual models enhance multimodal understanding.

Abstract

Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision domains -- such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning -- multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling