ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics

Tue-Thu Van-Dinh; Hoang-Duy Tran; Truong-Binh Duong; Mai-Hanh Pham; Binh-Nam Le-Nguyen; Quoc-Thai Nguyen

arXiv:2512.12424·cs.CV·December 16, 2025

TL;DR

ViInfographicVQA is the first Vietnamese benchmark for infographic visual question answering. It emphasizes layout understanding, OCR, and reasoning across single and multiple images, and it highlights the limitations of current models.

Contribution

This paper presents ViInfographicVQA, a new Vietnamese infographic VQA benchmark with over 6,747 infographics and 20,409 questions, including cross-image reasoning tasks.

Findings

01. Significant performance gaps on multi-image questions.

02. Current models struggle with cross-image integration.

03. Benchmark reveals limitations of existing vision-language models.

Abstract

Infographic Visual Question Answering (InfographicVQA) evaluates a model's ability to read and reason over data-rich, layout-heavy visuals that combine text, charts, icons, and design elements. Compared with scene-text or natural-image VQA, infographics require stronger integration of OCR, layout understanding, and numerical and semantic reasoning. We introduce ViInfographicVQA, the first benchmark for Vietnamese InfographicVQA, comprising over 6,747 real-world infographics and 20,409 human-verified question-answer pairs across economics, healthcare, education, and more. The benchmark includes two evaluation settings. The Single-image task follows the traditional setup in which each question is answered using a single infographic. The Multi-image task requires synthesizing evidence across multiple semantically related infographics and is, to our knowledge, the first Vietnamese evaluation…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques