ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

Huy Quang Pham; Thang Kien-Bao Nguyen; Quan Van Nguyen; Dan Quang Tran; Nghia Hieu Nguyen; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen

arXiv:2404.18397·cs.CV·April 30, 2024

TL;DR

This paper introduces ViOCRVQA, a new Vietnamese OCR-VQA dataset with over 28,000 images and 120,000 question-answer pairs, along with a novel VisionReader model, highlighting challenges in low-resource language VQA.

Contribution

The paper presents the first large-scale Vietnamese OCR-VQA dataset and a new VisionReader model, advancing research in low-resource language visual question answering.

Findings

01. VisionReader achieved 0.4116 EM and 0.6990 F1-score.

02. OCR system significantly impacts VQA performance.

03. Object information enhances model accuracy.
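
For readers unfamiliar with the two metrics quoted in the findings, the following is a minimal sketch of how exact match (EM) and token-level F1 are commonly computed for answer strings. It assumes the widely used SQuAD-style definition with lowercasing and whitespace tokenization. The paper's exact normalization is not stated on this page, so the function names `normalize`, `exact_match`, `token_f1`, and `evaluate`, and the sample answers, are illustrative assumptions rather than the authors' evaluation code.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, minimal normalization)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs):
    """Average EM and F1 over (prediction, reference) pairs."""
    n = len(pairs)
    em = sum(exact_match(p, r) for p, r in pairs) / n
    f1 = sum(token_f1(p, r) for p, r in pairs) / n
    return em, f1


if __name__ == "__main__":
    # Hypothetical answers, not taken from the dataset.
    sample = [
        ("Nguyễn Nhật Ánh", "Nguyễn Nhật Ánh"),
        ("Nhà xuất bản Trẻ", "Trẻ"),
    ]
    print(evaluate(sample))
```

Under this definition a partially correct answer earns no EM credit but some F1 credit, which is one reason an F1 score can sit well above EM on the same predictions.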

Abstract

Optical Character Recognition - Visual Question Answering (OCR-VQA) is the task of answering questions about text contained in images, and it has developed significantly for English in recent years. However, studies of this task in low-resource languages such as Vietnamese remain limited. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs. Every image in the dataset contains text, and every question concerns information relevant to that text. We adapt ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset. Furthermore, we introduce a novel approach, called VisionReader, which achieved 0.4116 in EM…

Taxonomy

Topics: Multimodal Machine Learning Applications