OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

Nghia Hieu Nguyen; Duong T.D. Vo; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen

arXiv:2305.04183 · cs.CL · October 3, 2023 · 2 cites

TL;DR

This paper introduces OpenViVQA, a large-scale Vietnamese VQA dataset with open-ended answers, and proposes multimodal fusion models that generate answers more akin to human responses, advancing VQA research for low-resource languages.

Contribution

The paper presents the first large-scale Vietnamese VQA dataset and novel multimodal fusion models that generate answers, not just select them, for low-resource language VQA tasks.
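
The distinction between selecting and generating an answer can be illustrated with a toy sketch. This is not the paper's models: the candidate scores, the tiny vocabulary, and the next-token table below are all made up for illustration, and greedy decoding is used only as the simplest possible stand-in for a generative decoder.

```python
# Toy contrast between answer selection and answer generation.
# All scores, answers, and the transition table are hypothetical.

def select_answer(scores):
    """Classification-style VQA: pick the best answer from a fixed candidate set."""
    return max(scores, key=scores.get)


def generate_answer(next_token_scores, bos="<bos>", eos="<eos>", max_len=10):
    """Generation-style VQA: greedily emit one token at a time until <eos>."""
    tokens = []
    prev = bos
    for _ in range(max_len):
        candidates = next_token_scores[prev]
        prev = max(candidates, key=candidates.get)  # greedy choice
        if prev == eos:
            break
        tokens.append(prev)
    return " ".join(tokens)


# A fixed answer set limits the model to answers seen in advance.
candidate_scores = {"màu đỏ": 0.2, "màu xanh": 0.7, "hai": 0.1}

# A decoder can compose an answer that is not in any fixed set.
decoder_table = {
    "<bos>": {"cái": 0.6, "màu": 0.4},
    "cái": {"áo": 0.9, "<eos>": 0.1},
    "áo": {"màu": 0.8, "<eos>": 0.2},
    "màu": {"xanh": 0.7, "đỏ": 0.3},
    "xanh": {"<eos>": 1.0},
    "đỏ": {"<eos>": 1.0},
}

selected = select_answer(candidate_scores)
generated = generate_answer(decoder_table)
print(selected)
print(generated)
```

In this sketch the selection path can only return one of three predefined strings, while the generation path assembles a longer phrase token by token, which is the behavior the open-ended setting calls for.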

Findings

01. Proposed models achieve competitive results with state-of-the-art methods.

02. OpenViVQA dataset contains over 37,000 question-answer pairs.

03. Models demonstrate improved answer generation in Vietnamese VQA.

Abstract

In recent years, visual question answering (VQA) has attracted attention from the research community because of its promising applications (such as virtual assistance in intelligent cars, assistive devices for blind people, or information retrieval from document images using natural-language queries) and the challenges it poses. The VQA task requires methods that can fuse information from questions and images to produce appropriate answers. Neural VQA models have made great progress on large-scale datasets, which are mostly for resource-rich languages such as English. However, available datasets narrow the VQA task to answer selection or answer classification. We argue that this form of VQA is far from human ability and eliminates the challenge of the answering aspect in the VQA task by just selecting answers rather than…

Code & Models

Repositories

hieunghia-pat/openvivqa-dataset (Official)

Models

🤗 letuan/mblip-mt0-xl-vivqa · model · 2 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning