OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen, Duong T.D. Vo, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

TL;DR
This paper introduces OpenViVQA, a large-scale Vietnamese VQA dataset with open-ended answers, and proposes multimodal fusion models that generate answers more akin to human responses, advancing VQA research for low-resource languages.
Contribution
The paper presents the first large-scale Vietnamese VQA dataset and novel multimodal fusion models that generate answers, not just select them, for low-resource language VQA tasks.
Findings
Proposed models achieve results competitive with state-of-the-art methods.
OpenViVQA dataset contains over 37,000 question-answer pairs.
Models demonstrate improved answer generation in Vietnamese VQA.
Abstract
In recent years, visual question answering (VQA) has attracted attention from the research community because of its high-potential applications (such as virtual assistance in intelligent cars, assistive devices for blind people, or information retrieval from document images using natural-language queries) and the challenges it poses. The VQA task requires methods that can fuse information from questions and images to produce appropriate answers. Neural visual question answering models have achieved tremendous growth on large-scale datasets, which are mostly for resource-rich languages such as English. However, available datasets narrow the VQA task to an answer-selection or answer-classification task. We argue that this form of VQA is far from human ability and eliminates the challenge of the answering aspect in the VQA task by just selecting answers rather than…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
