Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering
Triet Minh Thai, Son T. Luu

TL;DR
This paper presents a multilingual visual question answering model that combines image features with a convolutional sequence-to-sequence network, achieving competitive results in a shared task across three languages.
Contribution
The work integrates image features and hints from pre-trained VQA models into a convolutional sequence-to-sequence model for multilingual VQA, placing 3rd in the VLSP2022-EVJVQA shared task.
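To make the "convolutional sequence-to-sequence" part concrete, here is a minimal sketch of the gated linear unit (GLU) convolution that ConvS2S-style networks stack in place of recurrent layers. It is a single-channel, pure-Python simplification with symmetric zero padding; the function name `glu_conv1d` and its parameters are illustrative assumptions, not the authors' code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu_conv1d(seq, w_a, w_b, b_a=0.0, b_b=0.0):
    """Single-channel 1D convolution followed by a gated linear unit.

    seq      : list of floats (one feature channel over time)
    w_a, w_b : two kernels of the same odd width k; w_a produces the
               candidate values, w_b produces the gate
    Returns a list with the same length as seq (symmetric zero padding,
    an assumption made here for simplicity).
    """
    k = len(w_a)
    assert len(w_b) == k and k % 2 == 1, "kernels must share an odd width"
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    out = []
    for i in range(len(seq)):
        window = padded[i:i + k]
        a = sum(x * w for x, w in zip(window, w_a)) + b_a  # candidate value
        b = sum(x * w for x, w in zip(window, w_b)) + b_b  # gate pre-activation
        out.append(a * sigmoid(b))                         # GLU: a * sigma(b)
    return out
```

In a real model each position holds a vector (for example a token embedding, possibly combined with image features), and the kernels are learned weight tensors; the gating logic stays the same.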
Findings
Achieved an F1 score of 0.3442 on the public test set and 0.4210 on the private test set.
Secured 3rd place in the VLSP2022-EVJVQA challenge.
Generated answers to questions written in English, Vietnamese, and Japanese.
Abstract
Visual Question Answering (VQA) is a task that requires computers to give correct answers to input questions based on images. Humans can solve this task with ease, but it remains a challenge for computers. The VLSP2022-EVJVQA shared task poses Visual Question Answering in a multilingual setting on a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three different languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, in which we integrated hints from pre-trained state-of-the-art VQA models and image features with a Convolutional Sequence-to-Sequence network to generate the desired answers. Our system achieved an F1 score of up to 0.3442 on the public test set and 0.4210 on the private test set, and placed 3rd in the competition.
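The scores above are F1 values computed between generated and reference answers. As a rough illustration, the sketch below computes a token-level F1 for one answer pair, assuming whitespace tokenization and lowercasing. Those choices and the name `token_f1` are assumptions for illustration; the shared task's official scorer may tokenize differently, and whitespace splitting in particular does not suit Japanese text.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer.

    Assumes whitespace tokenization and case-insensitive matching;
    this is a sketch, not the official evaluation script.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "red car" against the reference "a red car" has precision 1 and recall 2/3, so its F1 is 0.8. A dataset-level score would typically average this value over all question-answer pairs.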
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
