Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering

Triet Minh Thai; Son T. Luu

arXiv:2303.12671 · cs.CV · June 18, 2024 · 1 citation

TL;DR

This paper presents a multilingual visual question answering model that combines image features with a convolutional sequence-to-sequence network, achieving competitive results in a shared task across three languages.

Contribution

The work integrates image features and hints from pre-trained VQA models into a convolutional sequence-to-sequence model for multilingual VQA, placing 3rd in the VLSP2022-EVJVQA shared task.

Findings

01. Achieved an F1 score of 0.3442 on the public test set and 0.4210 on the private test set.

02. Secured 3rd place in the VLSP2022-EVJVQA challenge.

03. Handled multilingual questions and answers in English, Vietnamese, and Japanese.

Abstract

Visual Question Answering (VQA) is a task that requires computers to answer input questions correctly based on images. Humans solve this task with ease, but it remains a challenge for computers. The VLSP2022-EVJVQA shared task poses the Visual Question Answering task in a multilingual setting on a newly released dataset, UIT-EVJVQA, in which the questions and answers are written in three languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, in which we integrated hints from pre-trained state-of-the-art VQA models and image features with a Convolutional Sequence-to-Sequence network to generate the desired answers. Our approach achieved an F1 score of up to 0.3442 on the public test set and 0.4210 on the private test set, and placed 3rd in the competition.
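To make the reported F1 figures easier to interpret, here is a minimal sketch of a token-level F1 between a generated answer and a reference answer, in the style commonly used for question answering evaluation. The exact metric definition and tokenization used by the shared task are not stated here, so the whitespace tokenization, the lowercasing, and the function name `token_f1` are all assumptions made for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace. This is an illustrative choice, not the shared task's
    official tokenization.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("a red car", "the red car")` shares two of three tokens on each side, giving precision and recall of 2/3 and therefore an F1 of 2/3. Whitespace splitting is not meaningful for Japanese, which is written without spaces, so a multilingual evaluation would need a language-appropriate tokenizer in place of `split()`.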

Code & Models

Repositories

https://huggingface.co/spaces/daeron/CONVS2S-EVJVQA-DEMO

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning