Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration

Ngoc Son Nguyen; Van Son Nguyen; Tung Le

arXiv:2407.21229 · cs.CV · August 1, 2024

TL;DR

This paper introduces a Vietnamese Visual Question Answering model that combines transformer and convolutional neural networks. By improving image feature extraction and multimodal fusion, it achieves 71.04% accuracy on the ViVQA dataset.

Contribution

The study presents a new Vietnamese VQA system integrating BLIP-2, EfficientNet, and BEiT-3, which improves image representation and reduces training costs while outperforming existing baselines.

Findings

01 Achieved 71.04% accuracy on the ViVQA test set.

02 Enhanced image feature extraction by combining transformer and CNN features.

03 Reduced training time by freezing the parameters of the pre-trained models.
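
The third finding, freezing pre-trained parameters, can be illustrated with a minimal PyTorch sketch. This is not the authors' code: `backbone` and `head` are small hypothetical stand-ins for a pre-trained encoder and a trainable answer classifier, and the layer sizes are arbitrary. The sketch only shows the general mechanism, in which frozen parameters receive no gradients and are left out of the optimizer.

```python
import torch
from torch import nn


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradient tracking for every parameter of `module`."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()  # also fix dropout / batch-norm behaviour
    return module


def count_trainable(module: nn.Module) -> int:
    """Number of scalar parameters that will still be updated."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Hypothetical stand-ins: a "pre-trained" encoder and a small trainable head.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
head = nn.Linear(8, 4)

freeze(backbone)
model = nn.Sequential(backbone, head)

# Only parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# One dummy training step: gradients flow into the head only.
x = torch.randn(2, 16)
loss = model(x).sum()
loss.backward()
optimizer.step()
```

Because the frozen encoder stores no gradients and has no optimizer state, each step updates only the 36 parameters of the head in this toy setup, which is the usual source of the training-cost savings.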

Abstract

Visual Question Answering (VQA) has recently emerged as a potential research domain, captivating the interest of many in the field of artificial intelligence and computer vision. Despite the prevalence of approaches in English, there is a notable lack of systems specifically developed for certain languages, particularly Vietnamese. This study aims to bridge this gap by conducting comprehensive experiments on the Vietnamese Visual Question Answering (ViVQA) dataset, demonstrating the effectiveness of our proposed model. In response to community interest, we have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system. Specifically, our model integrates the Bootstrapping Language-Image Pre-training with frozen unimodal models (BLIP-2) and the convolutional neural network EfficientNet to extract and process both local and…

Code & Models

Repositories

nngocson2002/ViVQA (PyTorch, official)

Taxonomy

Methods: Sparse Evolutionary Training · Depthwise Convolution · Sigmoid Activation · Pointwise Convolution · Depthwise Separable Convolution · Squeeze-and-Excitation Block · Average Pooling · Dense Connections