TL;DR
This paper introduces a Vietnamese Visual Question Answering model that combines transformer and convolutional neural networks, reaching 71.04% accuracy on the ViVQA test set by improving image feature extraction and multimodal fusion.
Contribution
The study presents a new Vietnamese VQA system integrating BLIP-2, EfficientNet, and BEiT-3, which improves image representation and reduces training costs while outperforming existing baselines.
Findings
Achieved 71.04% accuracy on ViVQA test set.
Enhanced image feature extraction by combining a transformer with a CNN.
Reduced training time by freezing pre-trained model parameters.
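The last finding, freezing pre-trained parameters, can be illustrated with a minimal framework-free sketch. It does not reproduce the paper's model: the `Param` class, the `freeze` and `sgd_step` helpers, and the toy weights below are all hypothetical stand-ins. The idea shown is only that frozen encoder weights are skipped during updates, so the optimizer touches just the small new head. In PyTorch the same effect is usually obtained by setting `requires_grad = False` on the pre-trained parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Param:
    """A scalar weight plus a flag saying whether training may change it."""
    value: float
    trainable: bool = True


def freeze(params: List[Param]) -> None:
    """Mark every parameter in the list as frozen (not updated in training)."""
    for p in params:
        p.trainable = False


def sgd_step(params: List[Param], grads: List[float], lr: float = 0.1) -> None:
    """Apply one plain SGD update, skipping frozen parameters."""
    for p, g in zip(params, grads):
        if p.trainable:
            p.value -= lr * g


# Hypothetical stand-ins for two pre-trained image encoders and a new head.
cnn_encoder = [Param(0.5), Param(-0.2)]
transformer_encoder = [Param(1.0), Param(0.3)]
fusion_head = [Param(0.0), Param(0.0)]

# Freeze the pre-trained parts; only the head remains trainable.
freeze(cnn_encoder)
freeze(transformer_encoder)

all_params = cnn_encoder + transformer_encoder + fusion_head
grads = [1.0] * len(all_params)  # dummy gradients, for illustration only

before = [p.value for p in all_params]
sgd_step(all_params, grads)
after = [p.value for p in all_params]

n_trainable = sum(p.trainable for p in all_params)
print(f"trainable parameters: {n_trainable} of {len(all_params)}")
```

In a real framework, frozen parameters also need no gradient computation or optimizer state, which is where most of the training-cost saving would come from.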
Abstract
Visual Question Answering (VQA) has recently emerged as a potential research domain, captivating the interest of many in the field of artificial intelligence and computer vision. Despite the prevalence of approaches in English, there is a notable lack of systems specifically developed for certain languages, particularly Vietnamese. This study aims to bridge this gap by conducting comprehensive experiments on the Vietnamese Visual Question Answering (ViVQA) dataset, demonstrating the effectiveness of our proposed model. In response to community interest, we have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system. Specifically, our model integrates the Bootstrapping Language-Image Pre-training with frozen unimodal models (BLIP-2) and the convolutional neural network EfficientNet to extract and process both local and…
Taxonomy
Methods: Sparse Evolutionary Training · Depthwise Convolution · Sigmoid Activation · Pointwise Convolution · Depthwise Separable Convolution · Squeeze-and-Excitation Block · Average Pooling · Dense Connections
