UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering
Triet M. Thai, Anh T. Vo, Hao K. Tieu, Linh N.P. Bui, Thien T.B. Nguyen

TL;DR
This paper presents a multimodal learning approach with image enhancement for gastrointestinal visual question answering, demonstrating improved accuracy and F1-Score using Transformer-based vision models and fusion techniques.
Contribution
The study introduces an image enhancement technique combined with a multimodal architecture using BERT and Transformer vision models, achieving state-of-the-art results in GI VQA.
Findings
Transformer-based vision models outperform CNNs in GI VQA.
Image enhancement significantly improves model performance.
Best model achieves 87.25% accuracy and 91.85% F1-Score.
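The page does not say how accuracy and F1-Score are computed. As a minimal sketch, assuming each question's answer is treated as a set of labels, accuracy counts exact set matches while F1 gives partial credit for overlap, which is one way an F1-Score can sit above accuracy. The function names and the example labels below are hypothetical and not taken from the paper.

```python
def exact_match_accuracy(gold, pred):
    """Fraction of samples whose predicted label set equals the gold set."""
    matches = sum(1 for g, p in zip(gold, pred) if set(g) == set(p))
    return matches / len(gold)


def sample_f1(gold, pred):
    """Mean per-sample F1 between gold and predicted label sets (assumed variant)."""
    total = 0.0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        overlap = len(g & p)
        if not g and not p:
            total += 1.0  # both empty: count as a perfect match
        elif overlap:
            precision = overlap / len(p)
            recall = overlap / len(g)
            total += 2 * precision * recall / (precision + recall)
        # no overlap contributes 0 to the sum
    return total / len(gold)


# Hypothetical answers for two questions; the second prediction is only partly right.
gold = [{"polyp"}, {"red", "pink"}]
pred = [{"polyp"}, {"red"}]

print(exact_match_accuracy(gold, pred))
print(sample_f1(gold, pred))
```

In this toy case the partly correct second answer counts as a miss for accuracy but still earns partial credit under F1.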
Abstract
In recent years, artificial intelligence has played an important role in medicine and disease diagnosis, and Medical Visual Question Answering (MedVQA) is one of its many applications. By combining computer vision and natural language processing, MedVQA systems can assist experts in extracting relevant information from a medical image based on a given question and providing precise diagnostic answers. The ImageCLEFmed-MEDVQA-GI-2023 challenge ran a visual question answering task in the gastrointestinal domain, which includes gastroscopy and colonoscopy images. Our team approached Task 1 of the challenge by proposing a multimodal learning method with image enhancement to improve VQA performance on gastrointestinal images. The multimodal architecture is set up with a BERT encoder and different pre-trained vision models based on convolutional neural network (CNN)…
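The abstract is truncated before it describes how the text and image features are combined. As a rough illustration only, the sketch below assumes the simplest option, concatenation fusion followed by a linear layer with per-answer sigmoid scores. The random vectors stand in for a BERT question embedding and a vision-model image embedding; all names, dimensions, and the 0.5 threshold are hypothetical and not the authors' implementation.

```python
import math
import random


def concat_fusion(text_vec, image_vec):
    """Fuse two feature vectors by simple concatenation."""
    return list(text_vec) + list(image_vec)


def linear(x, weights, bias):
    """Dense layer: one output score per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_answers(text_vec, image_vec, weights, bias, threshold=0.5):
    """Return indices of candidate answers whose score reaches the threshold."""
    fused = concat_fusion(text_vec, image_vec)
    scores = [sigmoid(z) for z in linear(fused, weights, bias)]
    return [i for i, s in enumerate(scores) if s >= threshold]


rng = random.Random(0)
text_vec = [rng.uniform(-1, 1) for _ in range(4)]   # stand-in for a question embedding
image_vec = [rng.uniform(-1, 1) for _ in range(6)]  # stand-in for an image embedding
num_answers = 3                                     # hypothetical answer vocabulary size
weights = [[rng.uniform(-1, 1) for _ in range(10)] for _ in range(num_answers)]
bias = [0.0] * num_answers

print(predict_answers(text_vec, image_vec, weights, bias))
```

In a real system the weights would be learned and the stand-in vectors replaced by encoder outputs; the sketch only shows how the two modalities meet in a single classifier input.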
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Warmup With Linear Decay · Attention Dropout · Label Smoothing · Byte Pair Encoding · Linear Layer · Adam · WordPiece
