Improved RAMEN: Towards Domain Generalization for Visual Question Answering
Bhanuka Manesha Samarasekara Vitharana Gamage, Lim Chern Hong

TL;DR
This paper enhances the RAMEN model for Visual Question Answering by introducing vector operation-based fusion and transformer-based aggregation modules, significantly improving domain generalization across multiple datasets.
Contribution
It proposes two novel improvements to RAMEN's architecture, focusing on fusion and aggregation modules, to better generalize across diverse VQA datasets.
Findings
Performance improves on up to five VQA datasets.
Vector-based fusion strategies enhance feature integration.
Transformer-based aggregation improves domain robustness.
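To make the two ideas in the findings concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The fusion operations shown (concatenation, element-wise addition, element-wise multiplication) are assumed examples of "vector operations", and the aggregation step is a single-head self-attention pass with no learned projections followed by mean pooling, standing in for a transformer-based aggregator. The names `fuse` and `attention_aggregate` are hypothetical.

```python
import math


def fuse(image_vec, question_vec, op="concat"):
    """Combine one image-region feature with the question feature.

    The three operations are illustrative assumptions, not the paper's list.
    """
    if op == "concat":
        return list(image_vec) + list(question_vec)
    if len(image_vec) != len(question_vec):
        raise ValueError("element-wise fusion needs equal-length vectors")
    if op == "add":
        return [a + b for a, b in zip(image_vec, question_vec)]
    if op == "multiply":
        return [a * b for a, b in zip(image_vec, question_vec)]
    raise ValueError(f"unknown fusion op: {op}")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(region_vecs):
    """Self-attention over fused region vectors, then mean pooling.

    Simplification: queries, keys, and values are the inputs themselves
    (no learned weights, one head, no feed-forward layer).
    """
    dim = len(region_vecs[0])
    scale = math.sqrt(dim)
    attended = []
    for query in region_vecs:
        # Scaled dot-product score of this region against every region.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in region_vecs]
        weights = softmax(scores)
        # Weighted sum of all region vectors.
        attended.append([sum(w * vec[i] for w, vec in zip(weights, region_vecs))
                         for i in range(dim)])
    # Mean-pool the attended vectors into a single summary vector.
    count = len(attended)
    return [sum(vec[i] for vec in attended) / count for i in range(dim)]


# Toy data: two image regions and one question embedding (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0]]
question = [0.5, 2.0]
fused = [fuse(region, question, op="multiply") for region in regions]
pooled = attention_aggregate(fused)
```

In this sketch the question vector is fused with each region separately, and the aggregator then reduces the per-region vectors to one fixed-size vector that a classifier could consume. A real model would use learned projections and typically a deep-learning framework rather than Python lists.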
Abstract
Visual Question Answering (VQA) is an emerging area of artificial intelligence that is nearing human-level performance. As a multi-disciplinary field of machine learning, it brings the computer vision and natural language processing communities together in pursuit of state-of-the-art (SOTA) performance. However, a gap remains between SOTA results and real-world applications, owing to a lack of model generalization. The RAMEN model \cite{Shrestha2019} aimed to achieve domain generalization by obtaining the highest score across two main types of VQA datasets. This study provides two major improvements to the early/late fusion module and the aggregation module of the RAMEN architecture, with the objective of further strengthening domain generalization. Vector operation-based fusion strategies are introduced for the fusion module and the transformer…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
