MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang, Bin Li, Jiajun Bu

TL;DR
MaS-VQA introduces a selection-driven framework that filters relevant knowledge and image regions to improve reasoning and accuracy in knowledge-based visual question answering tasks.
Contribution
The paper proposes a novel Mask-and-Select mechanism for explicit knowledge filtering, which improves the integration of visual and external knowledge for VQA.
Findings
Consistent performance improvements on the Encyclopedic-VQA and InfoSeek datasets.
Effective noise reduction through the selection mechanism.
Enhanced knowledge utilization leading to better answer accuracy.
Abstract
Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space,…
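The abstract does not say how regions and passages are scored, so the following is only a rough illustration of the general idea of masking image regions and selecting knowledge passages, not the authors' method. It assumes, hypothetically, that the question, the image regions, and the retrieved passages are already embedded in a shared vector space, and it swaps in plain cosine similarity, a fixed threshold, and top-k selection. The names `mask_and_select`, `region_threshold`, and `top_k_passages` are invented for this sketch.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mask_and_select(
    question_vec: Sequence[float],
    region_vecs: List[Sequence[float]],
    passage_vecs: List[Sequence[float]],
    region_threshold: float = 0.5,   # assumed hyperparameter
    top_k_passages: int = 2,         # assumed hyperparameter
) -> Tuple[List[bool], List[int]]:
    """Hypothetical stand-in for a mask-and-select step.

    Mask: keep image regions whose similarity to the question meets the
    threshold. Select: rank passages by the mean of (a) similarity to the
    question and (b) best similarity to any kept region, then keep top-k.
    """
    # Mask step: one keep/drop flag per image region.
    region_mask = [cosine(question_vec, r) >= region_threshold for r in region_vecs]
    kept_regions = [r for r, keep in zip(region_vecs, region_mask) if keep]

    # Select step: score each passage against the question and kept regions.
    scores = []
    for p in passage_vecs:
        text_score = cosine(question_vec, p)
        visual_score = max((cosine(r, p) for r in kept_regions), default=0.0)
        scores.append((text_score + visual_score) / 2.0)

    # Indices of the highest-scoring passages, best first.
    ranked = sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)
    return region_mask, ranked[:top_k_passages]


if __name__ == "__main__":
    question = [1.0, 0.0]
    regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    mask, picked = mask_and_select(question, regions, passages)
    print("region mask:", mask)
    print("selected passage indices:", picked)
```

In this toy setup the passage score depends on which regions survive the mask, which is one simple way to make the two pruning decisions interact; the paper's actual coupling between visual and textual filtering may differ substantially.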
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling
