MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

Xianwei Mao; Kai Ye; Sheng Zhou; Nan Zhang; Haikuan Huang; Bin Li; Jiajun Bu

arXiv:2602.15915·cs.CV·February 19, 2026

TL;DR

MaS-VQA introduces a selection-driven framework that filters relevant knowledge and image regions to improve reasoning and accuracy in knowledge-based visual question answering tasks.

Contribution

The paper proposes a novel Mask-and-Select mechanism for explicit knowledge filtering, which improves how visual and external knowledge are integrated for VQA.

Findings

01. Consistent performance improvements on Encyclopedic-VQA and InfoSeek datasets.

02. Effective noise reduction through the selection mechanism.

03. Enhanced knowledge utilization leading to better answer accuracy.

Abstract

Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space,…
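
The pruning step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how relevance is scored or how the joint pruning is performed, so the sketch assumes that image regions and knowledge fragments are already embedded as vectors in a shared space, scores each one independently against a question embedding with cosine similarity, and keeps those at or above a fixed threshold. The names `cosine`, `mask_and_select`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mask_and_select(query, regions, fragments, threshold=0.5):
    """Hypothetical filter: build boolean masks, then keep only the items that pass.

    Assumption: relevance is cosine similarity to the question embedding.
    The paper's actual scoring and joint pruning may differ.
    """
    # Mask step: one boolean per image region and per knowledge fragment.
    region_mask = [cosine(query, r) >= threshold for r in regions]
    fragment_mask = [cosine(query, f) >= threshold for f in fragments]
    # Select step: retain only the unmasked items.
    kept_regions = [r for r, keep in zip(regions, region_mask) if keep]
    kept_fragments = [f for f, keep in zip(fragments, fragment_mask) if keep]
    return region_mask, fragment_mask, kept_regions, kept_fragments


# Toy 2-D embeddings, invented purely for illustration.
query = [1.0, 0.0]
regions = [[0.9, 0.1], [0.0, 1.0]]
fragments = [[1.0, 0.2], [-1.0, 0.0], [0.1, 1.0]]

region_mask, fragment_mask, kept_regions, kept_fragments = mask_and_select(
    query, regions, fragments
)
```

In this toy setup only the first region and the first fragment point in roughly the same direction as the question embedding, so they are the only items retained. A fixed threshold is one simple choice; a top-k cutoff would be an equally plausible alternative under the same assumptions.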

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling