REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering
Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, Lu Yuan

TL;DR
This paper introduces REVIVE, a novel approach that leverages explicit regional visual information to significantly enhance knowledge-based VQA performance, emphasizing the importance of object regions and relationships.
Contribution
REVIVE is the first method to explicitly utilize object region information throughout the knowledge retrieval and answering stages in knowledge-based VQA.
Findings
Achieved 58.0% accuracy on OK-VQA, setting a new state-of-the-art.
Demonstrated the importance of regional information in different framework components.
Showed that better regional visual features lead to substantial performance improvements.
Abstract
This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that using regional information in a better way can significantly improve performance. While visual representation is extensively studied in traditional VQA, it is under-explored in knowledge-based VQA even though the two tasks share a common spirit, i.e., both rely on visual input to answer the question. Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual features are extracted either from the whole image or in a sliding window manner for retrieving knowledge, and the important relationship within/among object regions is neglected; 2) visual features are not well utilized in the final answering model, which is counter-intuitive to some extent. Based on these observations, we propose a new knowledge-based VQA method REVIVE, which…
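To make the contrast in the abstract concrete, here is a minimal sketch of the two ways of choosing image crops it mentions: a content-agnostic sliding window versus object regions. This is not the REVIVE implementation. The function names, the box format, and the detector output below are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Box format (x0, y0, x1, y1) in pixels; an assumption for this sketch.
Box = Tuple[int, int, int, int]


def sliding_window_boxes(width: int, height: int, window: int, stride: int) -> List[Box]:
    """Enumerate fixed-size crops on a regular grid, ignoring image content."""
    boxes = []
    for y0 in range(0, height - window + 1, stride):
        for x0 in range(0, width - window + 1, stride):
            boxes.append((x0, y0, x0 + window, y0 + window))
    return boxes


def region_boxes(detections: List[Tuple[Box, float]], min_score: float, top_k: int) -> List[Box]:
    """Keep the top_k highest-scoring detector boxes with score >= min_score."""
    kept = [d for d in detections if d[1] >= min_score]
    kept.sort(key=lambda d: d[1], reverse=True)
    return [box for box, _ in kept[:top_k]]


# Hypothetical detector output: (box, confidence) pairs for a 224x224 image.
detections = [
    ((10, 20, 110, 200), 0.92),
    ((150, 40, 220, 120), 0.35),
    ((60, 130, 200, 220), 0.78),
]

grid = sliding_window_boxes(width=224, height=224, window=112, stride=112)
regions = region_boxes(detections, min_score=0.5, top_k=2)

print(grid)     # four grid cells that tile the image regardless of content
print(regions)  # the two confident object boxes, highest score first
```

The grid crops can cut an object in half or mix several objects in one cell, whereas the region crops follow object boundaries, which is the kind of regional information the abstract argues is neglected.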
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
