Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering
Jing Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, Jianlong Tan

TL;DR
This paper introduces a multimodal, graph-based recurrent reasoning model for knowledge-based visual question answering, effectively capturing question-oriented evidence from visual, semantic, and factual knowledge graphs to improve accuracy.
Contribution
It proposes a novel multi-graph reasoning framework with GRUC modules that perform transitive reasoning over visual, semantic, and factual graphs for KVQA.
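As a rough illustration of reasoning over several graphs in turn, the sketch below runs a question-guided attention "read" over the nodes of each graph and then blends the result into a running memory vector. It is a toy under stated assumptions, not the authors' GRUC module: the function names (`read`, `update`, `reason`), the fixed `gate` value, the dot-product scoring, and the two-dimensional node vectors are all hypothetical, and nothing here is learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read(nodes, query):
    """Attention-weighted sum of node vectors, scored against the query."""
    weights = softmax([dot(n, query) for n in nodes])
    dim = len(query)
    summary = [sum(w * n[i] for w, n in zip(weights, nodes)) for i in range(dim)]
    return summary, weights


def update(memory, evidence, gate=0.5):
    """Blend old memory with new evidence (fixed gate; an assumption, not learned)."""
    return [(1 - gate) * m + gate * e for m, e in zip(memory, evidence)]


def reason(graphs, question, steps=2):
    """Run a few read/update rounds over each graph in turn."""
    memory = list(question)
    for _ in range(steps):
        for nodes in graphs.values():
            evidence, _ = read(nodes, memory)
            memory = update(memory, evidence)
    return memory


# Hypothetical toy node features for the three views named in the summary.
graphs = {
    "visual": [[1.0, 0.0], [0.0, 1.0]],
    "semantic": [[0.5, 0.5], [1.0, 1.0]],
    "factual": [[0.0, 2.0], [2.0, 0.0]],
}
question = [1.0, 0.0]
memory = reason(graphs, question)
```

The memory starts as the question vector, so each read is steered by the question plus whatever evidence earlier graphs contributed. A real model would replace the fixed gate and dot-product scores with learned parameters and would also use the edges of each graph, which this sketch ignores.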
Findings
Achieves state-of-the-art results on FVQA, Visual7W-KB, and OK-VQA datasets.
Demonstrates improved interpretability and reasoning capability.
Effectively filters relevant information to answer questions accurately.
Abstract
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing KVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning toward the correct answer. How to capture question-oriented and information-complementary evidence remains a key challenge. Inspired by human cognition theory, we depict an image with multiple knowledge graphs from the visual, semantic, and factual views. Among these, the visual graph and semantic graph are regarded as image-conditioned instantiations of the factual graph. On top of these new representations, we re-formulate Knowledge-based Visual Question Answering as a…
Taxonomy
Methods, Interpretability
