Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering
Zihao Zhu, Jing Yu, Yujing Wang, Yajing Sun, Yue Hu, Qi Wu

TL;DR
This paper introduces Mucko, a multi-layer cross-modal reasoning framework using a heterogeneous graph convolutional network to improve fact-based visual question answering by effectively selecting and integrating relevant evidence across visual, semantic, and factual layers.
Contribution
It proposes a novel multi-layer graph-based model that performs iterative reasoning to better capture question-relevant evidence for FVQA, achieving state-of-the-art results.
Findings
Achieves new state-of-the-art performance on FVQA.
Effectively captures and integrates multi-modal evidence.
Demonstrates improved interpretability of the reasoning process.
Abstract
Fact-based Visual Question Answering (FVQA) requires external knowledge beyond visible content to answer questions about an image, which is challenging but indispensable for achieving general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning about the final answer. How to capture question-oriented and information-complementary evidence remains a key challenge. In this paper, we depict an image by a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to the visual, semantic and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture evidence from different layers that is most relevant to the given question.…
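The idea of question-guided evidence selection over a multi-layer graph can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the scaled dot-product attention, the 0.5/0.5 residual mix, the random features, and the two reasoning steps are all assumptions chosen for illustration. The sketch only shows the general pattern described above: within each layer, neighbors are weighted by their relevance to the question, and fact nodes then gather complementary evidence from the visual and semantic layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(nodes, question):
    # Relevance of each node to the question (scaled dot product, an assumption).
    scores = nodes @ question / np.sqrt(nodes.shape[1])
    return softmax(scores)

def intra_layer_step(nodes, adj, question):
    # Simplified within-layer convolution: each node averages its neighbors,
    # with neighbor j weighted by its question relevance alpha[j].
    alpha = question_guided_attention(nodes, question)
    weighted = adj * alpha[None, :]
    norm = weighted.sum(axis=1, keepdims=True)
    norm = np.where(norm == 0, 1.0, norm)
    messages = (weighted / norm) @ nodes
    return 0.5 * nodes + 0.5 * messages  # hypothetical residual mix

def cross_layer_gather(fact_nodes, other_nodes, question):
    # Each fact node attends over another layer's nodes, conditioned on the
    # question, and adds the attended summary to its own representation.
    queries = fact_nodes + question
    scores = queries @ other_nodes.T / np.sqrt(fact_nodes.shape[1])
    attn = softmax(scores, axis=1)
    return fact_nodes + attn @ other_nodes

def random_adjacency(rng, n):
    # Random undirected-looking adjacency with self-loops (toy data only).
    a = (rng.random((n, n)) > 0.5).astype(float)
    return np.maximum(a, np.eye(n))

rng = np.random.default_rng(0)
dim = 8
question = rng.normal(size=dim)
visual = rng.normal(size=(4, dim))    # e.g. detected object features
semantic = rng.normal(size=(3, dim))  # e.g. caption-derived entity features
facts = rng.normal(size=(5, dim))     # e.g. knowledge-base entity features
adj_v, adj_s, adj_f = (random_adjacency(rng, n) for n in (4, 3, 5))

for _ in range(2):  # two reasoning steps, an arbitrary choice
    visual = intra_layer_step(visual, adj_v, question)
    semantic = intra_layer_step(semantic, adj_s, question)
    facts = intra_layer_step(facts, adj_f, question)
    facts = cross_layer_gather(facts, visual, question)
    facts = cross_layer_gather(facts, semantic, question)

fused = facts  # question-aware fact representations after reasoning
print(fused.shape)
```

In this sketch the attention weights double as a rough interpretability signal: inspecting which visual or semantic nodes a fact node attends to indicates which evidence contributed to its final representation.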
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Interpretability · Convolution
