MUREL: Multimodal Relational Reasoning for Visual Question Answering
Remi Cadene, Hedi Ben-younes, Matthieu Cord, Nicolas Thome

TL;DR
MuRel is a multimodal relational reasoning network for visual question answering. It models both the interactions between the question and image regions and the relations among regions, and it outperforms attention-based methods.
Contribution
The paper proposes MuRel, an end-to-end trainable relational network built around a new reasoning primitive, the MuRel cell, and reports improvements over existing attention-based VQA models.
Findings
MuRel outperforms attention-based models on the VQA 2.0, VQA-CP v2, and TDIUC datasets.
The MuRel network achieves state-of-the-art or competitive results on these benchmarks.
Ablation studies confirm the effectiveness of the relational reasoning approach.
Abstract
Multimodal attentional networks are currently state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows the model to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model the complex reasoning features required for VQA or other high-level tasks. In this paper, we propose MuRel, a multimodal relational network which is learned end-to-end to reason over real images. Our first contribution is the introduction of the MuRel cell, an atomic reasoning primitive representing interactions between question and image regions by a rich vectorial representation, and modeling region relations with pairwise combinations. Secondly, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions, and can be leveraged to define visualization schemes finer…
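The cell described in the abstract can be illustrated with a rough sketch. The abstract specifies only that each region is combined with the question into a vector (rather than a scalar attention weight), that regions are related through pairwise combinations, and that the network refines these interactions progressively. Everything else below is an assumption made for illustration: the elementwise-product fusion, the max aggregation over pairs, the residual update, the reuse of the same weights at every step, and all function and parameter names (`fuse`, `murel_like_cell`, `murel_like_network`, `params`) are hypothetical and not taken from the paper.

```python
import random


def linear(weights, vec):
    """Multiply an (out_dim x in_dim) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def fuse(w_a, w_b, a, b):
    """Stand-in fusion (assumption): project both inputs, multiply elementwise."""
    return [x * y for x, y in zip(linear(w_a, a), linear(w_b, b))]


def random_matrix(rng, out_dim, in_dim):
    """Small random weights, used here only to make the sketch runnable."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def murel_like_cell(regions, question, params):
    """One reasoning step over a list of region vectors and a question vector."""
    # 1. Multimodal step: a full fused vector per region, not a scalar score.
    fused = [fuse(params["w_region"], params["w_question"], r, question)
             for r in regions]
    updated = []
    for i, m_i in enumerate(fused):
        # 2. Pairwise step: combine region i with every other region j.
        pair_vecs = [fuse(params["w_left"], params["w_right"], m_i, m_j)
                     for j, m_j in enumerate(fused) if j != i]
        # 3. Aggregate pairwise vectors (elementwise max is an assumption).
        if pair_vecs:
            context = [max(column) for column in zip(*pair_vecs)]
        else:
            context = [0.0] * len(m_i)  # a single region has no neighbours
        # 4. Residual update (assumption) keeps the region dimensionality.
        updated.append([r + m + c for r, m, c in zip(regions[i], m_i, context)])
    return updated


def murel_like_network(regions, question, params, steps=3):
    """Apply the cell repeatedly; sharing weights across steps is an assumption."""
    for _ in range(steps):
        regions = murel_like_cell(regions, question, params)
    return regions


rng = random.Random(0)
dim, q_dim, n_regions = 4, 3, 5
params = {
    "w_region": random_matrix(rng, dim, dim),
    "w_question": random_matrix(rng, dim, q_dim),
    "w_left": random_matrix(rng, dim, dim),
    "w_right": random_matrix(rng, dim, dim),
}
regions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_regions)]
question = [rng.uniform(-1, 1) for _ in range(q_dim)]

refined = murel_like_network(regions, question, params, steps=2)
print(len(refined), len(refined[0]))  # prints: 5 4
```

Because each step returns one vector per region with the same dimensionality as the input, the cell can be stacked any number of times, which is one plausible way to realise the progressive refinement the abstract mentions.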
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
