TL;DR
M$^3$-VQA is a new benchmark for evaluating multimodal large language models on multi-entity, multi-hop reasoning over visual and textual data. It exposes weaknesses in how current models acquire and reason over knowledge, and shows that structured retrieval improves results.
Contribution
Introduces M$^3$-VQA, a challenging multimodal VQA benchmark with multi-entity, multi-hop questions and detailed evidence, to better assess and advance MLLMs' reasoning capabilities.
Findings
Models perform poorly without external knowledge.
Providing gold evidence significantly improves results.
Structured, reasoning-aware retrieval outperforms heuristic methods.
Abstract
We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M$^3$-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning.…
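The three evaluation settings named in the abstract can be pictured as three ways of assembling the textual context handed to a model. The sketch below is illustrative only and is not the benchmark's actual harness: the setting names, `build_context`, `build_prompt`, the `retriever` callable, and `top_k` are all hypothetical, and the image input is omitted for brevity.

```python
from typing import Callable, List, Optional

# Hypothetical labels for the three settings described in the abstract.
SETTINGS = ("no_knowledge", "gold_evidence", "retrieval")


def build_context(
    setting: str,
    question: str,
    gold_evidence: Optional[List[str]] = None,
    retriever: Optional[Callable[[str], List[str]]] = None,
    top_k: int = 3,
) -> List[str]:
    """Return the evidence passages shown to the model under one setting."""
    if setting == "no_knowledge":
        # The model must answer from its own parametric knowledge.
        return []
    if setting == "gold_evidence":
        # The annotated evidence is supplied directly.
        return list(gold_evidence or [])
    if setting == "retrieval":
        # Evidence comes from a retriever over a knowledge base (assumed callable).
        if retriever is None:
            raise ValueError("the retrieval setting needs a retriever")
        return retriever(question)[:top_k]
    raise ValueError(f"unknown setting: {setting!r}")


def build_prompt(question: str, context: List[str]) -> str:
    """Format numbered evidence passages followed by the question."""
    lines = [f"Evidence {i}: {p}" for i, p in enumerate(context, start=1)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Comparing a model's answers across the three contexts for the same question separates what it knows on its own, what it can do with ideal evidence, and how much a given retriever recovers of that gap.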
