M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

Jiatong Ma; Longteng Guo; Yuchen Liu; Zijia Zhao; Dongze Hao; Xuanxu Lin; Jing Liu

arXiv:2604.25122·cs.CV·April 29, 2026

TL;DR

M$^3$-VQA is a benchmark for evaluating multimodal large language models on multi-entity, multi-hop reasoning over visual and textual data. It exposes how much current models struggle with knowledge acquisition and reasoning, and shows that structured retrieval improves results.

Contribution

Introduces M$^3$-VQA, a challenging multimodal VQA benchmark with multi-entity, multi-hop questions and detailed evidence, to better assess and advance MLLMs' reasoning capabilities.

Findings

01  Models perform poorly without external knowledge.

02  Providing gold evidence significantly improves results.

03  Structured, reasoning-aware retrieval outperforms heuristic methods.

Abstract

We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M$^3$-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning.…
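The three evaluation settings named in the abstract can be illustrated with a minimal sketch of how the textual input to a model might differ between them. Everything here is an assumption for illustration: the `Example` fields, the setting names, the prompt layout, and the toy keyword retriever are hypothetical and are not taken from the paper or its repository. Image inputs are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Example:
    """Hypothetical benchmark item: a question plus its annotated evidence."""
    question: str
    gold_evidence: List[str] = field(default_factory=list)


def build_context(example: Example,
                  setting: str,
                  retriever: Optional[Callable[[str], List[str]]] = None,
                  top_k: int = 2) -> List[str]:
    """Return the evidence passages shown to the model under one setting."""
    if setting == "no_knowledge":
        # Setting 1: the model answers from its own parameters only.
        return []
    if setting == "gold_evidence":
        # Setting 2: the annotated evidence is handed to the model directly.
        return list(example.gold_evidence)
    if setting == "retrieval":
        # Setting 3: passages come from a retriever over a knowledge base.
        if retriever is None:
            raise ValueError("the retrieval setting needs a retriever")
        return retriever(example.question)[:top_k]
    raise ValueError(f"unknown setting: {setting}")


def build_prompt(example: Example,
                 setting: str,
                 retriever: Optional[Callable[[str], List[str]]] = None,
                 top_k: int = 2) -> str:
    """Assemble a plain-text prompt: evidence lines first, then the question."""
    passages = build_context(example, setting, retriever, top_k)
    lines = [f"Evidence: {p}" for p in passages]
    lines.append(f"Question: {example.question}")
    return "\n".join(lines)


def make_keyword_retriever(corpus: List[str]) -> Callable[[str], List[str]]:
    """Toy stand-in retriever: rank passages by word overlap with the query."""
    def retrieve(query: str) -> List[str]:
        query_words = set(query.lower().split())
        return sorted(
            corpus,
            key=lambda p: -len(query_words & set(p.lower().split())),
        )
    return retrieve
```

The only difference between the settings in this sketch is where the evidence passages come from, which mirrors the abstract's framing: comparing the first two settings isolates the value of having the right knowledge, while comparing the last two isolates retrieval quality.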

Code & Models

Repositories

CASIA-IVA-Lab/M3VQA (GitHub)
