M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding

Shenxi Liu; Kan Li; Mingyang Zhao; Yuhang Tian; Bin Li; Shoujun Zhou; Hongliang Li; Fuxia Yang

arXiv:2507.04289 · cs.CV · July 8, 2025

TL;DR

M3-Med is a benchmark for evaluating multi-lingual, multi-modal, and multi-hop reasoning over medical instructional videos. It targets two limitations of existing benchmarks: English-only coverage and questions that test only surface-level information retrieval.

Contribution

The paper introduces the first benchmark for multi-lingual, multi-modal, and multi-hop reasoning in medical instructional video understanding. Its questions are annotated by medical experts and are designed to require deep cross-modal understanding.

Findings

1. Significant performance gap between models and humans on complex questions
2. Models struggle with multi-hop reasoning across modalities
3. Benchmark reveals current AI limitations in medical video understanding

Abstract

With the rapid progress of artificial intelligence (AI) in multi-modal understanding, there is increasing potential for video comprehension technologies to support professional domains such as medical education. However, existing benchmarks suffer from two primary limitations: (1) Linguistic Singularity: they are largely confined to English, neglecting the need for multilingual resources; and (2) Shallow Reasoning: their questions are often designed for surface-level information retrieval, failing to properly assess deep multi-modal integration. To address these limitations, we present M3-Med, the first benchmark for Multi-lingual, Multi-modal, and Multi-hop reasoning in Medical instructional video understanding. M3-Med consists of medical questions paired with corresponding video segments, annotated by a team of medical experts. A key innovation of M3-Med is its multi-hop reasoning…
