MedHorizon: Towards Long-context Medical Video Understanding in the Wild
Bodong Du, Bowen Liu, Yang Yu, Xinpeng Ding, Zhiheng Wu, Shuning Wang, Shuo Nie, Naiming Liu, Qifeng Chen, Yangqiu Song, Xiaomeng Li

TL;DR
MedHorizon introduces a challenging benchmark for long-context medical video understanding, emphasizing sparse evidence retrieval and multi-hop reasoning in full-length clinical procedures.
Contribution
This work presents MedHorizon, a new in-the-wild benchmark that features sparse evidence and supports comprehensive evaluation of medical video understanding models.
Findings
Current models achieve only 41.1% accuracy on MedHorizon.
Performance does not reliably improve with more frames.
Evidence retrieval and reasoning are primary bottlenecks.
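A minimal sketch of why sparse evidence makes retrieval hard under uniform frame sampling. It is an illustration only, not the paper's evaluation protocol: the video duration, the evidence span, the frame budgets, and the helper names `uniform_frame_times` and `frames_hitting_evidence` are all hypothetical.

```python
def uniform_frame_times(duration_s, num_frames):
    """Return the midpoints (in seconds) of num_frames equal-width bins."""
    step = duration_s / num_frames
    return [(i + 0.5) * step for i in range(num_frames)]


def frames_hitting_evidence(times, evidence_spans):
    """Keep the sampled timestamps that fall inside any (start, end) span."""
    return [
        t for t in times
        if any(start <= t <= end for start, end in evidence_spans)
    ]


# Hypothetical numbers: a one-hour procedure with a single 5-second
# evidence span. These values are NOT taken from the benchmark.
duration_s = 3600.0
evidence_spans = [(1234.0, 1239.0)]

for num_frames in (8, 32, 128, 720):
    times = uniform_frame_times(duration_s, num_frames)
    hits = frames_hitting_evidence(times, evidence_spans)
    print(f"{num_frames:4d} frames sampled, {len(hits)} inside the evidence span")
```

With these made-up numbers, a frame budget in the low hundreds can still skip a short evidence span entirely. The sketch only covers whether a frame lands on the evidence, not whether a model can reason over it once it does.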
Abstract
Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with…
