MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives

Wisdom O. Ikezogwo; Kevin Zhang; Mehmet Saygin Seyfioglu; Fatemeh Ghezloo; Linda Shapiro; Ranjay Krishna

arXiv:2501.04184·cs.CV·January 15, 2026

Open Access · 1 dataset

TL;DR

MedicalNarratives is a large-scale dataset of 4.7 million medical image-text pairs from YouTube videos, with dense spatial annotations and temporal grounding, enabling improved multimodal medical image understanding.

Contribution

The paper introduces MedicalNarratives, a novel dataset with dense spatial and temporal annotations for medical images, and demonstrates its effectiveness with a new model outperforming existing methods.

Findings

01  GenMedClip surpasses state-of-the-art on 12 medical domains

02  Dataset enables spatiotemporal grounding in medical images

03  Large-scale medical image-text pairs improve multimodal learning

Abstract

Multi-modal models are data hungry. While natural-image datasets are abundant, medical image datasets cannot afford the same luxury. To enable representation learning for medical images at scale, we turn to YouTube, a platform with a large reservoir of open-source medical pedagogical videos. We curate MedicalNarratives, a dataset of 4.7M medical image-text pairs, with 1M samples containing dense annotations in the form of spatial traces (and bounding boxes), and 118K videos centered on the trace event (with aligned text), enabling spatiotemporal grounding beyond single frames. Similar to think-aloud studies, in which instructors speak while hovering their mouse cursor over relevant image regions, 1M images in MedicalNarratives contain localized mouse traces in image pixels, creating a spatial and temporal association between the text and pixels. To evaluate the…
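
The abstract mentions spatial traces alongside bounding boxes. As a rough illustration of how the two annotation forms can relate, the sketch below computes the tightest axis-aligned box around a list of cursor points. This is not the authors' pipeline: the function name `trace_to_bbox`, the `(x, y)` pixel-coordinate format, and the optional `pad` margin are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # assumed format: (x, y) in image pixels
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def trace_to_bbox(trace: List[Point], width: int, height: int, pad: float = 0.0) -> Box:
    """Return the tightest axis-aligned box around a mouse trace.

    Hypothetical helper: `pad` widens the box by a fixed pixel margin,
    and the result is clipped to the image bounds.
    """
    if not trace:
        raise ValueError("trace must contain at least one point")

    xs = [x for x, _ in trace]
    ys = [y for _, y in trace]

    # Expand by the padding, then clip so the box stays inside the image.
    x_min = max(0.0, min(xs) - pad)
    y_min = max(0.0, min(ys) - pad)
    x_max = min(float(width), max(xs) + pad)
    y_max = min(float(height), max(ys) + pad)
    return (x_min, y_min, x_max, y_max)


if __name__ == "__main__":
    # Three cursor positions recorded while a narrator points at a region.
    example_trace = [(10, 20), (30, 5), (25, 40)]
    print(trace_to_bbox(example_trace, width=100, height=100))
```

A box of this kind discards the order and timing of the cursor points, which is why a dataset may keep the raw trace as well when temporal grounding matters.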

Code & Models

Datasets

kzhang20/MedicalNarratives
dataset · 19 dl

Taxonomy

Topics: Empathy and Medical Education

Methods: Contrastive Language-Image Pre-training