TL;DR
PubMed-Ophtha is a comprehensive, high-resolution dataset of ophthalmological images and captions extracted from open-access articles, designed to facilitate training vision-language models in ophthalmology.
Contribution
The paper introduces a novel hierarchical dataset with detailed annotations and a pipeline for extracting and annotating ophthalmology figures from scientific literature.
Findings
The dataset contains 102,023 high-quality image-caption pairs extracted from 15,842 open-access PubMed Central articles.
Caption segmentation achieved a mean sentence BLEU score of 0.913 on human-annotated data.
Panel and image detection models reached mean average precision (mAP) scores of 0.909 and 0.892, respectively.
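To make the caption segmentation metric concrete, here is a minimal sketch of a mean sentence BLEU computation: BLEU is scored per reference/prediction pair and then averaged. The tokenization (whitespace), the n-gram order (up to 4), and the add-one smoothing are assumptions chosen for illustration; the summary does not say which BLEU variant the authors used, and the example subcaptions are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Smoothed sentence-level BLEU for one reference/hypothesis pair.

    Assumptions (not taken from the paper): whitespace tokenization,
    uniform weights over 1..max_n-grams, add-one smoothed precisions.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: hypotheses shorter than the reference are penalized.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def mean_sentence_bleu(pairs):
    """Average sentence BLEU over (reference, hypothesis) pairs."""
    return sum(sentence_bleu(ref, hyp) for ref, hyp in pairs) / len(pairs)


# Illustrative subcaptions only; these are not drawn from the dataset.
pairs = [
    ("Color fundus photograph of the right eye", "Color fundus photograph of the right eye"),
    ("OCT scan through the fovea showing subretinal fluid", "OCT scan through the fovea"),
]
score = mean_sentence_bleu(pairs)
```

An exact match scores 1.0, while a prediction that drops the end of the reference is pulled down by the brevity penalty, so the mean reflects how often the predicted subcaptions reproduce the human-annotated ones in full.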
Abstract
Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and…
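The hierarchy described in the abstract (figure, then panels with identifiers and subcaptions, then individual images with a modality and a mark status) could be modeled roughly as follows. This is a hypothetical sketch: the class names, field names, and bounding-box layout are assumptions for illustration and are not the dataset's published schema. Only the four modality categories come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Modality(Enum):
    # The four categories named in the abstract.
    COLOR_FUNDUS = "color fundus photography"
    OCT = "optical coherence tomography"
    RETINAL_IMAGING = "retinal imaging"
    OTHER = "other"


@dataclass
class Image:
    # Hypothetical layout: (x_min, y_min, x_max, y_max) in figure pixels.
    bbox: Tuple[int, int, int, int]
    modality: Modality
    has_marks: bool  # True if annotation marks such as arrows are present


@dataclass
class Panel:
    identifier: Optional[str]  # e.g. "A"; None if the panel is unlabeled
    subcaption: str            # panel-level text split from the figure caption
    images: List[Image] = field(default_factory=list)


@dataclass
class Figure:
    article_id: str  # hypothetical field for the source article
    caption: str
    panels: List[Panel] = field(default_factory=list)

    def image_caption_pairs(self):
        """Flatten the hierarchy into (image, subcaption) pairs."""
        return [(img, panel.subcaption) for panel in self.panels for img in panel.images]


# Illustrative record with made-up values.
figure = Figure(
    article_id="PMC0000000",
    caption="(A) Fundus photograph. (B) OCT scans before and after treatment.",
    panels=[
        Panel("A", "Fundus photograph.", [Image((0, 0, 400, 400), Modality.COLOR_FUNDUS, True)]),
        Panel("B", "OCT scans before and after treatment.", [
            Image((410, 0, 800, 190), Modality.OCT, False),
            Image((410, 200, 800, 400), Modality.OCT, False),
        ]),
    ],
)
pairs = figure.image_caption_pairs()
```

Flattening in this way yields one image-caption pair per individual image, which is the granularity at which the 102,023 pairs are counted.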
