From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature
Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung-Levy

TL;DR
This paper introduces Panel2Patch, a hierarchical data pipeline that extracts multi-granular vision-language supervision from biomedical literature figures, improving pretraining effectiveness by preserving local semantics and enabling finer-grained understanding.
Contribution
The paper presents a novel method for mining hierarchical structure from biomedical figures and text, creating multi-level vision-language pairs for more effective pretraining.
Findings
Enhanced performance with less pretraining data
More effective supervision than prior pipelines
Improved understanding of local figure semantics
Abstract
There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels,…
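The three levels of supervision described in the abstract can be pictured with a small data-structure sketch. This is an illustration only, not the authors' implementation: the class names (`Figure`, `Panel`, `Marker`, `Pair`), the `build_pairs` function, the bounding-box convention, and the toy captions are all assumptions made for this example. It also assumes that layout parsing and marker detection have already run and produced boxes and text snippets.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Marker:
    """A visual marker (arrow, label, ...) with the text tied to it."""
    box: Box
    text: str


@dataclass
class Panel:
    """One sub-panel of a multi-panel figure."""
    box: Box
    caption: str
    markers: List[Marker] = field(default_factory=list)


@dataclass
class Figure:
    """A whole figure with its full caption."""
    box: Box
    caption: str
    panels: List[Panel] = field(default_factory=list)


@dataclass
class Pair:
    """One image-region/text training pair at a given granularity."""
    level: str  # "figure", "panel", or "patch"
    box: Box
    text: str


def build_pairs(fig: Figure) -> List[Pair]:
    """Flatten an already-parsed figure into figure/panel/patch pairs."""
    pairs = [Pair("figure", fig.box, fig.caption)]
    for panel in fig.panels:
        pairs.append(Pair("panel", panel.box, panel.caption))
        for marker in panel.markers:
            # Each marker region becomes a patch-level pair.
            pairs.append(Pair("patch", marker.box, marker.text))
    return pairs


# Toy input with made-up boxes and captions, for illustration only.
toy = Figure(
    box=(0, 0, 800, 400),
    caption="(A) Stained tissue section. (B) Corresponding scan.",
    panels=[
        Panel(
            box=(0, 0, 400, 400),
            caption="Stained tissue section.",
            markers=[Marker(box=(120, 90, 184, 154), text="region marked by arrow")],
        ),
        Panel(box=(400, 0, 800, 400), caption="Corresponding scan."),
    ],
)
pairs = build_pairs(toy)
```

In this sketch a single figure yields one figure-level pair, one pair per panel, and one patch-level pair per marker, so local regions keep their own text instead of being folded into the figure caption.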
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Biomedical Text Mining and Ontologies
