From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Kun Yuan; Min Woo Sun; Zhen Chen; Alejandro Lozano; Xiangteng He; Shi Li; Nassir Navab; Xiaoxiao Sun; Nicolas Padoy; Serena Yeung-Levy

arXiv:2512.02566 · cs.CV · March 26, 2026

Open Access

TL;DR

This paper introduces Panel2Patch, a hierarchical data pipeline that extracts multi-granular vision-language supervision from biomedical literature figures, improving pretraining effectiveness by preserving local semantics and enabling finer-grained understanding.

Contribution

The paper presents a novel method for mining hierarchical structure from biomedical figures and text, creating multi-level vision-language pairs for more effective pretraining.

Findings

01. Enhanced performance with less pretraining data

02. More effective supervision than prior pipelines

03. Improved understanding of local figure semantics

Abstract

There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels,…
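The three levels of supervision named in the abstract (figure, panel, patch) can be pictured as a small nested data structure. The following is a minimal sketch and not the authors' implementation: the names `Pair`, `contains`, and `build_pairs`, the `(left, top, right, bottom)` box format, and the toy inputs are all assumptions made for illustration. It also assumes that panel boxes, marker boxes, and their matching text snippets have already been produced by some upstream parser.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Pair:
    level: str  # "figure", "panel", or "patch"
    box: Box    # image region the text describes
    text: str   # caption or caption fragment aligned with that region


def contains(outer: Box, inner: Box) -> bool:
    """Return True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def build_pairs(figure_box: Box,
                caption: str,
                panels: List[Tuple[Box, str]],
                markers: List[Tuple[Box, str]]) -> List[Pair]:
    """Emit one figure-level pair, one pair per panel, and one
    patch-level pair for each marker region that falls inside a panel."""
    pairs = [Pair("figure", figure_box, caption)]
    for panel_box, panel_text in panels:
        pairs.append(Pair("panel", panel_box, panel_text))
        for marker_box, marker_text in markers:
            if contains(panel_box, marker_box):
                pairs.append(Pair("patch", marker_box, marker_text))
    return pairs


# Toy inputs, invented purely to exercise the sketch.
pairs = build_pairs(
    figure_box=(0, 0, 200, 100),
    caption="Two-panel example figure.",
    panels=[((0, 0, 100, 100), "Panel A text"),
            ((100, 0, 200, 100), "Panel B text")],
    markers=[((20, 30, 40, 50), "Region indicated by an arrow")],
)
for p in pairs:
    print(p.level, p.box, p.text)
```

Keeping every pair in one flat list with a `level` tag is only one possible design; it makes it easy to sample across granularities during pretraining, but a tree that stores each patch under its parent panel would preserve the hierarchy more explicitly.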

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Biomedical Text Mining and Ontologies