FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

Jifeng Song; Arun Das; Pan Wang; Hui Ji; Kun Zhao; Yufei Huang

arXiv:2601.08026 · cs.CV · March 31, 2026


TL;DR

FigEx2 is a visual-conditioned framework that localizes the panels of a scientific compound figure and generates a caption for each panel directly from the image, converting figures with missing or very short captions into aligned panel-text pairs for pretraining and retrieval.

Contribution

The paper introduces a panel detection and captioning method with a noise-aware gated fusion module, a staged SFT+RL training strategy, and a curated benchmark (BioSci-Fig-Cap), and reports zero-shot transfer across scientific domains.

Findings

01

Achieves 0.728 mAP@0.5:0.95 for panel detection.

02

Outperforms existing models on the METEOR and BERTScore captioning metrics.

03

Transfers zero-shot to physics and chemistry figures without fine-tuning.
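The detection figure in the first finding is a COCO-style mean average precision, averaged over intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the matching criterion behind that metric: how IoU is computed for one predicted panel box against one ground-truth box, and which thresholds that prediction would satisfy. The function names and the `(x1, y1, x2, y2)` box format are assumptions for illustration, and this is not the paper's evaluation code or a full mAP implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds used by the 0.5:0.95 convention: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which this prediction would count as a correct match."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)     # hypothetical ground-truth panel
    pred = (0.0, 0.0, 10.0, 8.2)    # hypothetical predicted panel
    print(iou(pred, gt), thresholds_passed(pred, gt))
```

A prediction that overlaps its panel loosely still counts at the 0.50 threshold but fails the stricter ones, which is why the averaged score rewards tight panel boundaries.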

Abstract

Scientific compound figures combine multiple labeled panels into a single image. However, in a PMC-scale crawl of 346,567 compound figures, 16.3% have no caption and 1.8% only have captions shorter than ten words, causing them to be discarded by existing caption-decomposition pipelines. We propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the image, converting otherwise unusable figures into aligned panel-text pairs for downstream pretraining and retrieval. To mitigate linguistic variance in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively controls how caption features condition the detection query space, and employ a staged SFT+RL strategy with CLIP-based alignment and BERTScore-based semantic rewards. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark…
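The abstract describes the noise-aware gated fusion module only at a high level: a gate adaptively controls how much caption features condition the detection queries. As a rough illustration of that general idea, the sketch below applies a generic scalar sigmoid gate to a residual fusion of two feature vectors. The formula, the function `gated_fuse`, and the parameters `gate_w` and `gate_b` are all assumptions made here, not the paper's architecture.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(query, caption, gate_w, gate_b):
    """Generic gated residual fusion (an assumed form, not FigEx2's module).

    g     = sigmoid(gate_w . [query; caption] + gate_b)
    fused = query + g * caption

    When g is near 0 the caption features are ignored and the detection
    query passes through unchanged; when g is near 1 they are fully added.
    """
    concat = list(query) + list(caption)
    g = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)
    fused = [q + g * c for q, c in zip(query, caption)]
    return fused, g


if __name__ == "__main__":
    query = [1.0, 2.0]       # hypothetical detection query feature
    caption = [4.0, -2.0]    # hypothetical caption feature
    # Zero weights and zero bias give a gate of exactly 0.5.
    print(gated_fuse(query, caption, [0.0] * 4, 0.0))
    # A strongly negative bias closes the gate, suppressing the caption signal.
    print(gated_fuse(query, caption, [0.0] * 4, -50.0))
```

In a trained model the gate parameters would be learned, so that noisy or uninformative caption features receive a small gate value and leave the detection queries largely untouched.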
