Figuring out Figures: Using Textual References to Caption Scientific Figures
Stanley Cao, Kevin Liu

TL;DR
This paper introduces a novel approach for automatically generating scientific figure captions by leveraging a CLIP+GPT-2 model conditioned on images and textual metadata, outperforming previous single-layer LSTM methods.
Contribution
The work presents a new dataset, MetaSciCap, and demonstrates that incorporating textual metadata with advanced encoder-decoder models improves captioning accuracy.
Findings
CLIP+GPT-2 with textual metadata outperforms previous models.
Using only the textual metadata, without the figure image, yields the best captioning performance.
Incorporating paper metadata enhances figure caption generation.
Abstract
Figures are essential channels for densely communicating complex ideas in scientific papers. Previous work in automatically generating figure captions has been largely unsuccessful and has defaulted to using single-layer LSTMs, which no longer achieve state-of-the-art performance. In our work, we use the SciCap datasets curated by Hsu et al. and use a variant of a CLIP+GPT-2 encoder-decoder model with cross-attention to generate captions conditioned on the image. Furthermore, we augment our training pipeline by creating a new dataset MetaSciCap that incorporates textual metadata from the original paper relevant to the figure, such as the title, abstract, and in-text references. We use SciBERT to encode the textual metadata and use this encoding alongside the figure embedding. In our experimentation with different models, we found that the CLIP+GPT-2 model performs better when it…
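As a rough illustration of what "conditioned on the image" and "alongside the figure embedding" could mean in practice, the sketch below runs one step of single-head cross-attention from a decoder state over a memory built from figure and metadata vectors. It is not the authors' implementation: the abstract does not say how the two encodings are combined, so concatenating them into one memory is an assumption, the random vectors are stand-ins for real CLIP and SciBERT outputs, and all names (`cross_attend`, `figure_embeddings`, `metadata_embeddings`) are hypothetical. Learned query/key/value projections are omitted for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, memory):
    """Scaled dot-product attention of one decoder query over memory vectors.

    Simplification: no learned projections, a single head, one query.
    Returns the context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, vec) / scale for vec in memory])
    context = [
        sum(w * vec[i] for w, vec in zip(weights, memory))
        for i in range(len(query))
    ]
    return context, weights


def random_vectors(rng, count, dim):
    """Placeholder embeddings; real ones would come from an encoder."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8

# Stand-ins (assumption): 4 image-patch vectors and 6 metadata-token vectors.
figure_embeddings = random_vectors(rng, 4, DIM)
metadata_embeddings = random_vectors(rng, 6, DIM)

# Assumption: both sources are concatenated into one cross-attention memory.
memory = figure_embeddings + metadata_embeddings

# One decoder hidden state attending over the combined memory.
decoder_state = random_vectors(rng, 1, DIM)[0]
context, weights = cross_attend(decoder_state, memory)

# How much attention mass falls on each source.
figure_mass = sum(weights[: len(figure_embeddings)])
metadata_mass = sum(weights[len(figure_embeddings):])
print(f"figure: {figure_mass:.3f}, metadata: {metadata_mass:.3f}")
```

Dropping `figure_embeddings` from `memory` gives the text-only configuration, in which the decoder sees only the metadata encoding.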
Taxonomy
Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Subtitles and Audiovisual Media
