Figuring out Figures: Using Textual References to Caption Scientific Figures

Stanley Cao; Kevin Liu

arXiv:2407.11008·cs.CL·July 17, 2024

Open Access

TL;DR

This paper introduces a novel approach for automatically generating scientific figure captions by leveraging a CLIP+GPT-2 model conditioned on images and textual metadata, outperforming previous single-layer LSTM methods.

Contribution

The work presents a new dataset MetaSciCap and demonstrates that incorporating textual metadata with advanced encoder-decoder models improves captioning accuracy.

Findings

01 CLIP+GPT-2 with textual metadata outperforms previous models.

02 Using only textual metadata yields the best captioning performance.

03 Incorporating paper metadata enhances figure caption generation.

Abstract

Figures are essential channels for densely communicating complex ideas in scientific papers. Previous work in automatically generating figure captions has been largely unsuccessful and has defaulted to using single-layer LSTMs, which no longer achieve state-of-the-art performance. In our work, we use the SciCap datasets curated by Hsu et al. and use a variant of a CLIP+GPT-2 encoder-decoder model with cross-attention to generate captions conditioned on the image. Furthermore, we augment our training pipeline by creating a new dataset MetaSciCap that incorporates textual metadata from the original paper relevant to the figure, such as the title, abstract, and in-text references. We use SciBERT to encode the textual metadata and use this encoding alongside the figure embedding. In our experimentation with different models, we found that the CLIP+GPT-2 model performs better when it…
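
The cross-attention step described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the random vectors below merely stand in for CLIP figure features, SciBERT metadata features, and GPT-2 decoder states, and every name, dimension, and weight is hypothetical. The abstract says only that the SciBERT encoding is used "alongside" the figure embedding, so concatenating the two along the sequence axis is an assumption, as is a shared embedding width.

```python
import math
import random


def project(rows, weight):
    """Multiply each row vector (length d_in) by a d_in x d_out weight matrix."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in rows]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: queries attend over memory."""
    q = project(queries, w_q)
    k = project(memory, w_k)
    v = project(memory, w_v)
    scale = math.sqrt(len(q[0]))
    outputs, weights = [], []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of value vectors, one output dimension at a time.
        outputs.append(
            [sum(a * v_row[j] for a, v_row in zip(attn, v)) for j in range(len(v[0]))]
        )
    return outputs, weights


def random_matrix(rng, n_rows, n_cols):
    """Matrix of Gaussian noise, used here as a placeholder for real features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
d_model = 8  # hypothetical shared width; real encoders are far wider

figure_embeddings = random_matrix(rng, 4, d_model)    # stand-in for CLIP image features
metadata_embeddings = random_matrix(rng, 6, d_model)  # stand-in for SciBERT token features
decoder_states = random_matrix(rng, 3, d_model)       # stand-in for GPT-2 hidden states

# Assumption: figure and metadata encodings are concatenated along the sequence axis.
memory = figure_embeddings + metadata_embeddings

w_q, w_k, w_v = (random_matrix(rng, d_model, d_model) for _ in range(3))
attended, attn_weights = cross_attention(decoder_states, memory, w_q, w_k, w_v)
```

Each row of `attn_weights` is a distribution over the ten memory positions (four figure, six metadata), so inspecting it shows how much a given caption token draws on the image versus the text. Dropping `figure_embeddings` from `memory` gives a text-only variant of the same mechanism.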

Taxonomy

Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Subtitles and Audiovisual Media