SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning
Zhishen Yang, Raj Dabre, Hideki Tanaka, Naoaki Okazaki

TL;DR
This paper introduces SciCap+, an extended dataset and a knowledge-augmented approach for scientific figure captioning, demonstrating that additional context improves caption quality, with implications for automating scientific communication.
Contribution
The paper presents SciCap+, an extended dataset with mention-paragraphs and OCR tokens, and evaluates a multimodal transformer model showing improved captioning performance with added context.
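To make the idea of a mention-paragraph concrete, here is a minimal sketch of how the paragraphs that refer to a given figure might be picked out of a paper's body text. This is not the authors' extraction pipeline: the function name `find_mention_paragraphs` and the regular expression are assumptions made for illustration, and the sample paragraphs are invented.

```python
import re

def find_mention_paragraphs(paragraphs, figure_number):
    """Return the paragraphs that refer to a figure as 'Figure N' or 'Fig. N'.

    Hypothetical helper for illustration only; the matching rule is an assumption.
    """
    # \b after the number keeps 'Figure 2' from matching inside 'Figure 21'.
    pattern = re.compile(rf"\bfig(?:ure|\.)?\s*{figure_number}\b", re.IGNORECASE)
    return [p for p in paragraphs if pattern.search(p)]

# Invented example paragraphs, not taken from the dataset.
paragraphs = [
    "As shown in Figure 2, accuracy rises with more context.",
    "Figure 21 reports the ablations.",
    "We describe the dataset in Section 3.",
    "See Fig. 2 for details.",
]
mentions = find_mention_paragraphs(paragraphs, 2)
```

In a setup like this, the matched paragraphs would be paired with the figure image and its OCR tokens as extra input for the captioning model.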
Findings
Mention-paragraphs significantly improve captioning scores.
Human evaluation highlights challenges in generating informative captions.
Knowledge-augmented models outperform figure-only baselines.
Abstract
In scholarly documents, figures provide a straightforward way of communicating scientific findings to readers. Automating figure caption generation helps move model understandings of scientific documents beyond text and will help authors write informative captions that facilitate communicating scientific findings. Unlike previous studies, we reframe scientific figure captioning as a knowledge-augmented image captioning task that models need to utilize knowledge embedded across modalities for caption generation. To this end, we extended the large-scale SciCap dataset (Hsu et al., 2021) to SciCap+ which includes mention-paragraphs (paragraphs mentioning figures) and OCR tokens. Then, we conduct experiments with the M4C-Captioner (a multimodal transformer-based model with a pointer network) as a baseline for our study. Our results indicate that mention-paragraphs serves…
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
