Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023

Ting-Yao E. Hsu; Yi-Li Hsu; Shaurya Rohatgi; Chieh-Yang Huang; Ho Yin Sam Ng; Ryan Rossi; Sungchul Kim; Tong Yu; Lun-Wei Ku; C. Lee Giles; Ting-Hao K. Huang

arXiv:2501.19353·cs.CL·February 19, 2025

TL;DR

This paper evaluates whether large multimodal models have effectively solved scientific figure captioning, based on the SciCap Challenge 2023, revealing GPT-4V's superior caption quality as perceived by professional editors.

Contribution

It provides an overview of the SciCap Challenge 2023 and analyzes the performance of various models, highlighting GPT-4V's dominance in caption quality.

Findings

01. GPT-4V captions preferred by professional editors
02. Most models did not outperform human-written captions
03. The challenge revealed gaps in current multimodal models' capabilities

Abstract

Since the SciCap dataset's launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SciCap Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications