Multi-LLM Collaborative Caption Generation in Scientific Documents

Jaeyoung Kim; Jongho Lee; Hong-Jun Choi; Ting-Yao Hsu; Chieh-Yang Huang; Sungchul Kim; Ryan Rossi; Tong Yu; Clyde Lee Giles; Ting-Hao 'Kenneth' Huang; Sungchul Choi

arXiv:2501.02552·cs.CL·January 7, 2025

TL;DR

This paper presents MLBCAP, a collaborative framework that uses multiple specialized large language models to generate contextually rich figure captions for scientific documents. Its captions are more informative than those of existing methods, and human evaluators preferred them over human-written captions.

Contribution

The paper introduces a multi-LLM framework that improves scientific figure captioning in three stages: assessing the quality of the training data, generating diverse candidate captions, and selecting among those candidates. Together these stages address the limitations of prior single-model approaches.
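
The three stages named above can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `FigureExample`, `filter_by_quality`, `generate_candidates`, and `select_caption` are hypothetical, the LLMs are stood in for by plain Python callables, and picking the highest-scoring candidate is an assumed selection rule that the summary does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class FigureExample:
    """Hypothetical record for one figure; field names are illustrative."""
    figure_id: str
    context: str  # text from the paper that refers to the figure
    caption: str  # caption found in the training data


def filter_by_quality(
    examples: Sequence[FigureExample],
    score: Callable[[FigureExample], float],
    threshold: float,
) -> List[FigureExample]:
    """Stage 1: keep only examples whose caption scores at or above the threshold."""
    return [ex for ex in examples if score(ex) >= threshold]


def generate_candidates(
    example: FigureExample,
    generators: Sequence[Callable[[FigureExample], str]],
) -> List[str]:
    """Stage 2: ask each generator (a stand-in for an LLM) for one candidate."""
    return [generate(example) for generate in generators]


def select_caption(
    example: FigureExample,
    candidates: Sequence[str],
    judge: Callable[[FigureExample, str], float],
) -> str:
    """Stage 3: return the candidate the judge scores highest (assumed rule)."""
    if not candidates:
        raise ValueError("no candidate captions to select from")
    return max(candidates, key=lambda caption: judge(example, caption))


# Toy stand-ins so the sketch runs end to end; real scorers would be model calls.
examples = [
    FigureExample("fig1", "Accuracy rises with more training steps.", "Fig. 1"),
    FigureExample("fig2", "Loss falls as the batch size grows.",
                  "Loss versus batch size for three models"),
]
kept = filter_by_quality(examples, score=lambda ex: len(ex.caption.split()), threshold=3)
generators = [
    lambda ex: ex.caption,                      # reuse the existing caption
    lambda ex: f"{ex.caption}. {ex.context}",   # caption enriched with context
]
candidates = generate_candidates(kept[0], generators)
best = select_caption(kept[0], candidates, judge=lambda ex, c: len(c.split()))
print(best)
```

In this toy run the word-count scorer and judge are placeholders chosen only to make the control flow visible; the design point is that filtering, generation, and selection are separate steps that can each be backed by a different model.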

Findings

01 Human evaluations favor MLBCAP-generated captions over human-written ones.

02 Filtering low-quality data improves captioning performance.

03 Collaborative LLMs produce more informative captions than existing methods.

Abstract

Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of…

Code & Models

Repositories

teamreboott/mlbcap
Official

Datasets

TEAMREBOOTT-AI/SciCap-MLBCAP
dataset · 91 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Text Readability and Simplification