Multi-LLM Collaborative Caption Generation in Scientific Documents
Jaeyoung Kim, Jongho Lee, Hong-Jun Choi, Ting-Yao Hsu, Chieh-Yang Huang, Sungchul Kim, Ryan Rossi, Tong Yu, Clyde Lee Giles, Ting-Hao 'Kenneth' Huang, Sungchul Choi

TL;DR
This paper presents MLBCAP, a collaborative framework in which multiple specialized large language models generate contextually rich figure captions for scientific documents. Its captions outperform existing methods and, in human evaluations, are preferred over human-written captions.
Contribution
The paper introduces a novel multi-LLM framework that improves scientific figure captioning through three modules: data quality assessment, diverse caption generation, and caption selection. Together these address limitations of prior single-model approaches.
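The three modules named above can be sketched as a small pipeline. This is only an illustrative outline under stated assumptions, not the authors' implementation: the `Example` record, the callable types, the function names, and the score threshold are all hypothetical, and any real system would plug actual LLM calls in place of the callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """Hypothetical record for one figure and its surrounding text."""
    figure_id: str
    context: str   # e.g. paragraphs that mention the figure
    caption: str   # existing caption (may be low quality)


# Hypothetical interfaces; each would wrap a model call in practice.
QualityScorer = Callable[[Example], float]          # higher means better caption
CaptionGenerator = Callable[[Example], str]         # one specialized generator
CaptionJudge = Callable[[Example, List[str]], int]  # index of the chosen candidate


def filter_training_data(
    examples: List[Example], scorer: QualityScorer, threshold: float
) -> List[Example]:
    """Quality assessment: keep only examples scored at or above the threshold."""
    return [ex for ex in examples if scorer(ex) >= threshold]


def generate_candidates(
    example: Example, generators: List[CaptionGenerator]
) -> List[str]:
    """Diverse caption generation: one candidate per generator."""
    return [generate(example) for generate in generators]


def select_caption(
    example: Example, candidates: List[str], judge: CaptionJudge
) -> str:
    """Caption selection: return the candidate the judge picks."""
    return candidates[judge(example, candidates)]


def caption_figure(
    example: Example, generators: List[CaptionGenerator], judge: CaptionJudge
) -> str:
    """Run generation and selection for a single figure."""
    candidates = generate_candidates(example, generators)
    return select_caption(example, candidates, judge)
```

In this sketch, `filter_training_data` would run once over the training set before any generator is fine-tuned, while `caption_figure` would run per figure at inference time. Keeping the scorer, generators, and judge as plain callables makes it easy to swap in different models for each sub-task.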
Findings
Human evaluations favor MLBCAP-generated captions over human-written ones.
Filtering low-quality data improves captioning performance.
Collaborative LLMs produce more informative captions than existing methods.
Abstract
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of…
Taxonomy
Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Text Readability and Simplification
