FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback
Ashish Singh, Ashutosh Singh, Prateek Agarwal, Zixuan Huang, Arpita Singh, Tong Yu, Sungchul Kim, Victor Bursztyn, Nesreen K. Ahmed, Puneet Mathur, Erik Learned-Miller, Franck Dernoncourt, Ryan A. Rossi

TL;DR
This paper introduces FigCaps-HF, a framework that improves scientific figure captioning by incorporating human feedback through reinforcement learning, resulting in more helpful and reader-aligned captions, and provides a new benchmark dataset.
Contribution
The paper presents a novel RLHF-based framework for figure captioning that integrates domain expert feedback and introduces a large-scale benchmark dataset.
Findings
RLHF improves caption quality metrics significantly.
The framework achieves up to 35.7% gain in ROUGE scores.
The dataset enables further research in human-aligned figure captioning.
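The reported ROUGE gain can be made concrete with a small sketch. This summary does not say which ROUGE variant was used or whether the 35.7% is a relative or absolute gain, so the code below assumes ROUGE-L (F-measure over the longest common subsequence) and a gain relative to the baseline score. The function names are illustrative, not taken from the paper's code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure on whitespace tokens (assumed variant, no stemming)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def relative_gain(baseline, improved):
    """Gain of `improved` over `baseline`, in percent of the baseline score."""
    return 100.0 * (improved - baseline) / baseline
```

Under these assumptions, a gain figure is obtained by averaging `rouge_l_f1` over a test set for the baseline and for the fine-tuned model, then passing the two averages to `relative_gain`.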
Abstract
Captions are crucial for understanding scientific visualizations and documents. Existing captioning methods for scientific figures rely on figure-caption pairs extracted from documents for training, many of which fall short with respect to metrics like helpfulness, explainability, and visual-descriptiveness [15], leading to generated captions being misaligned with reader preferences. To enable the generation of high-quality figure captions, we introduce FigCaps-HF, a new framework for figure-caption generation that can incorporate domain expert feedback in generating captions optimized for reader preferences. Our framework comprises 1) an automatic method for evaluating the quality of figure-caption pairs, and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences. We demonstrate the effectiveness of our…
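The abstract names two components, a scorer for figure-caption pairs and a feedback-driven optimization step, without giving the mechanism that connects them. As a rough illustration only, one simple offline way to feed scores back into a captioning model is to prefix each training caption with a quality token chosen by thresholding the score. Everything in the sketch below is an assumption made for illustration: the `ScoredPair` type, the `<good>`/`<bad>` tokens, and the median threshold are hypothetical and are not confirmed by this summary as the authors' method.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ScoredPair:
    figure_id: str
    caption: str
    reward: float  # hypothetical output of a learned figure-caption scorer


def add_feedback_tokens(pairs, good="<good>", bad="<bad>"):
    """Prefix each caption with a quality token based on its reward.

    Captions scoring at or above the median reward get `good`, the rest `bad`.
    The returned (figure_id, text) tuples could then serve as fine-tuning
    targets for a captioning model (assumed usage, not shown here).
    """
    threshold = median(p.reward for p in pairs)
    return [
        (p.figure_id, f"{good if p.reward >= threshold else bad} {p.caption}")
        for p in pairs
    ]
```

At inference time, such a model would be prompted with the `good` token so that generation is steered toward captions resembling the highly scored ones.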
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. This work introduces a framework that incorporates domain expert feedback for figure caption generation, an area that is underexplored.
2. The lightweight reward model is efficiently trained on a small dataset to score figure-caption pairs, yielding impressive results as demonstrated in Table 1.
3. The reinforcement learning approach effectively enhances the caption model's performance.
1. The paper lacks discussion of related work, including notable studies such as ArxivCap [1] and MMSci [2].
2. There is no evaluation of current state-of-the-art multimodal language models, including proprietary models like GPT-4, Claude 3.5, and Gemini, or open-source models like LLaVA, Qwen2-VL, InternVL, and MiniCPM.
3. It is unclear how the learned reward model ensures robust performance given the limited training data. Additionally, there is insufficient detail on the design choices and a
1. The paper is well written and easy to follow.
2. I appreciate the release of the dataset and documentation, which are a good contribution.
3. The conducted experiments and ablations, within the scope of the paper, are well motivated and justify the presented approach.
1. The presented baselines are missing some key models: there are many new models, such as LLaVA-NeXT, that should be used as baselines. Granted, they are larger models, but in general, how does the zero-shot performance of these multimodal LLMs compare against the presented method?
2. The RLHF component should also be compared against existing methods such as direct preference optimization (DPO) to fully understand the advantages over the human feedback component.
3. There are no experiments demon
**Incorporation of Domain Expert Feedback**: The FigCaps-HF framework can incorporate domain expert feedback to optimize generated captions, making them more suitable for readers' needs.
**Combination of Automatic Evaluation and RLHF**: The framework includes an automatic method for evaluating the quality of figure-caption pairs and utilizes a reinforcement learning with human feedback (RLHF) approach to further optimize the generative model.
**Release of a Large-Scale Benchmark Dataset**: A l
**Layout and Aesthetic Issues**: The layout of the paper could be improved, and the figures are not visually appealing, which affects the overall presentation and readability.
**Weak Baseline Models**: The paper uses BLIP and ViT+GPT2 as baseline models, which are relatively weak. Given the rapid advancement of multimodal large models, experiments should be conducted on more and stronger baselines to comprehensively validate the effectiveness of the proposed approach.
- Figure-to-caption generation is an interesting problem with great potential for aiding scientific paper writing.
- The idea of incorporating expert feedback into figure-to-caption generation is intuitive and effective, as demonstrated by the experiments.
While the explorations and efforts of this paper are greatly appreciated, I think this paper can be improved quite a lot by addressing the following concerns.
- Old backbone used: Large vision-language models (e.g., the LLaVA series, Qwen-VL) are evolving rapidly and have shown promising results on various benchmarks, including figure-to-caption [1]. I would suggest the authors incorporate the results of commercial models to serve as an indicator for the SOTA performance and perfor
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
Methods: BLIP: Bootstrapping Language-Image Pre-training · Balanced Selection
