Benchmarking and Improving Detail Image Caption
Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

TL;DR
This paper introduces a new benchmark and a reliable evaluation metric for detailed image captioning, and demonstrates how to enhance LVLMs' captioning abilities through a self-supervised data synthesis pipeline.
Contribution
It proposes a high-quality dataset and a novel CAPTURE metric for detailed image captioning evaluation, and develops a self-looping data construction pipeline to improve LVLM performance.
Findings
CAPTURE achieves higher consistency with expert judgments than existing metrics.
The data synthesis pipeline significantly enhances LVLMs' captioning quality.
Model performance improves with iterative self-looping data refinement.
Abstract
Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, little large vision-language model (LVLM) research discusses models' image captioning performance, because existing short-caption benchmarks are outdated and evaluation metrics are unreliable. In this work, we propose to benchmark the detail image caption task by curating high-quality evaluation datasets annotated by human experts, GPT-4V and Gemini-1.5-Pro. We also design a more reliable caption evaluation metric called CAPTURE (CAPtion evaluation by exTracting and coUpling coRE information). CAPTURE extracts visual elements (e.g., objects, attributes and relations) from captions and then matches these elements through three stages, achieving the highest consistency with expert judgements among rule-based and model-based caption metrics. The proposed benchmark and metric provide reliable…
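The abstract describes CAPTURE only as extracting visual elements and matching them through three stages. The following is a minimal sketch of what such staged matching could look like for one element type (for example, objects), not the authors' implementation. The stage definitions (exact, synonym, soft), the `SYNONYMS` table, the `soft_threshold` value, the string-similarity stand-in for soft matching, and the F1 aggregation are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table (assumption); a real system would use a lexical resource.
SYNONYMS = {"couch": "sofa", "kid": "child"}


def normalize(term):
    """Lowercase a term and map it to a canonical synonym if one is listed."""
    term = term.lower().strip()
    return SYNONYMS.get(term, term)


def match_elements(candidate, reference, soft_threshold=0.8):
    """Score the overlap of two element lists in three assumed stages."""
    cand = list(candidate)
    ref = list(reference)
    score = 0.0

    # Stage 1 (assumed): exact string matches earn full credit.
    for c in cand[:]:
        if c in ref:
            ref.remove(c)
            cand.remove(c)
            score += 1.0

    # Stage 2 (assumed): matches after synonym normalization earn full credit.
    for c in cand[:]:
        for r in ref:
            if normalize(c) == normalize(r):
                ref.remove(r)
                cand.remove(c)
                score += 1.0
                break

    # Stage 3 (assumed): soft matches earn partial credit equal to similarity.
    # Character-level similarity stands in for a learned embedding similarity.
    for c in cand:
        if not ref:
            break
        best = max(ref, key=lambda r: SequenceMatcher(None, c, r).ratio())
        sim = SequenceMatcher(None, c, best).ratio()
        if sim >= soft_threshold:
            ref.remove(best)
            score += sim

    return score


def f1(candidate, reference):
    """Combine matched score into precision, recall, and F1."""
    if not candidate or not reference:
        return 0.0
    matched = match_elements(candidate, reference)
    precision = matched / len(candidate)
    recall = matched / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, `f1(["dog", "couch"], ["dog", "sofa"])` matches "dog" exactly and "couch" through the hypothetical synonym table. A full metric would presumably compute such a score separately for objects, attributes and relations and then combine them; how that combination is done is not stated in the text above.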
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging
