Benchmarking and Improving Detail Image Caption

Hongyuan Dong; Jiawen Li; Bohong Wu; Jiacong Wang; Yuan Zhang; Haoyuan Guo

arXiv:2405.19092 · cs.CV · July 9, 2024 · 3 citations


TL;DR

This paper introduces a new benchmark and a reliable evaluation metric for detailed image captioning, and demonstrates how to enhance LVLMs' captioning abilities through a self-supervised data synthesis pipeline.

Contribution

It proposes a high-quality dataset and a novel CAPTURE metric for detailed image captioning evaluation, and develops a self-looping data construction pipeline to improve LVLM performance.

Findings

1. CAPTURE achieves higher consistency with expert judgments than existing metrics.

2. The data synthesis pipeline significantly enhances LVLMs' captioning quality.

3. Model performance improves with iterative self-looping data refinement.

Abstract

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, little large vision-language model (LVLM) research has discussed models' image captioning performance, because existing short-caption benchmarks are outdated and evaluation metrics are unreliable. In this work, we propose to benchmark the detail image caption task by curating high-quality evaluation datasets annotated by human experts, GPT-4V and Gemini-1.5-Pro. We also design a more reliable caption evaluation metric called CAPTURE (CAPtion evaluation by exTracting and coUpling coRE information). CAPTURE extracts visual elements, e.g., objects, attributes and relations, from captions, and then matches these elements through three stages, achieving higher consistency with expert judgements than other rule-based or model-based caption metrics. The proposed benchmark and metric provide reliable…
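
The extract-then-match idea in the abstract can be illustrated with a small sketch. The abstract does not say what the three matching stages are, so the stages used here (exact match, synonym match, fuzzy string match) and the F1-style score are assumptions chosen for illustration, not the paper's definition. The names `SYNONYMS`, `normalize`, and `three_stage_f1` are hypothetical, and element extraction is assumed to have already happened upstream. For the actual metric, see the official repository listed under Code & Models.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table (assumption); a real system would use a lexical resource.
SYNONYMS = {"sofa": "couch", "puppy": "dog"}


def normalize(element):
    """Lowercase, trim, and map a known synonym to its canonical form."""
    element = element.lower().strip()
    return SYNONYMS.get(element, element)


def _consume(cand, ref, is_match):
    """Greedily pair each candidate with one unused reference element.

    Returns the unmatched candidates, the unmatched references,
    and the number of pairs formed.
    """
    left_cand, left_ref, hits = [], list(ref), 0
    for c in cand:
        for i, r in enumerate(left_ref):
            if is_match(c, r):
                del left_ref[i]
                hits += 1
                break
        else:
            left_cand.append(c)
    return left_cand, left_ref, hits


def three_stage_f1(candidate, reference, fuzzy_threshold=0.8):
    """Score extracted elements with three assumed matching stages."""
    stages = [
        # Stage 1 (assumed): exact string match.
        lambda c, r: c == r,
        # Stage 2 (assumed): match after synonym normalization.
        lambda c, r: normalize(c) == normalize(r),
        # Stage 3 (assumed): fuzzy string similarity above a threshold.
        lambda c, r: SequenceMatcher(None, normalize(c), normalize(r)).ratio()
        >= fuzzy_threshold,
    ]
    cand, ref, hits = list(candidate), list(reference), 0
    for is_match in stages:
        # Elements left unmatched by one stage are passed to the next.
        cand, ref, n = _consume(cand, ref, is_match)
        hits += n
    if not candidate or not reference or hits == 0:
        return 0.0
    precision = hits / len(candidate)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Elements assumed to be extracted from a model caption and a reference caption.
score = three_stage_f1(
    ["dog", "sofa", "red ball"],
    ["dog", "couch", "red balls", "window"],
)
```

In this toy example each candidate element is matched at a different stage, and the reference element "window" stays unmatched, which lowers recall. Running the stages in order of strictness means a looser stage only sees elements the stricter ones could not pair.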

Code & Models

Repositories

foundation-multimodal-models/capture
PyTorch · Official

Datasets

foundation-multimodal-models/DetailCaps-4870
dataset · 678 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging