Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

Saehyung Lee; Seunghyun Yoon; Trung Bui; Jing Shi; Sungroh Yoon

arXiv:2412.15484 · cs.CV · July 8, 2025

Open Access · 1 dataset

TL;DR

This paper introduces a multiagent approach and new evaluation metrics to improve the factual accuracy and coverage of hyper-detailed image captions generated by multimodal large language models, addressing hallucination issues.

Contribution

The paper proposes a collaborative multiagent method for correcting captions and introduces a benchmark dataset with evaluation metrics tailored to the factuality and coverage of detailed captions.

Findings

01

The new evaluation method aligns better with human judgments of factuality.

02

The proposed approach significantly improves caption factual accuracy, even for GPT-4V.

03

VQA benchmark performance does not necessarily reflect captioning quality.

Abstract

Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics and that existing approaches to improve the MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method…

Code & Models

Datasets

saehyungl/CapMAS
dataset · 14 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques