ReflectCAP: Detailed Image Captioning with Reflective Memory

Kyungmin Min; Minbeom Kim; Kang-il Lee; Seunghyun Yoon; Kyomin Jung

arXiv:2604.12357·cs.AI·April 15, 2026

TL;DR

ReflectCAP introduces a multi-agent approach that guides large vision-language models to generate more factual and comprehensive image captions, balancing quality and computational efficiency.

Contribution

The paper proposes ReflectCAP, a method that uses Structured Reflection Notes to improve both the factuality and the coverage of detailed captions generated by large vision-language models (LVLMs).

Findings

01

ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage across 8 LVLMs.

02

It outperforms existing methods on the CapArena-Auto benchmark.

03

It offers better quality-cost trade-offs than model scaling and multi-agent pipelines.
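
To make the Pareto claim in the first finding concrete, here is a minimal sketch of how a Pareto frontier over (factuality, coverage) scores can be computed. The `Score` type, the `dominates` and `pareto_frontier` helpers, and the toy values are all assumptions made for illustration; they are not the paper's evaluation code or its reported numbers.

```python
from typing import List, NamedTuple


class Score(NamedTuple):
    """Hypothetical container for one captioning method's two scores."""
    name: str
    factuality: float
    coverage: float


def dominates(a: Score, b: Score) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    at_least_as_good = a.factuality >= b.factuality and a.coverage >= b.coverage
    strictly_better = a.factuality > b.factuality or a.coverage > b.coverage
    return at_least_as_good and strictly_better


def pareto_frontier(scores: List[Score]) -> List[Score]:
    """Keep every method that no other method dominates (input order preserved)."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Toy placeholder values chosen for illustration only; NOT results from the paper.
toy_scores = [
    Score("A", 0.9, 0.4),  # very factual, low coverage
    Score("B", 0.6, 0.8),  # high coverage, less factual
    Score("C", 0.5, 0.5),  # worse than D on both axes
    Score("D", 0.8, 0.7),  # balanced
]

frontier_names = [s.name for s in pareto_frontier(toy_scores)]
print(frontier_names)
```

A method lies on the frontier when no other method matches or beats it on both axes at once; in the toy data, only `C` is dominated.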

Abstract

Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes -- what to avoid and what to attend to -- yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where…
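
As a rough illustration of the inference-time step described in the abstract, the sketch below folds a set of notes into a captioning prompt along the two axes (what to avoid, what to attend to). The `ReflectionNotes` class, its field names, the `build_caption_prompt` function, and the prompt wording are all hypothetical; the abstract does not specify the actual note format or prompt, and no LVLM is called here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReflectionNotes:
    """Hypothetical stand-in for Structured Reflection Notes (format assumed)."""
    avoid: List[str] = field(default_factory=list)   # patterns the model tends to hallucinate
    attend: List[str] = field(default_factory=list)  # details the model tends to overlook


def build_caption_prompt(
    notes: ReflectionNotes,
    base_instruction: str = "Describe the image in detail.",
) -> str:
    """Append both note axes to a base captioning instruction (assumed prompt layout)."""
    lines = [base_instruction]
    if notes.avoid:
        lines.append("Avoid these recurring errors:")
        lines.extend(f"- {item}" for item in notes.avoid)
    if notes.attend:
        lines.append("Pay attention to these often-missed details:")
        lines.extend(f"- {item}" for item in notes.attend)
    return "\n".join(lines)


# Example notes are invented for illustration, not taken from the paper.
example_notes = ReflectionNotes(
    avoid=["Do not state object counts you cannot verify."],
    attend=["Mention visible text and background objects."],
)
prompt = build_caption_prompt(example_notes)
print(prompt)
```

The resulting string would be passed to the captioning model together with the image; because the notes are described as reusable, the same `ReflectionNotes` instance could be applied to every image captioned by a given model.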
