ReflectCAP: Detailed Image Captioning with Reflective Memory
Kyungmin Min, Minbeom Kim, Kang-il Lee, Seunghyun Yoon, Kyomin Jung

TL;DR
ReflectCAP introduces a multi-agent approach that guides large vision-language models to generate more factual and comprehensive image captions, balancing quality and computational efficiency.
Contribution
The paper proposes ReflectCAP, a method that uses Structured Reflection Notes to improve both factuality and coverage in detailed image captioning with large vision-language models (LVLMs).
Findings
ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage.
It outperforms existing methods on CapArena-Auto.
It offers better quality-cost trade-offs than model scaling and other multi-agent pipelines.
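To make the Pareto claim concrete, here is a minimal sketch of how one might check which methods lie on a factuality-coverage Pareto frontier. The method names and scores are made up for illustration and are not results from the paper; the function `pareto_front` is a hypothetical helper, not part of any released code.

```python
from typing import Dict, List, Tuple


def pareto_front(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of methods that are not dominated.

    Each value is a (factuality, coverage) pair where higher is better.
    A method is dominated if another method is at least as good on both
    axes and strictly better on at least one.
    """
    front = []
    for name, (fact, cov) in scores.items():
        dominated = any(
            (f2 >= fact and c2 >= cov) and (f2 > fact or c2 > cov)
            for other, (f2, c2) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Illustrative numbers only (assumption): not taken from the paper.
example_scores = {
    "method_a": (0.80, 0.60),
    "method_b": (0.70, 0.75),
    "method_c": (0.65, 0.55),
    "method_d": (0.78, 0.76),
}

front = pareto_front(example_scores)
print(front)
```

In this toy data, `method_b` is dominated by `method_d` and `method_c` by `method_a`, so only `method_a` and `method_d` remain on the frontier.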
Abstract
Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes -- what to avoid and what to attend to -- yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where…
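The abstract describes notes that steer the captioner along two axes, what to avoid and what to attend to. The sketch below shows one way such notes could be represented and folded into a captioning prompt. The class `ReflectionNotes`, its field names, the function `build_caption_prompt`, and the prompt wording are all assumptions made for illustration; the paper's actual note format and prompts are not given in this summary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReflectionNotes:
    """Hypothetical container for reusable guidelines (names are assumptions)."""

    # Patterns the target model tends to hallucinate.
    avoid: List[str] = field(default_factory=list)
    # Details the target model tends to overlook.
    attend: List[str] = field(default_factory=list)


def build_caption_prompt(
    notes: ReflectionNotes,
    base_instruction: str = "Describe the image in detail.",
) -> str:
    """Prepend the notes to a base captioning instruction as plain text."""
    lines = [base_instruction]
    if notes.avoid:
        lines.append("Avoid the following known mistakes:")
        lines.extend(f"- {item}" for item in notes.avoid)
    if notes.attend:
        lines.append("Pay particular attention to:")
        lines.extend(f"- {item}" for item in notes.attend)
    return "\n".join(lines)


# Example notes are invented for illustration only.
notes = ReflectionNotes(
    avoid=["Do not state object counts unless they are clearly visible."],
    attend=["Mention any visible text, such as signs or labels."],
)
prompt = build_caption_prompt(notes)
print(prompt)
```

The resulting string would be sent to the captioning LVLM together with the image; how the notes are produced by the multi-agent pipeline is outside the scope of this sketch.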
