The Devil is in the EOS: Sequence Training for Detailed Image Captioning
Abdelrahman Mohamed, Yova Kementchedjhieva

TL;DR
This paper identifies a bias towards early termination (premature prediction of the end-of-sequence token) in image captioning models and proposes an unsupervised debiasing method that yields longer, more detailed captions without supervised data or complex reward functions.
Contribution
The authors introduce a simple, unsupervised approach that reduces EOS bias in pretrained vision-language models (VLMs), improving caption detail without complex reward functions.
Findings
Caption length and detail increase across three benchmarks.
EOS bias is effectively reduced in multiple VLMs.
Longer captions come with a trade-off: hallucinations also increase.
Abstract
Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model's tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through…
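To make the notion of EOS bias concrete, the toy sketch below shows how lowering the EOS logit delays termination under greedy decoding. It is only an illustration: the paper's method is a training-time debiasing procedure whose details are not given in this summary, whereas this sketch applies a decode-time penalty. The `toy_logits` function, the four-token vocabulary, the `EOS` id, and the `eos_penalty` parameter are all hypothetical stand-ins, not part of the authors' work.

```python
EOS = 0  # hypothetical token id for end-of-sequence


def toy_logits(step):
    """Hypothetical stand-in for a captioning model.

    Returns logits over a 4-token vocabulary. The EOS logit grows with
    the step index, mimicking a model that is increasingly eager to stop.
    """
    return [0.6 * step, 1.0, 0.8, 0.6]


def greedy_decode(eos_penalty=0.0, max_len=20):
    """Greedy decoding with a constant penalty subtracted from the EOS logit."""
    tokens = []
    for step in range(max_len):
        logits = toy_logits(step)
        logits[EOS] -= eos_penalty  # illustrative decode-time debiasing
        next_token = max(range(len(logits)), key=logits.__getitem__)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens


baseline = greedy_decode(eos_penalty=0.0)
debiased = greedy_decode(eos_penalty=2.5)
print(len(baseline), len(debiased))
```

In this toy setting the penalized run emits more tokens before stopping. A real model would also need a check on whether the extra tokens are grounded in the image, since the findings above report more hallucinations alongside longer captions.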
