Image Embedding Sampling Method for Diverse Captioning
Sania Waheed, Na Min An

TL;DR
This paper presents a training-free image captioning method that improves caption diversity and detail by attending to distinct image regions with a small VLM (BLIP), reaching performance comparable to larger models without extra training.
Contribution
It introduces a training-free framework that uses structured segmentation to make the captions of small VLMs more diverse and more detailed.
Findings
Achieved high Div-2 scores on MSCOCO, Flickr30k, and Nocaps datasets.
Maintained strong image-caption relevancy and semantic integrity.
Enabled smaller VLMs to reach performance comparable to larger models without additional training.
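The Div-2 score in the findings above is a diversity metric over the set of captions generated for one image. The sketch below assumes one common definition (number of distinct bigrams divided by the total number of words across the caption set) and plain whitespace tokenization; the paper's exact tokenization and normalization are not given on this page, and the function name `distinct_n` is chosen here for illustration.

```python
from typing import Iterable


def distinct_n(captions: Iterable[str], n: int = 2) -> float:
    """Distinct n-grams divided by total words over a set of captions.

    Assumption: lowercase whitespace tokenization and normalization by
    word count; other variants normalize by the total number of n-grams.
    """
    ngrams = set()
    total_words = 0
    for caption in captions:
        tokens = caption.lower().split()
        total_words += len(tokens)
        # Collect every contiguous n-gram of this caption.
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    # An empty caption set has no diversity to measure.
    return len(ngrams) / total_words if total_words else 0.0


if __name__ == "__main__":
    varied = ["a dog runs", "a dog sits"]
    repeated = ["a dog runs", "a dog runs"]
    print(distinct_n(varied), distinct_n(repeated))
```

Under this definition, a caption set that repeats the same sentence scores lower than one whose captions use different word pairs, which is why the metric rewards describing different image regions.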
Abstract
Image Captioning for state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making them less accessible for resource-constrained applications such as mobile devices and assistive technologies. Alternatively, comparably smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparably small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance…
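To make the idea of combining global and localized views concrete, here is a minimal sketch. It is not the authors' pipeline: a fixed grid pyramid stands in for the structured segmentation, and `captioner` is a hypothetical callable that would wrap a small VLM such as BLIP. All function names are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def hierarchical_regions(width: int, height: int, levels: int = 2) -> List[Box]:
    """Return boxes for a simple region pyramid.

    Level 0 is the whole image (global view); level k splits the image
    into a 2^k x 2^k grid (localized views). Assumption: a grid is only
    a placeholder for a real segmentation step.
    """
    boxes: List[Box] = []
    for level in range(levels):
        n = 2 ** level
        for row in range(n):
            for col in range(n):
                boxes.append((
                    col * width // n,
                    row * height // n,
                    (col + 1) * width // n,
                    (row + 1) * height // n,
                ))
    return boxes


def caption_regions(
    width: int,
    height: int,
    captioner: Callable[[Box], str],  # hypothetical wrapper around a VLM
    levels: int = 2,
) -> List[str]:
    """Caption every region and keep each distinct caption once, in order."""
    captions: List[str] = []
    for box in hierarchical_regions(width, height, levels):
        text = captioner(box)
        if text not in captions:
            captions.append(text)
    return captions


if __name__ == "__main__":
    # Stub captioner: describes the box instead of calling a model.
    stub = lambda box: f"region {box}"
    for line in caption_regions(640, 480, stub):
        print(line)
```

With `levels=2` this yields one global region plus four quadrants, so a single image produces up to five captions that can differ in which part of the scene they describe.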
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques
