ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin

TL;DR
ScaleCap introduces an inference-time scalable image captioning method that reduces biases and hallucinations by progressively enriching captions through heuristic questioning and contrastive decoding, improving accuracy and detail.
Contribution
It proposes a novel scalable debiasing strategy with heuristic question answering and contrastive sentence rating to enhance caption quality during inference.
Findings
Captions become more accurate and balanced with increased inference budget.
ScaleCap improves performance across 11 benchmark datasets.
Generated captions show higher fidelity and semantic coverage.
Abstract
This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of large vision-language models (LVLMs): multimodal bias, which results in imbalanced descriptive granularity, with detailed accounts of some elements and only a skim over others; and linguistic bias, which leads to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs…
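As a rough illustration of the loop the abstract describes, the sketch below grows a caption over a fixed number of rounds: each round asks a question about the current caption, answers it, and keeps the answer only if a rating function scores it highly enough. Everything here is hypothetical and is not the authors' implementation: the names `scale_caption`, `ask`, `answer`, `rate`, `budget`, and `threshold`, and the toy stand-ins that replace real model calls. In particular, `rate` is left as an opaque scoring function and the threshold rule is an assumption, because the abstract's description of contrastive sentence rating is cut off above.

```python
from typing import Callable, List


def scale_caption(
    initial_caption: str,
    ask: Callable[[str], List[str]],   # hypothetical: caption -> follow-up questions
    answer: Callable[[str], str],      # hypothetical: question -> candidate sentence
    rate: Callable[[str], float],      # hypothetical: sentence -> trust score
    budget: int = 3,                   # number of enrichment rounds (inference budget)
    threshold: float = 0.5,            # assumed acceptance cutoff, not from the paper
) -> str:
    """Enrich a caption round by round, keeping only well-rated sentences."""
    sentences = [initial_caption]
    for _ in range(budget):
        questions = ask(" ".join(sentences))
        if not questions:
            break  # nothing left to ask about
        for question in questions:
            candidate = answer(question)
            # Skip duplicates and sentences the rater does not trust.
            if candidate not in sentences and rate(candidate) >= threshold:
                sentences.append(candidate)
    return " ".join(sentences)


# Toy stand-ins for the model calls, for illustration only.
FACTS = {
    "dog": "The dog is brown.",
    "ball": "The ball is red.",
    "unicorn": "A unicorn stands nearby.",  # plays the role of a hallucination
}
SCORES = {
    "The dog is brown.": 0.9,
    "The ball is red.": 0.8,
    "A unicorn stands nearby.": 0.1,
}


def toy_ask(caption: str) -> List[str]:
    # Ask about the first item whose sentence is not yet in the caption.
    for key, fact in FACTS.items():
        if fact not in caption:
            return [key]
    return []


def toy_answer(question: str) -> str:
    return FACTS[question]


def toy_rate(sentence: str) -> float:
    return SCORES[sentence]


if __name__ == "__main__":
    for rounds in (0, 1, 2, 5):
        print(rounds, scale_caption("A dog plays with a ball.",
                                    toy_ask, toy_answer, toy_rate, budget=rounds))
```

In this toy setup, a larger `budget` adds more accepted sentences until only the low-scoring sentence remains, which the rater keeps rejecting. That mirrors, in a very simplified way, the idea of enriching and calibrating the caption as the inference budget grows.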
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
