ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Long Xing; Qidong Huang; Xiaoyi Dong; Pan Zhang; Yuhang Zang; Yuhang Cao; Jinsong Li; Shuangrui Ding; Weiming Zhang; Nenghai Yu; Jiaqi Wang; Feng Wu; Dahua Lin

arXiv:2506.19848·cs.CV·June 25, 2025



TL;DR

ScaleCap is an inference-time scalable image captioning method that reduces bias and hallucination by progressively enriching captions through heuristic questioning and contrastive decoding, improving both accuracy and detail.

Contribution

The paper proposes a scalable debiasing strategy that combines heuristic question answering with contrastive sentence rating to improve caption quality at inference time.

Findings

01 Captions become more accurate and balanced with increased inference budget.

02 ScaleCap improves performance across 11 benchmark datasets.

03 Generated captions show higher fidelity and semantic coverage.

Abstract

This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in two inherent biases of large vision-language models (LVLMs): multimodal bias, which results in imbalanced descriptive granularity, with detailed accounts of some elements and only cursory mentions of others; and linguistic bias, which leads to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy that continuously enriches and calibrates the caption as the inference budget increases. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs…
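As a rough illustration of the heuristic question answering idea described in the abstract, the loop below keeps asking content-specific questions and appending the answers until a budget is spent. This is a minimal sketch and not the authors' implementation: `enrich_caption`, `ask`, `answer`, and the toy fact table are hypothetical names, the budget is assumed to count answered questions, and the toy stand-ins ignore the image that a real LVLM call would take. Contrastive sentence rating is left out because the abstract shown here is truncated before describing it.

```python
from typing import Callable, List

# Hypothetical sketch: the function names and the meaning of "budget"
# (number of answered questions) are assumptions, not taken from the paper.
def enrich_caption(
    initial_caption: str,
    ask: Callable[[str], List[str]],
    answer: Callable[[str], str],
    budget: int,
) -> str:
    """Append answers to follow-up questions until the budget is spent."""
    parts = [initial_caption]
    used = 0
    while used < budget:
        # Ask for questions about details the current caption may be missing.
        questions = ask(" ".join(parts))
        if not questions:
            break  # nothing left to ask
        for question in questions:
            if used >= budget:
                break
            parts.append(answer(question))
            used += 1
    return " ".join(parts)


# Toy stand-ins for model calls (assumed; a real system would query an LVLM
# with the image as well as the text).
TOY_FACTS = {
    "What color is the dog?": "The dog is brown.",
    "Where is the dog?": "The dog is on a beach.",
}

def toy_ask(caption: str) -> List[str]:
    # Only ask about facts that the caption does not already contain.
    return [q for q, a in TOY_FACTS.items() if a not in caption]

def toy_answer(question: str) -> str:
    return TOY_FACTS[question]


if __name__ == "__main__":
    for budget in (0, 1, 5):
        print(budget, "->", enrich_caption("A dog.", toy_ask, toy_answer, budget))
```

With the toy stand-ins, a larger budget yields a longer caption until the fact table is exhausted, which mirrors the "more detail with more inference budget" behavior the abstract claims, at the level of control flow only.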


Code & Models

Datasets

long-xing1/ScaleCap-450k
dataset · 139 dl


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization