ControlCap: Controllable Region-level Captioning
Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Fang Wan, Qixiang Ye

TL;DR
ControlCap introduces control words and a discriminative module to improve region-level captioning, effectively addressing caption degeneration and enabling more diverse, controllable captions with enhanced generalization.
Contribution
The paper proposes a controllable captioning framework that partitions the caption space into sub-spaces using control words, improving caption diversity and generalization over prior models.
Findings
Significant CIDEr score improvements on Visual Genome and RefCOCOg datasets.
Outperforms state-of-the-art methods in caption diversity and accuracy.
Enables captioning beyond training data with interactive control words.
Abstract
Region-level captioning is challenged by the caption degeneration issue: pre-trained multimodal models tend to predict the most frequent captions and miss the less frequent ones. In this study, we propose a controllable region-level captioning (ControlCap) approach, which introduces control words to a multimodal model to address the caption degeneration issue. Specifically, ControlCap leverages a discriminative module to generate control words within the caption space, partitioning it into multiple sub-spaces. The multimodal model is constrained to generate captions within a few sub-spaces containing the control words, which increases the chance of hitting less frequent captions and alleviates the caption degeneration issue. Furthermore, interactive control words can be given by either a human or an expert model, which enables captioning beyond the training caption…
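The partitioning idea in the abstract can be illustrated with a toy sketch. This is not the ControlCap implementation: the caption counts, the function names, and the rule "pick the most frequent caption" are all assumptions made for illustration, standing in for a degenerate captioner. The sketch only shows how restricting candidates to the sub-space containing the control words lets less frequent captions surface; it cannot produce captions outside its toy candidate set.

```python
from collections import Counter

# Hypothetical caption frequencies for one region; the numbers are invented.
CAPTION_COUNTS = Counter({
    "a man": 50,
    "a man in a red shirt": 5,
    "a man holding an umbrella": 3,
    "a man riding a bicycle": 2,
})


def most_frequent_caption(counts):
    """Stand-in for a degenerate captioner: always return the most frequent caption."""
    return counts.most_common(1)[0][0]


def controlled_caption(counts, control_words):
    """Pick the most frequent caption within the sub-space defined by the control words.

    A caption belongs to the sub-space if it contains every control word.
    If no caption matches, fall back to the unconstrained choice.
    """
    wanted = set(control_words)
    subspace = Counter(
        {caption: n for caption, n in counts.items() if wanted <= set(caption.split())}
    )
    if not subspace:
        return most_frequent_caption(counts)
    return most_frequent_caption(subspace)


if __name__ == "__main__":
    print(most_frequent_caption(CAPTION_COUNTS))             # unconstrained choice
    print(controlled_caption(CAPTION_COUNTS, ["umbrella"]))  # constrained by one control word
    print(controlled_caption(CAPTION_COUNTS, ["red", "shirt"]))
```

In this toy setting the control words play the role that the abstract assigns to the discriminative module or to an interactive user: they narrow the search to a sub-space where a rare caption no longer competes with the dominant one.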
