FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Zhifei Zhang, Yilin Wang, Jianming Zhang, Jiebo Luo

TL;DR
FINECAPTION is a vision-language model for fine-grained, compositional image captioning. It recognizes arbitrary masks and processes high-resolution images, and it is supported by a new dataset for multi-grained captioning.
Contribution
The paper presents FINECAPTION, a novel VLM that handles compositional captioning at various granularities and introduces COMPOSITIONCAP, a dataset for regional compositional captioning.
Findings
FINECAPTION outperforms existing VLMs in compositional captioning tasks.
Current VLMs show limitations in recognizing diverse visual prompts.
The new dataset enables better evaluation of regional compositional understanding.
Abstract
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs struggle with fine-grained image regional composition information perception. Specifically, they have difficulty accurately aligning the segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. However, compositionality - the ability to understand and generate novel combinations of known visual and textual components - is critical for facilitating coherent reasoning and understanding across modalities by VLMs. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as…
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
