FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Hang Hua; Qing Liu; Lingzhi Zhang; Jing Shi; Zhifei Zhang; Yilin Wang; Jianming Zhang; Jiebo Luo

arXiv:2411.15411·cs.CV·November 26, 2024

Open Access · 1 Model · 1 Dataset

TL;DR

FINECAPTION introduces a vision-language model capable of fine-grained, compositional image captioning by recognizing arbitrary masks and processing high-resolution images, supported by a new dataset for multi-grained captioning.

Contribution

The paper presents FINECAPTION, a novel VLM that handles compositional captioning at various granularities and introduces COMPOSITIONCAP, a dataset for regional compositional captioning.

Findings

01. FINECAPTION outperforms existing VLMs in compositional captioning tasks.

02. Current VLMs show limitations in recognizing diverse visual prompts.

03. The new dataset enables better evaluation of regional compositional understanding.

Abstract

The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs struggle with fine-grained image regional composition information perception. Specifically, they have difficulty accurately aligning the segmentation masks with the corresponding semantics and precisely describing the compositional aspects of the referred regions. However, compositionality - the ability to understand and generate novel combinations of known visual and textual components - is critical for facilitating coherent reasoning and understanding across modalities by VLMs. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as…

Code & Models

Models

hhua2/finecaption
model · ♡ 1

Datasets

hhua2/CompositionCap
dataset · 41 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media