Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

Shunqi Mao; Chaoyi Zhang; Hang Su; Hwanjun Song; Igor Shalyminov; Weidong Cai

arXiv:2407.11449·cs.CV·July 17, 2024

TL;DR

This paper introduces Controllable Contextualized Image Captioning (Ctrl-CIC), enabling user-directed, focused image captions through novel prompting and recalibration methods, evaluated with GPT-4V and standard metrics.

Contribution

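The paper presents two approaches, the Prompting-based Controller (P-Ctrl) and the Recalibration-based Controller (R-Ctrl), for generating user-controllable, highlight-focused image captions in the Contextualized Image Captioning (CIC) domain.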

Findings

01. Effective control over caption focus demonstrated

02. GPT-4V evaluator aligns well with human judgment

03. New direction for user-adaptive image captioning

Abstract

Contextualized Image Captioning (CIC) evolves traditional image captioning into a more complex domain, necessitating the ability for multimodal reasoning. It aims to generate image captions given specific contextual information. This paper further introduces a novel domain of Controllable Contextualized Image Captioning (Ctrl-CIC). Unlike CIC, which solely relies on broad context, Ctrl-CIC accentuates a user-defined highlight, compelling the model to tailor captions that resonate with the highlighted aspects of the context. We present two approaches, Prompting-based Controller (P-Ctrl) and Recalibration-based Controller (R-Ctrl), to generate focused captions. P-Ctrl conditions the model generation on highlight by prepending captions with highlight-driven prefixes, whereas R-Ctrl tunes the model to selectively recalibrate the encoder embeddings for highlighted tokens. Additionally, we…
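The two control mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the ` | ` separator, and the fixed scalar `weight` are all hypothetical stand-ins chosen for illustration. In particular, the abstract says R-Ctrl tunes the model to recalibrate embeddings, so a constant multiplier is only a simplification of that idea.

```python
from typing import List


def build_prefixed_target(highlight_tokens: List[str], caption: str,
                          sep: str = " | ") -> str:
    """P-Ctrl-style idea: prepend a highlight-driven prefix to the caption.

    The prefix format and separator are assumptions, not the paper's format.
    """
    prefix = " ".join(highlight_tokens)
    return f"{prefix}{sep}{caption}"


def recalibrate(embeddings: List[List[float]], highlight_mask: List[bool],
                weight: float = 2.0) -> List[List[float]]:
    """R-Ctrl-style idea: rescale encoder embeddings at highlighted positions.

    A fixed scalar `weight` is an assumption standing in for whatever
    recalibration the tuned model actually applies.
    """
    if len(embeddings) != len(highlight_mask):
        raise ValueError("embeddings and highlight_mask must have equal length")
    return [
        [x * weight for x in vec] if is_highlighted else list(vec)
        for vec, is_highlighted in zip(embeddings, highlight_mask)
    ]


if __name__ == "__main__":
    # Toy inputs: two context tokens, the second one highlighted by the user.
    print(build_prefixed_target(["Eiffel", "Tower"], "A view of the tower at dusk."))
    print(recalibrate([[1.0, 2.0], [3.0, 4.0]], [False, True]))
```

In this sketch the prompting route changes only the text the decoder is conditioned on, while the recalibration route leaves the text alone and changes the encoder-side representation of the highlighted tokens.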

Code & Models

Repositories

shunqim/ctrl-cic (PyTorch, official)

Models

Shunqi/Ctrl-CIC (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques