Human-like Controllable Image Captioning with Verb-specific Semantic Roles
Long Chen, Zhihong Jiang, Jun Xiao, Wei Liu

TL;DR
This paper introduces Verb-specific Semantic Roles as a new control signal for Controllable Image Captioning, enabling more human-like, event-compatible, and sample-suitable caption generation with improved controllability and diversity.
Contribution
The paper proposes VSR as a novel control signal for CIC, together with a framework built from three components: a grounded semantic role labeling model, a semantic structure planner, and a role-shift captioning model. Together these are intended to improve controllability and diversity.
Findings
Outperforms strong baselines on CIC benchmarks
Achieves better controllability and diversity in generated captions
Enables multi-level diverse caption generation
Abstract
Controllable Image Captioning (CIC) -- generating image descriptions following designated control signals -- has received unprecedented attention over the last few years. To emulate the human ability in controlling caption generation, current CIC studies focus exclusively on control signals concerning objective properties, such as contents of interest or descriptive patterns. However, we argue that almost all existing objective control signals have overlooked two indispensable characteristics of an ideal control signal: 1) Event-compatible: all visual contents referred to in a single sentence should be compatible with the described activity. 2) Sample-suitable: the control signals should be suitable for a specific image sample. To this end, we propose a new control signal for CIC: Verb-specific Semantic Roles (VSR). VSR consists of a verb and some semantic roles, which represents a…
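To make the control signal concrete: the abstract describes a VSR as a verb plus some semantic roles. The following is a minimal sketch of such a structure, not the authors' implementation. The class name `VSR`, the role labels (`agent`, `patient`, `location`), the `VERB_FRAMES` table, and the `roles_allowed_for_verb` check are all hypothetical and chosen only for illustration. The check is a toy stand-in for the idea that requested roles should fit the described activity.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class VSR:
    """Hypothetical container for a control signal: one verb plus its roles."""
    verb: str
    roles: Tuple[str, ...]


# Toy table of roles each verb may take. Invented for illustration only;
# the paper does not define this table.
VERB_FRAMES: Dict[str, Set[str]] = {
    "ride": {"agent", "patient", "location"},
    "sit": {"agent", "location"},
}


def roles_allowed_for_verb(vsr: VSR, frames: Dict[str, Set[str]]) -> bool:
    """Return True if every requested role appears in the verb's toy frame."""
    allowed = frames.get(vsr.verb, set())
    return all(role in allowed for role in vsr.roles)


signal = VSR(verb="ride", roles=("agent", "patient", "location"))
mismatch = VSR(verb="sit", roles=("agent", "patient"))

print(roles_allowed_for_verb(signal, VERB_FRAMES))
print(roles_allowed_for_verb(mismatch, VERB_FRAMES))
```

In this sketch the second signal is rejected because the toy frame for "sit" has no `patient` role. Whether a signal suits a particular image would additionally depend on what the image contains, which this sketch does not model.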
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
