CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation
Kalliopi Basioti, Mohamed A. Abdelsalam, Federico Fancellu, Vladimir Pavlovic, Afsaneh Fazly

TL;DR
This paper introduces CIC-BART-SSA, a controllable image captioning model trained with Structured Semantic Augmentation (SSA), which uses Abstract Meaning Representation (AMR) to improve the diversity, focus, and controllability of generated captions.
Contribution
The paper proposes a new SSA framework using AMR for dataset augmentation and a tailored CIC-BART-SSA model that enhances controllability and diversity in image captioning.
Findings
CIC-BART-SSA outperforms state-of-the-art models in diversity and text quality.
The SSA framework increases dataset semantic and spatial diversity.
The model effectively generalizes to highly focused captioning scenarios.
Abstract
Controllable Image Captioning (CIC) aims at generating natural language descriptions for an image, conditioned on information provided by end users, e.g., regions, entities or events of interest. However, available image-language datasets mainly contain captions that describe the entirety of an image, making them ineffective for training CIC models that can potentially attend to any subset of regions or relationships. To tackle this challenge, we propose a novel, fully automatic method to sample additional focused and visually grounded captions using a unified structured semantic representation built on top of the existing set of captions associated with an image. We leverage Abstract Meaning Representation (AMR), a cross-lingual graph-based semantic formalism, to encode all possible spatio-semantic relations between entities, beyond the typical spatial-relations-only focus of current…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Sparse Evolutionary Training · Focus
