Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Marcella Cornia; Lorenzo Baraldi; Giuseppe Fiameni; Rita Cucchiara

arXiv:2111.12727 · cs.CV · December 1, 2023 · 1 cites

TL;DR

This paper presents a novel image captioning method that leverages semantics and style separation using style tokens and keywords, trained on multi-source datasets to generate high-quality, style-appropriate captions without object detectors.

Contribution

The proposed model uniquely separates semantics and style, effectively combining noisy web data with clean human annotations, and outperforms existing methods on multiple datasets.

Findings

01 Model recognizes real-world concepts effectively
02 Produces high-quality, style-consistent captions
03 Outperforms state-of-the-art approaches on benchmarks

Abstract

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing…
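
As a rough illustration of the idea in the abstract (a style token plus retrieved keywords placed in front of the caption so that a single language-modeling objective can be used), the following toy sketch may help. It is not the authors' implementation: the function names, the `<style:...>` and `<bos>` token strings, the cosine-similarity retrieval, and the three-dimensional embeddings are all hypothetical choices made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_keywords(image_vec, keyword_vecs, k=2):
    """Return the k keywords whose (hypothetical) embeddings are closest to the image embedding."""
    ranked = sorted(
        keyword_vecs,
        key=lambda word: cosine(image_vec, keyword_vecs[word]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(style, keywords):
    """Assemble the prompt: one style token, then the keywords, then a start-of-caption token.

    The token strings are assumptions; the abstract does not specify them.
    """
    return [f"<style:{style}>"] + list(keywords) + ["<bos>"]

# Toy keyword embeddings and a toy image embedding (made-up numbers).
KEYWORD_VECS = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.8, 0.6, 0.0],
    "pizza": [0.0, 0.0, 1.0],
}
image_vec = [1.0, 0.2, 0.0]

keywords = retrieve_keywords(image_vec, KEYWORD_VECS, k=2)
prompt = build_prompt("human", keywords)
print(prompt)
```

In this sketch, switching the style argument (for example from a web-style token to a human-annotation-style token) changes only the first prompt element, while the keywords carry the image semantics; a caption decoder would then be conditioned on the whole prompt.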

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling