Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara

TL;DR
This paper presents an image captioning method that separates semantics from descriptive style using a style token and retrieved keywords. It is trained on multi-source datasets and generates high-quality, style-appropriate captions without relying on object detectors.
Contribution
The proposed model separates semantics from descriptive style, which allows it to combine noisy web-collected data with clean human annotations. The authors report that it outperforms existing methods on multiple datasets.
Findings
Model recognizes real-world concepts effectively
Produces high-quality, style-consistent captions
Outperforms state-of-the-art approaches on benchmarks
Abstract
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing…
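The abstract describes conditioning the caption generator on a style token plus keywords obtained through a retrieval component. The following is a minimal sketch of that idea, not the authors' implementation: it assumes keywords are ranked by cosine similarity between an image embedding and keyword embeddings, and that the prompt is a token sequence beginning with a style marker. The function names, the token formats `<style:...>` and `<bos>`, and the toy embeddings are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_keywords(
    image_emb: Sequence[float],
    keyword_embs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    """Return the k keywords whose embeddings are most similar to the image.

    Assumption: image and keyword embeddings share one vector space.
    """
    ranked = sorted(
        keyword_embs,
        key=lambda word: cosine(image_emb, keyword_embs[word]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(style: str, keywords: List[str]) -> List[str]:
    """Assemble a prompt: style token, then keywords, then a start marker.

    The token formats here are hypothetical placeholders.
    """
    return [f"<style:{style}>"] + list(keywords) + ["<bos>"]


# Toy example with made-up 2-D embeddings.
image_emb = [1.0, 0.0]
keyword_embs = {
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "grass": [0.7, 0.7],
}
keywords = retrieve_keywords(image_emb, keyword_embs, k=2)
prompt = build_prompt("human", keywords)
```

In this sketch, switching the style argument at inference time (for example, always requesting the human-annotated style) is what would let a model trained on mixed sources produce captions in the cleaner style, while the keywords carry the semantic content.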
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
