CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

Linli Yao; Weijing Chen; Qin Jin

arXiv:2211.09371·cs.CV·March 21, 2023

TL;DR

CapEnrich leverages cross-modal pre-trained models to automatically enrich web image descriptions with more semantic details, improving their diversity and informativeness without additional annotations.

Contribution

This paper introduces a plug-and-play framework that uses prompting strategies with VLP models to enhance image captions with richer semantics, requiring only lightweight tuning.

Findings

01. Significant improvement in description diversity and detail.
02. Effective use of prompt tuning with VLP models.
03. No additional human annotations needed.

Abstract

Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g., multimodal retrieval and recommendation. However, existing models suffer from the problem of generating "over-generic" descriptions, such as their tendency to generate repetitive sentences with common concepts for different images. These generic descriptions fail to provide sufficient textual semantics for ever-changing web images. Inspired by the recent success of Vision-Language Pre-training (VLP) models that learn diverse image-text concept alignment during pretraining, we explore leveraging their cross-modal pre-trained knowledge to automatically enrich the textual semantics of image descriptions. With no need for additional human annotations, we propose a plug-and-play framework, i.e., CapEnrich, to complement the generic image descriptions with…

Code & Models

Repositories

yaolinli/capenrich (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
