C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap

William Theisen; Walter Scheirer

arXiv:2309.03921·cs.CV·September 11, 2023·1 cites

TL;DR

This paper introduces C-CLIP, a contrastive image-text encoder trained on commentative pairs, significantly improving social media content understanding and retrieval, especially in non-English languages.

Contribution

The paper presents a novel training approach for CLIP models using commentative pairs, addressing the gap in social media content understanding and enhancing multilingual retrieval performance.

Findings

01. Large improvements in retrieval accuracy on social media data

02. Effective performance across multiple non-English languages

03. Demonstrated benefits for social media analysis and OSINT applications

Abstract

The interplay between the image and the comment on a social media post is highly important for understanding its overall message. Recent strides in multimodal embedding models, namely CLIP, have provided an avenue forward in relating image and text. However, the current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language. Current CLIP training data is based on what we call "descriptive" text: text in which an image is merely described. This is rarely seen on social media, where the vast majority of text content is "commentative" in nature: the captions provide commentary and broader context related to the image, rather than describing what is in it. Current CLIP models perform poorly on retrieval tasks where image-caption pairs display a commentative relationship. Closing this gap would be beneficial…
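To make the retrieval task mentioned in the abstract concrete, the sketch below shows one common way image-to-caption retrieval is scored: rank all captions by cosine similarity to each image embedding and count how often the paired caption lands in the top k. This is an illustration only. The metric choice (recall@k), the function names, and the toy 2-D embeddings are assumptions, not details taken from the paper, and real embeddings would come from a CLIP-style encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(image_embs, text_embs, k):
    """Fraction of images whose paired caption (same index) ranks in the top k."""
    hits = 0
    for i, img in enumerate(image_embs):
        scores = [cosine(img, txt) for txt in text_embs]
        # Caption indices sorted from most to least similar to this image.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)


# Hypothetical toy embeddings: caption 1 sits closer to image 2 than to its
# own image 1, loosely mimicking a caption that comments rather than describes.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.8, 0.6], [0.5, 0.6]]

r1 = recall_at_k(image_embs, text_embs, 1)
r2 = recall_at_k(image_embs, text_embs, 2)
print(f"recall@1 = {r1:.3f}, recall@2 = {r2:.3f}")
```

In this toy setup, two of the three images retrieve their own caption first, and the remaining one finds it at rank two. A model trained mostly on descriptive captions would be expected to show more such misses on commentative pairs, which is the gap the abstract describes.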

Videos

C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap (YouTube)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training