C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap
William Theisen, Walter Scheirer

TL;DR
This paper introduces C-CLIP, a contrastive image-text encoder trained on commentative pairs, significantly improving social media content understanding and retrieval, especially in non-English languages.
Contribution
The paper presents a novel training approach for CLIP models using commentative pairs, addressing the gap in social media content understanding and enhancing multilingual retrieval performance.
Findings
Large improvements in retrieval accuracy on social media data
Effective performance across multiple non-English languages
Demonstrated benefits for social media analysis and OSINT applications
Abstract
The interplay between the image and comment on a social media post is of high importance for understanding its overall message. Recent strides in multimodal embedding models, namely CLIP, have provided an avenue forward in relating image and text. However, the current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language. Current CLIP training data is based on what we call "descriptive" text: text in which an image is merely described. This is something rarely seen on social media, where the vast majority of text content is "commentative" in nature. The captions provide commentary and broader context related to the image, rather than describing what is in it. Current CLIP models perform poorly on retrieval tasks where image-caption pairs display a commentative relationship. Closing this gap would be beneficial…
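The retrieval task mentioned in the abstract can be pictured with a small sketch. The text above does not state the paper's evaluation protocol, so what follows is only an assumption: it uses recall@k, a common image-text retrieval metric, on made-up two-dimensional embeddings. The names `cosine`, `recall_at_k`, `image_embs`, and `text_embs` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(image_embs, text_embs, k):
    """Fraction of captions whose paired image (same index) ranks in the top k."""
    hits = 0
    for i, text in enumerate(text_embs):
        # Score this caption against every image, then rank images by score.
        scores = [cosine(text, img) for img in image_embs]
        ranked = sorted(range(len(image_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy, made-up embeddings (assumption: real ones would come from an encoder).
# The third caption sits closer to the first image than to its own image,
# loosely mimicking a caption that comments on an image instead of describing it.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

recall_1 = recall_at_k(image_embs, text_embs, 1)
recall_2 = recall_at_k(image_embs, text_embs, 2)
```

On this toy data the third caption retrieves the wrong image first, so recall@1 is lower than recall@2. An encoder that places commentative captions nearer to their own images would raise such scores.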
Videos
C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap (YouTube)
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
