COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Sanghwan Kim; Rui Xiao; Mariana-Iuliana Georgescu; Stephan Alaniz; Zeynep Akata

arXiv:2412.01814·cs.CV·March 27, 2025

TL;DR

COSMOS introduces a novel self-distillation framework with cross-attention and multi-modal augmentations, improving vision-language pre-training by capturing comprehensive image and text information for better downstream task performance.

Contribution

The paper proposes COSMOS, a self-distillation method that combines cross-attention with multi-modal augmentations to address the limitations of purely contrastive vision-language pre-training.

Findings

01

Outperforms previous baselines on zero-shot retrieval, classification, and segmentation tasks.

02

Surpasses CLIP models trained on larger datasets in visual perception and understanding.

03

Integrating cross-attention with multi-modal views improves cross-modal representations.

Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS…
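The abstract does not spell out the form of the self-distillation loss. As a rough illustration only, the sketch below shows a generic teacher-student distillation objective of the kind commonly used in self-supervised learning: cross-entropy between a sharpened teacher distribution (computed from a global view) and a student distribution (computed from a local view). The function names, the temperature values, and the toy logits are all assumptions made for this example; none of them come from the paper or its repository.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student) between two softmax distributions.

    Hypothetical helper: the temperatures are illustrative defaults,
    not values taken from the COSMOS paper.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


# Toy example: the teacher sees a global view, the student a local view.
teacher_out = [2.0, 0.5, -1.0]   # assumed teacher logits
aligned = [2.0, 0.5, -1.0]       # student agrees with the teacher
misaligned = [-1.0, 0.5, 2.0]    # student disagrees with the teacher

loss_aligned = distillation_loss(aligned, teacher_out)
loss_misaligned = distillation_loss(misaligned, teacher_out)
print(loss_aligned < loss_misaligned)
```

In this kind of setup the teacher output is typically treated as a fixed target (no gradient flows through it), so minimizing the loss pulls the student's local-view prediction toward the teacher's global-view prediction. How COSMOS routes these predictions through its cross-attention module is described in the paper itself, not in this sketch.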

Code & Models

Repositories

ExplainableML/cosmos
PyTorch · Official

Models

🤗
sankim2/cosmos
model · 13 dl · ♡ 2

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Softmax · Concatenated Skip Connection · Focus