COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata

TL;DR
COSMOS introduces a novel self-distillation framework with cross-attention and multi-modal augmentations, improving vision-language pre-training by capturing comprehensive image and text information for better downstream task performance.
Contribution
The paper proposes COSMOS, a self-distillation method with cross-attention and multi-modal augmentations, enhancing vision-language models beyond contrastive loss limitations.
Findings
Outperforms previous baselines on zero-shot retrieval, classification, and segmentation tasks.
Surpasses CLIP models trained on larger datasets on visual perception and understanding tasks.
Effective integration of cross-attention and multi-modal views improves cross-modal representations.
Abstract
Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS…
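As a rough illustration of the ideas named in the abstract (local and global views, a cross-attention module, and a self-distillation loss), the following is a minimal NumPy sketch. It is not the authors' implementation. The student/teacher split, the temperature values, the mean pooling, the projection-free single-head attention, and all function names are assumptions chosen for illustration, borrowed from common self-distillation setups rather than from the paper.

```python
import numpy as np

def log_softmax(x, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned projections, so tokens from one modality
    (queries) simply attend over tokens from the other (keys_values).
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(log_softmax(scores))
    return weights @ keys_values

def self_distillation_loss(student_logits, teacher_logits,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student distributions.

    Assumption: the teacher output is treated as a fixed target and
    uses a sharper (lower) temperature than the student.
    """
    p_teacher = np.exp(log_softmax(teacher_logits, t_teacher))
    log_p_student = log_softmax(student_logits, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Hypothetical token embeddings: a local image crop (student input),
# a global image view (teacher input), and a caption.
rng = np.random.default_rng(0)
local_image_tokens = rng.normal(size=(4, 8))
global_image_tokens = rng.normal(size=(6, 8))
text_tokens = rng.normal(size=(5, 8))

# Image tokens attend to text tokens, then are mean-pooled to one vector.
student_out = cross_attention(local_image_tokens, text_tokens).mean(axis=0)
teacher_out = cross_attention(global_image_tokens, text_tokens).mean(axis=0)

loss = self_distillation_loss(student_out, teacher_out)
print(f"cross-modal self-distillation loss (sketch): {loss:.4f}")
```

In this sketch the student sees only a local crop and is pushed toward the teacher's output on the global view, which is one plausible reading of how local and global views could interact under a distillation objective.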
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Softmax · Concatenated Skip Connection · Focus
