Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization
Jixuan Leng, Yijiang Li, Haohan Wang

TL;DR
This paper presents SCMD, a method that uses CLIP for domain generalization by distilling selectively on hard-to-learn samples, improving robustness on unseen domains.
Contribution
It introduces a selection framework that identifies hard-to-learn samples for distillation, and a cross-modality module that combines the student's projected features with CLIP text embeddings to align similarity distributions.
Findings
SCMD achieves state-of-the-art performance on multiple benchmarks.
The selection strategy effectively identifies hard-to-learn samples.
Theoretical analysis supports the effectiveness of the selection method.
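The idea of selecting hard-to-learn samples can be illustrated with a small, dependency-free sketch. It assumes, purely for illustration, that difficulty is measured by the student's per-sample cross-entropy loss and that a fixed fraction of each batch is kept; the function names and the `keep_fraction` parameter are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(logits, label):
    """Per-sample cross-entropy computed from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def select_hard_indices(batch_logits, labels, keep_fraction=0.5):
    """Return the indices of the highest-loss samples in a batch.

    Assumption: "hard-to-learn" is approximated by a high student loss.
    """
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    k = max(1, int(round(keep_fraction * len(losses))))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy student logits for four samples over three classes.
    batch_logits = [
        [4.0, 0.0, 0.0],  # confident and correct: low loss
        [0.1, 0.0, 0.2],  # uncertain: high loss
        [0.0, 3.0, 0.0],  # confident and correct: low loss
        [1.0, 1.0, 1.0],  # uniform prediction: high loss
    ]
    labels = [0, 1, 1, 2]
    print(select_hard_indices(batch_logits, labels, keep_fraction=0.5))
```

In this sketch only the selected indices would be passed on to the distillation loss, so the teacher's signal is concentrated on samples the student currently handles poorly.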
Abstract
Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically CLIP, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains. Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module that seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions. We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver…
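The cross-modality module described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the teacher signal is the similarity distribution between a CLIP image feature and the class text embeddings, that the student feature is mapped into the same space by a linear projection, and that the two distributions are aligned with a KL divergence. All names (`projection`, `temperature`, `cross_modality_loss`) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(scores, temperature=1.0):
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(image_feature, text_embeddings, temperature=1.0):
    """Softmax over cosine similarities between one image feature and each class text embedding."""
    f = normalize(image_feature)
    sims = [dot(f, normalize(t)) for t in text_embeddings]
    return softmax(sims, temperature)


def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modality_loss(student_feature, projection, teacher_feature,
                        text_embeddings, temperature=0.1):
    """KL(teacher || student) between similarity distributions over class text embeddings.

    Assumption: `projection` is a matrix (list of rows) mapping the student
    feature into the same embedding space as the text embeddings.
    """
    projected = [dot(row, student_feature) for row in projection]
    p_teacher = similarity_distribution(teacher_feature, text_embeddings, temperature)
    p_student = similarity_distribution(projected, text_embeddings, temperature)
    return kl_divergence(p_teacher, p_student)


if __name__ == "__main__":
    text_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # toy embeddings for two class prompts
    identity = [[1.0, 0.0], [0.0, 1.0]]         # toy projection matrix
    teacher = [0.9, 0.1]
    print(cross_modality_loss([0.9, 0.1], identity, teacher, text_embeddings))
    print(cross_modality_loss([0.1, 0.9], identity, teacher, text_embeddings))
```

The loss is zero when the projected student feature induces the same similarity distribution as the teacher and grows as the two distributions diverge. In a real training loop the projection would be learned jointly with the student.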
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
Methods: Contrastive Language-Image Pre-training
