Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization
Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang

TL;DR
This paper introduces OGEN, a method that combines class-conditional feature generation with self-distillation to improve out-of-distribution (OOD) generalization when finetuning vision-language models, countering their tendency to overfit the known classes.
Contribution
The paper proposes OGEN, a new approach combining feature synthesis and adaptive self-distillation to enhance OOD generalization during vision-language model finetuning.
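To make the self-distillation idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary above does not say how the teacher is formed or what makes the distillation adaptive, so the exponential-moving-average (EMA) teacher, the KL-divergence penalty, and all names (`ema_update`, `distillation_loss`, `momentum`) are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def ema_update(teacher, student, momentum=0.99):
    """Assumed teacher rule: move each teacher weight slightly toward the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Penalize the student for drifting away from the teacher's predictions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)

# Toy weights: the teacher trails the student after one update.
teacher_w = ema_update([0.0, 0.0], [1.0, -1.0], momentum=0.9)

# Identical predictions incur no penalty; diverging predictions do.
loss_same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
loss_diff = distillation_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

In a training loop, a term like `distillation_loss` would be added to the task loss so that the finetuned model stays close to a slower-moving copy of itself, which is one common way to limit overfitting to the classes seen during finetuning.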
Findings
OGEN improves OOD generalization performance across various settings.
The method effectively prevents overfitting to known classes.
Synthesized features aid in regularizing decision boundaries.
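The last finding can be illustrated with a small sketch of how a synthesized feature for an unknown class might enter the training loss. Everything here is hypothetical: `synthesize_feature` is a hand-written stand-in for a learned generator (it blends the unknown class's text embedding with the mean of some known-class image features), the three-dimensional embeddings are toy values, and the cosine-similarity classifier with a fixed `scale` is only assumed to resemble how vision-language models score classes.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_logits(feature, class_embeddings, scale=10.0):
    """Scaled cosine similarity between one feature and every class embedding."""
    f = normalize(feature)
    return [scale * sum(a * b for a, b in zip(f, normalize(c)))
            for c in class_embeddings]

def cross_entropy(logits, label):
    """Negative log-probability of the target class under a softmax."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def synthesize_feature(text_embedding, neighbor_features, mix=0.5):
    """Hypothetical stand-in for a learned class-conditional generator."""
    dim = len(text_embedding)
    mean_neighbor = [sum(f[i] for f in neighbor_features) / len(neighbor_features)
                     for i in range(dim)]
    blended = [mix * t + (1.0 - mix) * n
               for t, n in zip(normalize(text_embedding), normalize(mean_neighbor))]
    return normalize(blended)

class_embeddings = [
    [1.0, 0.0, 0.0],  # known class 0
    [0.0, 1.0, 0.0],  # known class 1
    [0.0, 0.0, 1.0],  # unknown class 2: only its name (text embedding) is available
]
known_features = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1]]  # toy image features

# Synthesize a feature for the unknown class and score it against all classes.
fake = synthesize_feature(class_embeddings[2], known_features, mix=0.7)
logits = cosine_logits(fake, class_embeddings)
ood_loss = cross_entropy(logits, label=2)  # extra loss term for the unknown class
predicted = logits.index(max(logits))
```

Adding a term like `ood_loss` to the objective gives the classifier training signal in regions of feature space that the known-class data never covers, which is the sense in which synthesized features can regularize decision boundaries.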
Abstract
Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, after long enough finetuning but without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. Then we propose a novel approach OGEN to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature…
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Focus
