Unlearning the Noisy Correspondence Makes CLIP More Robust
Haochen Han, Alex Jinpeng Wang, Peijun Ye, Fangming Liu

TL;DR
This paper introduces NCU, a fine-tuning method that unlearns noisy correspondences in pre-trained VLMs like CLIP, improving robustness and zero-shot transfer performance efficiently.
Contribution
The paper proposes a novel unlearning framework that directly removes the effects of noisy correspondences already learned by pre-trained models, avoiding the cost of the from-scratch training that previous refinement methods require.
Findings
NCU improves CLIP's robustness across multiple tasks.
NCU outperforms existing robust pre-training methods in zero-shot transfer.
NCU achieves these gains with lower computational overhead.
Abstract
The data appetite of Vision-Language Models (VLMs) has scaled from millions of samples in early work to billions today. This growth comes at the cost of data quality and inevitably introduces Noisy Correspondence (NC) samples. Such semantically unrelated data significantly impairs the performance of VLMs. Previous efforts mainly address this challenge by estimating refined alignment for more precise guidance. However, these resource-intensive pipelines train VLMs from scratch and struggle to meet realistic data demands. In this paper, we present a new perspective that seeks to directly eliminate the harmful effects of NC in pre-trained VLMs. Specifically, we propose NCU, a Noisy Correspondence Unlearning fine-tuning framework that efficiently enhances VLMs' robustness by forgetting learned noisy knowledge. The key to NCU is learning the hardest negative…
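The abstract is cut off before it explains how NCU uses hardest negatives, so the following does not reproduce the paper's method. It is only a minimal sketch of what the terms "noisy correspondence" and "hardest negative" usually mean in contrastive image-text learning: for each image, the hardest negative is the non-paired caption with the highest similarity. The toy embeddings, the function names, and the flagging heuristic (paired similarity lower than hardest-negative similarity) are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hardest_negatives(image_embs, text_embs):
    """For each image i, return (paired_sim, neg_index, neg_sim).

    The caption at index i is treated as the paired caption; the hardest
    negative is the most similar caption among all other indices.
    """
    results = []
    for i, img in enumerate(image_embs):
        sims = [cosine(img, txt) for txt in text_embs]
        neg = max((j for j in range(len(sims)) if j != i), key=lambda j: sims[j])
        results.append((sims[i], neg, sims[neg]))
    return results

# Hypothetical 2-D embeddings (illustrative only, not from any real model).
# Pair 2 is deliberately mismatched: its caption points away from its image.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.5]]

results = hardest_negatives(image_embs, text_embs)

# Illustrative heuristic (an assumption, not the paper's criterion): a pair
# looks suspicious when some other caption matches the image better.
suspicious = [i for i, (paired, _, neg_sim) in enumerate(results) if paired < neg_sim]

for i, (paired, neg, neg_sim) in enumerate(results):
    print(f"image {i}: paired sim {paired:.3f}, hardest negative {neg} (sim {neg_sim:.3f})")
print("suspicious pairs:", suspicious)
```

In this toy batch only the third pair is flagged, because caption 0 is far closer to image 2 than its own caption is. Real pipelines would work on embeddings from an actual encoder and would need a more careful criterion than this single comparison.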
