TL;DR
This paper introduces a novel method leveraging pre-trained visual-language models to correct label-image mismatches in long-tailed, noisy datasets, significantly improving deep model robustness.
Contribution
It proposes Weak Teacher Supervision (WTS), utilizing cross-modal alignment to address label noise and distribution biases in long-tailed visual recognition.
Findings
WTS outperforms existing methods on synthetic and real-world datasets.
WTS maintains robustness under high-noise label conditions.
The approach effectively corrects label-image mismatches using auxiliary text information.
Abstract
Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, thereby limiting its effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, although it exhibits limited accuracy. Therefore, the activation of WTS is…
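The cross-modal idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract is truncated before it explains how WTS is activated, so the decision rule here is an assumption. The sketch supposes that a pre-trained visual-language model has already produced one embedding per image and one text embedding per class name. The function names, the `temperature` value, and the `threshold` value are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weak_teacher_probs(image_emb, text_embs, temperature=0.07):
    """Softmax over image-text similarities (temperature is an assumed value)."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def flag_mismatch(image_emb, text_embs, observed_label, threshold=0.5):
    """Return (teacher_label, mismatch).

    mismatch is True when the teacher disagrees with the observed label
    and is at least `threshold` confident (a hypothetical rule, not the
    activation criterion from the paper).
    """
    probs = weak_teacher_probs(image_emb, text_embs)
    teacher_label = max(range(len(probs)), key=probs.__getitem__)
    confident = probs[teacher_label] >= threshold
    return teacher_label, (teacher_label != observed_label and confident)

# Toy embeddings: three classes, one image that aligns with class 0
# but carries the (noisy) observed label 2.
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]
print(flag_mismatch(image_emb, class_text_embs, observed_label=2))
```

Because the teacher's prediction depends only on the image and the class-name text, it does not use the observed labels or the class frequencies of the training set, which is the property the abstract attributes to WTS.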
