Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition
Yufei Peng, Yonggang Zhang, Yiu-ming Cheung

TL;DR
This paper introduces Sage, a semantic-guided fine-tuning method for foundation models that improves long-tailed visual recognition by aligning visual and textual modalities and addressing distribution mismatch bias.
Contribution
The paper proposes a novel SG-Adapter and a distribution mismatch-aware compensation factor to enhance semantic alignment and rectify bias in long-tailed visual recognition.
Findings
Sage significantly improves performance on long-tailed datasets.
Semantic guidance enhances visual-textual alignment.
The compensation factor effectively reduces prediction bias.
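As a point of reference for what compensating prediction bias can look like in long-tailed recognition, the sketch below applies generic post-hoc logit adjustment, which subtracts a scaled log class prior from each logit. This is a well-known stand-in technique and not the paper's distribution mismatch-aware compensation factor, whose form this summary does not specify. The function names, the `tau` parameter, and the class counts and logits are all illustrative assumptions.

```python
import math


def adjust_logits(logits, class_counts, tau=1.0):
    """Generic post-hoc logit adjustment (NOT the paper's compensation factor).

    Subtracts tau * log(prior) from each class logit, where the prior is the
    class's share of the training set. Rare classes have a small prior, so
    their logits are raised relative to frequent classes.
    """
    total = sum(class_counts)
    return [z - tau * math.log(n / total) for z, n in zip(logits, class_counts)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical 3-class problem: class 0 is a head class, class 2 a tail class.
class_counts = [900, 90, 10]
logits = [2.0, 1.0, 1.5]

raw_pred = argmax(logits)                               # prediction before adjustment
adj_pred = argmax(adjust_logits(logits, class_counts))  # prediction after adjustment
```

With these made-up numbers the unadjusted prediction favors the head class, while the adjusted prediction shifts to the tail class. Setting `tau=0` leaves the logits unchanged.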
Abstract
The variance in class-wise sample sizes within long-tailed scenarios often results in degraded performance in less frequent classes. Fortunately, foundation models, pre-trained on vast open-world datasets, demonstrate strong potential for this task due to their generalizable representation, which promotes the development of adaptive strategies on pre-trained models in long-tailed learning. Advanced fine-tuning methods typically adjust visual encoders while neglecting the semantics derived from the frozen text encoder, overlooking the visual and textual alignment. To strengthen this alignment, we propose a novel approach, Semantic-guided fine-tuning of foundation model for long-tailed visual recognition (Sage), which incorporates semantic guidance derived from textual modality into the visual fine-tuning process. Specifically, we introduce an SG-Adapter that integrates class descriptions…
Taxonomy
Topics: Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction · Advanced Image and Video Retrieval Techniques
