TL;DR
SpecPL introduces a spectral perspective to prompt learning for vision-language models, disentangling visual signals into semantic and granular components with counterfactual supervision to improve fine-grained discrimination.
Contribution
It proposes a novel spectral approach using a frozen VAE and counterfactual granule training to enhance prompt learning in vision-language models.
Findings
Achieves state-of-the-art performance on 11 benchmarks.
Reaches a new harmonic-mean accuracy of 81.51%.
Effectively bridges the stability-generalization gap in prompt learning.
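For readers unfamiliar with the metric, the harmonic mean in the findings above can be illustrated with a short sketch. It assumes the figure combines base-class and novel-class accuracy, as is conventional in base-to-novel prompt-learning evaluation; the summary itself does not say which two quantities are averaged. The function name and the input values are hypothetical and are not results from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of two accuracies (same units in, same units out).

    Assumption: the two inputs are base-class and novel-class accuracy.
    The harmonic mean penalizes imbalance, so a model that overfits the
    base classes scores lower than one with the same arithmetic mean
    but balanced accuracies.
    """
    total = base_acc + novel_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * base_acc * novel_acc / total


# Hypothetical accuracies, chosen only to show the effect of imbalance.
balanced = harmonic_mean(75.0, 75.0)    # arithmetic mean 75
imbalanced = harmonic_mean(90.0, 60.0)  # arithmetic mean 75 as well
print(balanced, imbalanced)
```

Both hypothetical pairs have the same arithmetic mean, but the imbalanced pair yields a lower harmonic mean, which is why the metric is a common proxy for the trade-off between fitting seen classes and generalizing to unseen ones.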
Abstract
Existing prompt learning for VLMs exhibits a modality asymmetry: it predominantly optimizes text tokens while still relying on a frozen visual encoder as a holistic extractor, neglecting the spectral granularity essential for fine-grained discrimination. To bridge this gap, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity…
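The idea of splitting a signal into low- and high-frequency parts and then permuting the high-frequency parts can be sketched as follows. This is a minimal illustration, not the paper's method: it swaps in a plain FFT low-pass mask where the abstract describes a frozen VAE, and it permutes high-frequency residuals across a batch with a fixed cyclic shift. The function names, the `radius` parameter, and the single-channel image layout are all assumptions made for the example.

```python
import numpy as np


def split_frequencies(images: np.ndarray, radius: float):
    """Split a batch of shape (N, H, W) into low- and high-frequency parts.

    Assumption: a circular FFT mask stands in for the VAE-based
    decomposition described in the abstract.
    """
    _, h, w = images.shape
    # Integer frequency indices along each axis (centered at zero).
    fy = np.fft.fftfreq(h)[:, None] * h
    fx = np.fft.fftfreq(w)[None, :] * w
    low_mask = np.sqrt(fy**2 + fx**2) <= radius  # True for low frequencies
    spectrum = np.fft.fft2(images, axes=(-2, -1))
    low = np.fft.ifft2(spectrum * low_mask, axes=(-2, -1)).real
    high = images - low  # residual carries the fine detail
    return low, high


def counterfactual_batch(images: np.ndarray, radius: float):
    """Pair each image's low-frequency part with another image's detail.

    A cyclic shift guarantees that no image keeps its own
    high-frequency component (for batches with more than one image).
    """
    low, high = split_frequencies(images, radius)
    perm = np.roll(np.arange(images.shape[0]), 1)
    return low + high[perm], perm


rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 16, 16))  # synthetic stand-in for images
counterfactuals, perm = counterfactual_batch(batch, radius=3)
print(counterfactuals.shape, perm)
```

Each counterfactual keeps the coarse content of its source image but carries the fine detail of a different one, so a model asked to tell the original from the counterfactual can only succeed by attending to the high-frequency component.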
