
TL;DR
This paper introduces a topology-aware tuning method for CLIP that aligns the topological structures of visual and textual representations, significantly improving few-shot learning performance across multiple datasets.
Contribution
It presents a novel approach integrating Representation Topology Divergence into the Task Residual framework, explicitly aligning topological structures to enhance few-shot learning.
Findings
Achieves an average accuracy improvement of 1-2% over relevant baseline methods across 6 benchmark datasets.
Effectively leverages topological information for better adaptation.
Maintains pre-trained knowledge by freezing base encoders.
Abstract
Efficiently adapting large Vision-Language Models (VLMs) like CLIP for few-shot learning poses challenges in balancing pre-trained knowledge retention and task-specific adaptation. Existing methods often overlook valuable structural information within the VLM's latent space. We introduce a topology-aware tuning approach integrating Representation Topology Divergence (RTD) into the Task Residual (TR) framework. By explicitly aligning the topological structures of visual and text representations using a combined RTD and Cross-Entropy loss, while freezing base VLM encoders, our method enhances few-shot performance. We optimize only lightweight Task Residual parameters, effectively leveraging topological information. Across 6 diverse benchmark datasets, our approach demonstrates significant gains, achieving an average accuracy improvement of 1-2% over relevant baseline methods in few-shot…
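The combined objective described in the abstract can be illustrated with a minimal, forward-pass-only sketch in plain Python. Several things here are assumptions rather than details from the paper: the residual is added to frozen text features with a scaling factor `alpha`, the logit scale is 100, the structure term compares per-class visual prototypes with class text features, and the weight `lam` balances the two terms. Most importantly, `structure_gap` is a simple stand-in (a mean squared difference of pairwise distances), not Representation Topology Divergence itself, which is computed from persistence barcodes. All function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def task_residual_weights(text_feats, residual, alpha):
    """Classifier weights: frozen text features plus a scaled residual.

    Assumption: only `residual` would be trained; `text_feats` stay frozen.
    """
    return [
        normalize([t + alpha * r for t, r in zip(t_row, r_row)])
        for t_row, r_row in zip(text_feats, residual)
    ]

def cross_entropy(image_feats, weights, labels, scale=100.0):
    """Mean cross-entropy over scaled dot-product logits."""
    total = 0.0
    for feat, label in zip(image_feats, labels):
        logits = [scale * sum(a * b for a, b in zip(feat, w)) for w in weights]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[label]
    return total / len(labels)

def structure_gap(visual_protos, weights):
    """Stand-in for RTD (NOT RTD itself): mean squared difference between
    the pairwise-distance matrices of the two point clouds."""
    n = len(weights)
    gap = 0.0
    for i in range(n):
        for j in range(n):
            d_vis = math.dist(visual_protos[i], visual_protos[j])
            d_txt = math.dist(weights[i], weights[j])
            gap += (d_vis - d_txt) ** 2
    return gap / (n * n)

def combined_loss(image_feats, labels, visual_protos, text_feats, residual,
                  alpha=0.5, lam=1.0):
    """Cross-entropy plus a weighted structure-alignment term."""
    weights = task_residual_weights(text_feats, residual, alpha)
    ce = cross_entropy(image_feats, weights, labels)
    return ce + lam * structure_gap(visual_protos, weights)

# Toy example: 2 classes, 2-dimensional features (made-up numbers).
text_feats = [[1.0, 0.0], [0.0, 1.0]]
residual = [[0.0, 0.0], [0.0, 0.0]]  # zero residual leaves the text features unchanged
image_feats = [normalize([0.9, 0.1]), normalize([0.2, 0.8])]
labels = [0, 1]
visual_protos = image_feats  # one image per class, so each image is its class prototype
loss = combined_loss(image_feats, labels, visual_protos, text_feats, residual)
print(loss)
```

In an actual training loop, an autodiff framework would compute gradients of this loss with respect to the residual only, leaving both encoders untouched, and a real RTD implementation would replace `structure_gap`.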
Taxonomy
Topics: Geophysical Methods and Applications
Methods: Balanced Selection · Contrastive Language-Image Pre-training
