TL;DR
This paper introduces SAE-FT, a sparse-autoencoder-based fine-tuning method for CLIP models that improves robustness and interpretability without sacrificing downstream performance, while remaining computationally efficient.
Contribution
SAE-FT is a novel fine-tuning approach that regularizes visual representations using sparse autoencoders, preventing catastrophic forgetting and enabling interpretability.
Findings
SAE-FT matches or exceeds state-of-the-art on ImageNet benchmarks.
It improves robustness against distribution shifts.
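The method is computationally efficient: it operates only on the model's visual representations and avoids the text guidance that other approaches rely on.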
Abstract
Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient,…
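The regularizer described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the exact loss is not given in the text above, so the ReLU encoder, the function names (`sae_encode`, `feature_change_penalty`), the weights `w_add` and `w_remove`, and the random toy data are all assumptions made for illustration. The idea shown is only this: encode the pre-trained and fine-tuned image embeddings with the same frozen sparse autoencoder, then penalize features whose activation grows (added) or shrinks (removed).

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Hypothetical SAE encoder: ReLU(x @ W_enc + b_enc) yields non-negative feature activations."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def feature_change_penalty(emb_pre, emb_ft, W_enc, b_enc, w_add=1.0, w_remove=1.0):
    """Assumed form of the regularizer: penalize SAE features that are
    added to or removed from the embedding during fine-tuning."""
    z_pre = sae_encode(emb_pre, W_enc, b_enc)   # features of the frozen pre-trained model
    z_ft = sae_encode(emb_ft, W_enc, b_enc)     # features of the model being fine-tuned
    added = np.maximum(z_ft - z_pre, 0.0)       # activation that increased
    removed = np.maximum(z_pre - z_ft, 0.0)     # activation that decreased
    per_example = w_add * added.sum(axis=-1) + w_remove * removed.sum(axis=-1)
    return per_example.mean()

# Toy data standing in for CLIP image embeddings and a trained SAE (all random).
rng = np.random.default_rng(0)
d_model, n_features, batch = 8, 32, 4
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
emb_pre = rng.normal(size=(batch, d_model))
emb_ft = emb_pre + 0.1 * rng.normal(size=(batch, d_model))  # simulated fine-tuning drift

penalty_same = feature_change_penalty(emb_pre, emb_pre, W_enc, b_enc)
penalty_drift = feature_change_penalty(emb_pre, emb_ft, W_enc, b_enc)
print(penalty_same, penalty_drift)
```

In a training loop, a term like this would be added to the task loss, and the per-feature `added` and `removed` arrays are what would make the semantic changes inspectable. Separate weights for additions and removals are a design choice of this sketch, not something the abstract specifies.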
