Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

Fabian Morelli; Arnas Uselis; Ankit Sonthalia; Seong Joon Oh

arXiv:2605.15961·cs.CV·May 18, 2026

TL;DR

This paper introduces SAE-FT, a sparse-autoencoder-based fine-tuning method for CLIP models that improves robustness and interpretability without sacrificing downstream performance, while remaining computationally efficient.

Contribution

SAE-FT is a novel fine-tuning approach that uses a sparse autoencoder trained on the pre-trained model to regularize changes to CLIP's visual representations, preventing catastrophic forgetting and making the fine-tuning process interpretable.

Findings

01. SAE-FT matches or exceeds state-of-the-art on ImageNet benchmarks.

02. It improves robustness against distribution shifts.

03. The method is computationally efficient.

Abstract

Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient,…
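The regularizer described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the ReLU encoder, the L1-style split into "added" and "removed" activation mass, and the names `sae_encode`, `feature_change_penalty`, `W_enc`, and `b_enc` are all assumptions made for illustration, since the abstract does not give the exact form of the penalty.

```python
import numpy as np


def sae_encode(z, W_enc, b_enc):
    """Hypothetical SAE encoder: ReLU of an affine map, giving non-negative features."""
    return np.maximum(z @ W_enc + b_enc, 0.0)


def feature_change_penalty(z_pre, z_ft, W_enc, b_enc):
    """Measure how much SAE feature activation was added or removed by fine-tuning.

    z_pre: (batch, d) visual embeddings from the frozen pre-trained encoder.
    z_ft:  (batch, d) visual embeddings from the encoder being fine-tuned.
    Returns (added, removed, total), each averaged over the batch.
    The L1-style form is an assumption, not taken from the paper.
    """
    f_pre = sae_encode(z_pre, W_enc, b_enc)
    f_ft = sae_encode(z_ft, W_enc, b_enc)
    delta = f_ft - f_pre
    # Positive part: activation that appeared or grew after fine-tuning.
    added = np.maximum(delta, 0.0).sum(axis=1).mean()
    # Negative part: activation that shrank or disappeared after fine-tuning.
    removed = np.maximum(-delta, 0.0).sum(axis=1).mean()
    return added, removed, added + removed


# Toy example with an identity encoder so features equal the (clipped) embedding.
W_enc = np.eye(2)
b_enc = np.zeros(2)
z_pre = np.array([[1.0, 0.0]])  # pre-trained embedding
z_ft = np.array([[0.5, 2.0]])   # embedding after a fine-tuning step
added, removed, total = feature_change_penalty(z_pre, z_ft, W_enc, b_enc)
```

Under these assumptions, a training loop would add `total`, scaled by some weight, to the downstream task loss. The separate `added` and `removed` terms are what would make the change inspectable per feature, in the spirit of the interpretability claim above.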

Code & Models

Repositories

Fabian-Mor/sae-ft (GitHub)
