SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

Qi Zhang; Yifei Wang; Xiaohan Wang; Jiajun Chai; Guojun Yin; Wei Lin; Yisen Wang

arXiv:2603.02908·cs.AI·March 4, 2026

Open Access·3 Reviews

TL;DR

This paper introduces the SAE-based Transferability Score (STS), a novel metric using sparse autoencoders to predict how well large language models will transfer across domains after post-training, without requiring actual fine-tuning.

Contribution

The paper proposes STS, an interpretable metric built on sparse autoencoders that forecasts LLM transferability across domains before training, addressing the poorly understood question of how post-training model shifts carry over to other domains.

Findings

01. STS achieves a Pearson correlation above 0.7 with actual transfer performance.

02. It accurately predicts transferability in supervised fine-tuning scenarios.

03. Initial steps show potential for extending STS to reinforcement learning.
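To make the first finding concrete, the following is a minimal sketch of how a Pearson correlation between a predicted score and an observed performance change could be computed. The function name `pearson` and the two lists are illustrative assumptions; the numbers are made-up placeholders, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Placeholder values for illustration only (NOT taken from the paper):
# one predicted score and one observed performance change per domain.
predicted_scores = [0.10, 0.35, 0.20, 0.60, 0.45]
observed_changes = [0.5, 2.1, 1.0, 3.9, 2.4]

r = pearson(predicted_scores, observed_changes)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 would mean that domains with higher predicted scores also tend to show larger observed changes; the threshold of 0.7 quoted above is the paper's reported level, not something this toy data demonstrates.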

Abstract

In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability before…
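The two steps the abstract names (finding shifted SAE dimensions, then relating them to downstream domains) can be illustrated with a toy sketch. This is not the paper's formula, which the truncated abstract does not give: the top-k selection rule, the activation-share score, the helper names `top_shifted_dims` and `domain_score`, and the domain labels are all assumptions made for illustration, and the vectors are invented.

```python
def top_shifted_dims(before, after, k):
    """Indices of the k SAE dimensions whose mean activation changed most.

    Assumption: `before` and `after` are mean SAE activations of the same
    model under two conditions, given as equal-length lists of floats.
    """
    shifts = [abs(a - b) for b, a in zip(before, after)]
    ranked = sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)
    return ranked[:k]

def domain_score(shifted, domain_acts):
    """Share of a domain's total SAE activation that falls on shifted dims.

    Assumption: activations are non-negative (e.g. post-ReLU), so the sum
    can be treated as a total activation mass.
    """
    total = sum(domain_acts)
    if total == 0:
        return 0.0
    return sum(domain_acts[i] for i in shifted) / total

# Invented toy vectors over six SAE dimensions (not data from the paper).
before = [0.1, 0.0, 0.5, 0.2, 0.0, 0.3]
after = [0.1, 0.9, 0.5, 0.8, 0.0, 0.3]
shifted = top_shifted_dims(before, after, k=2)

# Hypothetical downstream domains with their mean SAE activations.
domains = {
    "math": [0.0, 0.7, 0.1, 0.6, 0.0, 0.1],
    "history": [0.4, 0.0, 0.5, 0.1, 0.3, 0.2],
}
scores = {name: domain_score(shifted, acts) for name, acts in domains.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In this toy setup, a domain whose activations concentrate on the shifted dimensions receives a higher score, which is the intuition behind predicting which domains a given post-training run is likely to affect.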

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

* Addresses an interesting and timely problem: forecasting post-training transfer effects for LLMs
* Builds on recent interpretability work with SAEs
* Correlations in the experiments are consistent, suggesting the metric captures a genuine phenomenon

Weaknesses

* Limited conceptual novelty. Reframes existing ideas on representation drift and feature correlation under transferability and SAEs. Lacks a theoretical link between SAE features and fine-tuning.
* Limited empirical evidence. While the experiments presented are indicative of some trend, the central hypothesis is only tested on one dataset and adaptation direction. The scope is too narrow to lend credible evidence to a correlation between ICL feature drift and SFT.
* Lack of baselines. No comparis

Reviewer 02·Rating 4·Confidence 2

Strengths

1. The reviewer finds the topic interesting and timely, and the proposed technique could be useful in LLM training workflows.
2. The experiment design is generally sound, and the application demonstrated in section 5 is well-motivated.

Weaknesses

The reviewer feels that this paper has the potential to have greater impact, but is limited by its current presentation and the scope of the experiments.

1. Details on the experiment setup/results are lacking.
   - It appears that all experiments are performed once; ideally, they should be repeated over different random seeds (e.g., for initialization and train/test split), with the mean and standard deviation reported.
   - Since SAEs should be interpretable, a qualitative analysis of the identified SAE activations would b

Reviewer 03·Rating 4·Confidence 3

Strengths

S1: The paper focuses on an interesting problem of forecasting changes in downstream task performance without additional training.
S2: The proposed fine-tuning-free approach for estimating transferability is both novel and useful.
S3: The investigation of SAE dimension shifts under both fine-tuning and in-context learning (ICL) is insightful.
S4: The paper evaluates the proposed STS method across multiple major open models (Qwen2.5-7B, Llama3-8B, Gemma2-9B) and demonstrates consistently high

Weaknesses

W1: The study is limited to a single training dataset (LIMO) and a single evaluation benchmark (MMLU-Pro). Broader domains (e.g., dialogue, code generation) and larger model scales remain underexplored, even though these are areas where STS would likely be most valuable.
W2: There are potential reproducibility issues, as details on the SAE architectures, training procedures, hyperparameters, and prompt templates are either missing or insufficiently explained.
W3: Methodologically, STS relies h

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education