Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes
Aly Kassem, Thomas Jiralerspong, Negar Rostamzadeh, Golnoosh Farnadi

TL;DR
Delta-Crosscoder is a crosscoder-based model diffing method that identifies and mitigates the localized behavioral changes introduced by narrow fine-tuning, improving interpretability and control.
Contribution
It combines BatchTopK sparsity with a delta-based loss, plus an implicit contrastive signal from paired activations on matched inputs, so that crosscoders remain effective in narrow fine-tuning regimes and outperform SAE-based baselines.
Findings
Reliable identification of fine-tuning-induced directions across diverse models
Outperforms SAE-based baselines in isolating causal latent directions
Effective mitigation of behavioral changes in models
Abstract
Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming…
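The two ingredients named in the abstract, BatchTopK sparsity and a delta-based loss, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not give the loss formula, so the function `delta_weighted_loss`, its `delta_weight` parameter, and the choice to add a squared-error term on the difference between fine-tuned and base activations are all assumptions made for illustration. The `batch_topk` helper follows the general BatchTopK idea of keeping the `k * batch_size` largest latent activations across the whole batch rather than `k` per sample; restricting selection to positive activations is likewise an assumption.

```python
def batch_topk(latents, k):
    """Keep the k * batch_size largest positive activations across the batch.

    `latents` is a list of rows (one row of latent activations per input).
    Everything outside the batch-wide budget is set to 0.0.
    """
    budget = k * len(latents)
    # Flatten to (value, row, col); only positive activations are eligible
    # (assumption: ReLU-style non-negative latents).
    flat = [
        (v, i, j)
        for i, row in enumerate(latents)
        for j, v in enumerate(row)
        if v > 0
    ]
    flat.sort(key=lambda t: t[0], reverse=True)
    kept = {(i, j) for _, i, j in flat[:budget]}
    return [
        [v if (i, j) in kept else 0.0 for j, v in enumerate(row)]
        for i, row in enumerate(latents)
    ]


def sq_err(x, y):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def delta_weighted_loss(base, ft, base_hat, ft_hat, delta_weight=1.0):
    """Hypothetical delta-weighted reconstruction loss for one paired input.

    base, ft         : activations of the base and fine-tuned model
    base_hat, ft_hat : the crosscoder's reconstructions of each
    delta_weight     : assumed knob for how strongly to prioritize the
                       change between models (not from the paper)
    """
    recon = sq_err(base, base_hat) + sq_err(ft, ft_hat)
    delta = [f - b for f, b in zip(ft, base)]
    delta_hat = [f - b for f, b in zip(ft_hat, base_hat)]
    return recon + delta_weight * sq_err(delta, delta_hat)


# With k=1 and two inputs the budget is 2 activations for the whole batch,
# so one input may keep both while the other keeps none.
sparse = batch_topk([[0.9, 0.7, 0.0], [0.1, 0.05, 0.0]], k=1)

# A reconstruction that misses the fine-tuning change is penalized twice:
# once in the plain reconstruction term and once in the delta term.
loss = delta_weighted_loss(
    base=[1.0, 0.0], ft=[1.0, 1.0],
    base_hat=[1.0, 0.0], ft_hat=[1.0, 0.0],
    delta_weight=2.0,
)
```

Under these assumptions, the batch-wide budget lets inputs that strongly activate fine-tuning-related latents use more of the sparsity allowance, and the extra delta term raises the cost of ignoring directions that differ between the two models, which is the asymmetry that narrow fine-tuning produces.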
Taxonomy
Topics: Child and Animal Learning Development · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
