Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

Aly Kassem; Thomas Jiralerspong; Negar Rostamzadeh; Golnoosh Farnadi

arXiv:2603.04426·cs.LG·March 6, 2026

Open Access

TL;DR

Delta-Crosscoder is a crosscoder-based model diffing method that identifies, and helps mitigate, the localized behavioral changes that narrow fine-tuning introduces in a model.

Contribution

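Delta-Crosscoder pairs BatchTopK sparsity with a delta-based loss that prioritizes directions changing between the base and fine-tuned models. This makes crosscoders effective in narrow fine-tuning regimes, where existing formulations struggle, and it outperforms SAE-based baselines.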

Findings

01. Reliable identification of fine-tuning-induced directions across diverse models

02. Outperforms SAE-based baselines in isolating causal latent directions

03. Effective mitigation of fine-tuned behavioral changes in models

Abstract

Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming…
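
The abstract names two ingredients, BatchTopK sparsity and a delta-based loss, without giving their exact form on this page. The block below is a minimal NumPy sketch of how such pieces could fit together, not the authors' implementation. The function names, the weight matrices, the `delta_weight` parameter, and in particular the form of the delta term (squared error between the true and reconstructed activation difference on matched inputs) are all illustrative assumptions.

```python
import numpy as np


def batch_topk(acts, k):
    """Keep roughly the k * batch_size largest activations across the whole
    batch and zero the rest (exact ties at the threshold are all kept)."""
    batch_size = acts.shape[0]
    n_keep = min(k * batch_size, acts.size)
    if n_keep == 0:
        return np.zeros_like(acts)
    # The n_keep-th largest value over the flattened batch acts as the threshold.
    threshold = np.partition(acts.ravel(), -n_keep)[-n_keep]
    return np.where(acts >= threshold, acts, 0.0)


def crosscoder_losses(x_base, x_ft, W_enc_base, W_enc_ft,
                      W_dec_base, W_dec_ft, k, delta_weight=1.0):
    """Hypothetical crosscoder objective with an added delta term.

    x_base, x_ft: paired activations on the same inputs, shape (batch, d_model).
    W_enc_*: (d_model, n_latents); W_dec_*: (n_latents, d_model).
    """
    # Shared latents are computed from both models' activations.
    pre = x_base @ W_enc_base + x_ft @ W_enc_ft
    latents = batch_topk(np.maximum(pre, 0.0), k)

    # Each model has its own decoder over the shared dictionary.
    rec_base = latents @ W_dec_base
    rec_ft = latents @ W_dec_ft

    # Standard reconstruction error, summed over both models.
    recon = np.mean(np.sum((x_base - rec_base) ** 2, axis=1)
                    + np.sum((x_ft - rec_ft) ** 2, axis=1))

    # ASSUMED delta term: how well the dictionary reconstructs the
    # difference between fine-tuned and base activations.
    delta_true = x_ft - x_base
    delta_rec = rec_ft - rec_base
    delta = np.mean(np.sum((delta_true - delta_rec) ** 2, axis=1))

    return recon + delta_weight * delta, recon, delta
```

In this sketch, raising `delta_weight` pushes the dictionary toward explaining what differs between the two models rather than what they share, which is one plausible reading of "prioritizing directions that change between models".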

Taxonomy

Topics: Child and Animal Learning Development · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning