Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

Thomas Jiralerspong; Trenton Bricken

arXiv:2602.11729·cs.AI·February 13, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Dedicated Feature Crosscoders (DFCs), a method for unsupervised cross-architecture model diffing that compares large language models with different architectures to uncover meaningful behavioral differences.

Contribution

It is the first to apply crosscoders to cross-architecture model diffing and proposes DFCs to better isolate unique features across different model architectures.

Findings

01

Identified Chinese Communist Party alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B

02

Detected American exceptionalism in Llama3.1-8B-Instruct

03

Uncovered a copyright refusal mechanism in GPT-OSS-20B

Abstract

Model diffing, the process of comparing models' internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far been primarily focused on comparing a base model with its finetune. Since new LLM releases are often novel architectures, cross-architecture methods are essential to make model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing but have only ever been applied to base vs finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we find in an unsupervised fashion features including Chinese Communist Party alignment in…
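To make the idea in the abstract concrete, here is a minimal toy sketch of a crosscoder whose latent dictionary is split into features dedicated to one model and features shared by both. Everything in it is an assumption for illustration: the class name `ToyDedicatedCrosscoder`, the three-way partition, the choice to zero both encoder and decoder weights for the "other" model, and the random untrained weights are hypothetical and are not taken from the paper. The two models are given different activation widths (`d_a`, `d_b`) to mirror the cross-architecture setting.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToyDedicatedCrosscoder:
    """Toy, untrained sketch; the partition scheme is an assumption."""

    def __init__(self, d_a, d_b, n_a_only, n_b_only, n_shared, seed=0):
        rng = random.Random(seed)
        # One (uses_a, uses_b) flag pair per latent feature.
        self.masks = ([(1, 0)] * n_a_only      # dedicated to model A
                      + [(0, 1)] * n_b_only    # dedicated to model B
                      + [(1, 1)] * n_shared)   # shared by both models
        n = len(self.masks)
        # Encoders: one row per latent; rows are zeroed where the mask is 0.
        self.enc_a = [[rng.gauss(0, 0.1) * m[0] for _ in range(d_a)]
                      for m in self.masks]
        self.enc_b = [[rng.gauss(0, 0.1) * m[1] for _ in range(d_b)]
                      for m in self.masks]
        # Decoders: one column per latent; columns are zeroed where the mask is 0.
        self.dec_a = [[rng.gauss(0, 0.1) * self.masks[j][0] for j in range(n)]
                      for _ in range(d_a)]
        self.dec_b = [[rng.gauss(0, 0.1) * self.masks[j][1] for j in range(n)]
                      for _ in range(d_b)]

    def encode(self, x_a, x_b):
        """Sum both models' encoder pre-activations, then apply ReLU."""
        pre_a = matvec(self.enc_a, x_a)
        pre_b = matvec(self.enc_b, x_b)
        return [max(0.0, a + b) for a, b in zip(pre_a, pre_b)]

    def decode(self, z):
        """Reconstruct each model's activations from the same latent vector."""
        return matvec(self.dec_a, z), matvec(self.dec_b, z)


# Usage: activations from two hypothetical models with different widths.
cc = ToyDedicatedCrosscoder(d_a=3, d_b=4, n_a_only=2, n_b_only=2, n_shared=3)
z = cc.encode([1.0, 0.0, -1.0], [0.5, 0.5, 0.5, 0.5])
rec_a, rec_b = cc.decode(z)
```

In this sketch a latent dedicated to model A can never contribute to model B's reconstruction, because its decoder column for B is zero by construction. A real crosscoder would also include biases, a sparsity penalty, and a training loop, all omitted here.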

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 0 · Confidence 4

Strengths

Important problem: systematic methods for model diffing across different architectures are valuable.

Weaknesses

Limited novelty w.r.t. Crosscoders, compounded by negligible gains. The method appears to be a modest modification of existing Crosscoders to encourage exclusivity. The paper neither argues convincingly that vanilla Crosscoders are fundamentally unable to isolate exclusive features nor provides theory or diagnostics showing the failure mode that DFC fixes. Empirical gains (e.g., Fig. 3) are very small, which further calls the modification into question. Finally, as experiments show, existing Crosscoders work

Reviewer 02 · Rating 8 · Confidence 2

Strengths

- Clear motivation: model diffing to discover unknown behaviors not covered by evaluation suites
- Novel approach, very timely and original
- Clear explanations of shortcomings of previous methods

Weaknesses

- Section 3.2.1 includes too much detail
- The motivation for the design change in DFCs is not clear enough
- A feature is not the same as a (propensity for a certain) behavior. How do you make sure that there are no other semantically different features that the method does not catch? Features and concepts are not a 1:1 match
- Section 3.3 is the most interesting but also entirely qualitative; are there any quantitative metrics you might report as well?

Reviewer 03 · Rating 2 · Confidence 3

Strengths

This paper presents an alternative architecture for model diffing.

Weaknesses

1. The study reported in this paper is not well-motivated. It is unclear why we need to diff models of different architectures. Diffing base models with their fine-tunes may help us understand the impacts of the fine-tuning. But this need is obviously not applicable for cross-model diffing.
2. The main part of this paper focuses on identifying model-exclusive features. First, I think identifying model-exclusive features is not specific to cross-model diffing, but also applicable to diffing base

Taxonomy

Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Software Engineering Research