Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs
Thomas Jiralerspong, Trenton Bricken

TL;DR
This paper introduces a novel method called Dedicated Feature Crosscoders (DFCs) for unsupervised cross-architecture model diffing, enabling the comparison of diverse large language models to uncover meaningful behavioral differences.
Contribution
It is the first to apply crosscoders to cross-architecture model diffing and proposes DFCs to better isolate unique features across different model architectures.
Findings
Identified Chinese Communist Party alignment in Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B
Detected American exceptionalism in Llama3.1-8B-Instruct
Uncovered a copyright refusal mechanism in GPT-OSS-20B
Abstract
Model diffing, the process of comparing models' internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far been primarily focused on comparing a base model with its finetune. Since new LLM releases are often novel architectures, cross-architecture methods are essential to make model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing but have only ever been applied to base vs finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we find in an unsupervised fashion features including Chinese Communist Party alignment in…
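The abstract does not spell out how DFCs isolate model-specific features, so the following is only a minimal sketch under a stated assumption: that the latent dictionary is partitioned into A-only, B-only, and shared latents, enforced here by zero-masking encoder and decoder weights. The class name ToyCrosscoder, the helper relative_decoder_norm, the latent layout, and all sizes are hypothetical and are not the authors' implementation. Training (reconstruction loss plus a sparsity penalty) is omitted.

```python
import numpy as np


class ToyCrosscoder:
    """Toy two-model crosscoder with an ASSUMED dedicated-latent partition."""

    def __init__(self, d_a, d_b, n_a_only, n_b_only, n_shared, seed=0):
        rng = np.random.default_rng(seed)
        n = n_a_only + n_b_only + n_shared
        # Assumed latent layout: [A-only | B-only | shared].
        self.mask_a = np.ones(n)
        self.mask_a[n_a_only:n_a_only + n_b_only] = 0.0  # A never touches B-only latents
        self.mask_b = np.ones(n)
        self.mask_b[:n_a_only] = 0.0                      # B never touches A-only latents
        # Per-model weights; d_a and d_b may differ (cross-architecture case).
        self.W_enc_a = rng.normal(scale=0.1, size=(d_a, n)) * self.mask_a
        self.W_enc_b = rng.normal(scale=0.1, size=(d_b, n)) * self.mask_b
        self.W_dec_a = rng.normal(scale=0.1, size=(n, d_a)) * self.mask_a[:, None]
        self.W_dec_b = rng.normal(scale=0.1, size=(n, d_b)) * self.mask_b[:, None]
        self.b_enc = np.zeros(n)

    def encode(self, x_a, x_b):
        # One shared latent vector, built by summing both models' contributions.
        pre = x_a @ self.W_enc_a + x_b @ self.W_enc_b + self.b_enc
        return np.maximum(pre, 0.0)  # ReLU

    def decode(self, f):
        # Each model's activations are reconstructed from the same latents.
        return f @ self.W_dec_a, f @ self.W_dec_b


def relative_decoder_norm(cc, eps=1e-12):
    """Per-latent share of decoder norm belonging to model A (near 1: A-only, near 0: B-only)."""
    norm_a = np.linalg.norm(cc.W_dec_a, axis=1)
    norm_b = np.linalg.norm(cc.W_dec_b, axis=1)
    return norm_a / (norm_a + norm_b + eps)


# Usage with random stand-in activations (not real model activations).
cc = ToyCrosscoder(d_a=6, d_b=4, n_a_only=2, n_b_only=3, n_shared=5)
rng = np.random.default_rng(1)
x_a, x_b = rng.normal(size=(7, 6)), rng.normal(size=(7, 4))
f = cc.encode(x_a, x_b)
x_a_hat, x_b_hat = cc.decode(f)
```

In a vanilla crosscoder the masks would be all ones, and exclusivity would have to emerge during training and be read off afterwards from a statistic such as the relative decoder norm. Under the partition assumed here, exclusivity holds by construction, provided the masks are re-applied after every gradient update.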
Peer Reviews
Decision · Submitted to ICLR 2026
Important problem: systematic methods for model diffing across different architectures are valuable.
Limited novelty w.r.t. Crosscoders, amplified with negligible gains. The method appears to be a modest modification of existing Crosscoders to encourage exclusivity. The paper does not convincingly argue why vanilla Crosscoders are fundamentally unable to isolate exclusive features, nor provide theory or diagnostics showing the failure mode that DFC fixes. Empirical gains (e.g., Fig. 3) are very small, questioning the modification further. Finally, as experiments show, existing Crosscoders work
- Clear motivation: model diffing to discover unknown behaviors not covered by evaluation suites
- Novel approach, very timely and original
- Clear explanations of shortcomings of previous methods
- Section 3.2.1 includes too much detail.
- The motivation for the design change of DFCs is not clear enough.
- A feature is not the same as a (propensity for a certain) behavior. How do you make sure that there are no other semantically different features that the method does not catch? Features and concepts are not a 1:1 match.
- Section 3.3 is the most interesting but also entirely qualitative; are there any quantitative metrics you might report as well?
This paper presents an alternative architecture for model diffing.
1. The study reported in this paper is not well-motivated. It is unclear why we need to diff models of different architectures. Diffing base models with their fine-tunes may help us understand the impacts of the fine-tuning. But this need is obviously not applicable for cross-model diffing.
2. The main part of this paper focuses on identifying model-exclusive features. First, I think identifying model-exclusive features is not specific to cross-model diffing, but also applicable to diffing base
Taxonomy
Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Software Engineering Research
