fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
Andreas D. Demou, Panagiotis Koromilas, James Oldfield, Yannis Panagakis, Mihalis A. Nicolaou

TL;DR
This paper introduces fmxcoders, a novel method for cross-layer feature discovery in Transformers, which significantly improves the interpretability and coherence of learned features compared to standard approaches.
Contribution
The paper proposes fmxcoders, incorporating low-rank tensor factorizations and stochastic layer masking, to address structural limitations of standard crosscoders and enhance cross-layer feature recovery.
Findings
fmxcoders increase mean probing F1 by 10-30 points across models
reduce reconstruction MSE by 25-50%
double the mean functional coherence of latents
Abstract
Many features in pretrained Transformers span multiple layers: they emerge through stages of inference, persist in the residual stream, or are built jointly by parallel MLPs. Crosscoders (namely, sparse dictionaries trained jointly across layers) aim to recover these cross-layer features in a single shared latent space. We show that standard crosscoders largely fail at this purpose. Although their decoder weight norms spread evenly across layers, a functional coherence metric we introduce reveals that each latent's activation is effectively driven by only one or two layers on average. While functionally coherent latents act as human-interpretable concept detectors (e.g., US states and cities), the layer-localized latents that crosscoders predominantly learn collapse onto surface-level patterns such as digit detectors. We trace this failure to two structural limitations: unconstrained…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
