Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

James Oldfield; Shawn Im; Sharon Li; Mihalis A. Nicolaou; Ioannis Patras; Grigorios G Chrysos

arXiv:2505.21364·cs.LG·January 15, 2026

TL;DR

This paper introduces Mixture of Decoders (MxDs), a layer-level sparsity method for dense neural networks that maintains high accuracy and interpretability by expanding layers into specialized sublayers with full-rank weights.

Contribution

The paper proposes MxDs, a novel layer-level sparsity approach that preserves model expressiveness and improves interpretability without sacrificing accuracy in language models.

Findings

01  MxDs outperform state-of-the-art methods on the sparsity-accuracy trade-off.

02  MxDs preserve the expressive capacity of dense layers under heavy sparsity.

03  MxDs learn specialized features similar to natural language representations.

Abstract

Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity but fail to faithfully reconstruct the original mapping, significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights, preserving the original decoders' expressive capacity even under heavy…
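
The abstract does not spell out the factorization, so the following is only a minimal NumPy sketch of one way a layer of this kind could be built, assuming a Hadamard-product form in which sublayer n has decoder weights `C * D[n]`. The names `E`, `G`, `C`, `D`, and `k`, the ReLU nonlinearity, and the top-k gating are all illustrative assumptions, not details taken from the paper. The sketch shows that the factorized forward pass matches an explicit sum over the active sublayers.

```python
import numpy as np


def topk_mask(scores, k):
    """Keep the k largest entries of a 1-D score vector and zero the rest."""
    out = np.zeros_like(scores)
    idx = np.argpartition(scores, -k)[-k:]
    out[idx] = scores[idx]
    return out


def mxd_forward(x, E, G, C, D, k):
    """Factorized forward pass (hypothetical parameterization).

    x: (I,) input      E: (I, H) shared encoder   G: (I, N) sublayer gate
    C: (H, O) shared decoder base                 D: (N, O) per-sublayer scales
    """
    z = np.maximum(E.T @ x, 0.0)   # shared hidden activations, shape (H,)
    a = topk_mask(G.T @ x, k)      # sparse sublayer gates, shape (N,)
    y = (C.T @ z) * (D.T @ a)      # never materializes the N weight matrices
    return y, z, a


def mxd_forward_explicit(z, a, C, D):
    """Reference path: build each active sublayer's (H, O) decoder and sum."""
    y = np.zeros(C.shape[1])
    for n in np.flatnonzero(a):
        W_n = C * D[n]             # W_n[h, o] = C[h, o] * D[n, o]
        y += a[n] * (W_n.T @ z)
    return y
```

Under this assumed form, each sublayer's decoder equals `C` with its columns rescaled, so it keeps the rank of `C` whenever the scales in `D[n]` are nonzero. The parameter count grows as H·O + N·O rather than N·H·O, which is what makes a very large number of sublayers affordable in this sketch.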

Taxonomy

Topics: Adversarial Robustness in Machine Learning