Bilinear MLPs enable weight-based mechanistic interpretability
Michael T. Pearce, Thomas Dooms, Alice Rigg, Jose M. Oramas, Lee Sharkey

TL;DR
This paper analyzes bilinear MLPs, a type of Gated Linear Unit (GLU) layer with no element-wise nonlinearity. Their weights can be interpreted directly, which reveals how models compute across toy tasks, image classification, and language modeling.
Contribution
The paper demonstrates that bilinear MLPs allow weight-based interpretability: the layer can be expressed entirely in linear operations on a third-order tensor, so the weights themselves can be analyzed and understood.
Findings
Bilinear MLPs achieve competitive performance without element-wise nonlinearities.
Spectral analysis reveals low-rank interpretable structures in weights.
Weight-based interpretability enables analysis of adversarial examples and overfitting.
Abstract
A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use…
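The tensor form described in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the authors' code. It assumes a bias-free bilinear layer that computes P((Wx) ⊙ (Vx)); the names W, V, P, B, and u, and the small random weights, are illustrative stand-ins for a trained model. It builds the third-order tensor, confirms that it reproduces the layer output, and eigendecomposes the symmetrized interaction matrix for one output direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 6, 8, 3

# Assumption: random weights stand in for a trained bilinear MLP (no biases).
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
P = rng.normal(size=(d_out, d_hidden))  # output projection


def bilinear_mlp(x):
    """GLU without a gate nonlinearity: product of two linear maps, then a projection."""
    return P @ ((W @ x) * (V @ x))


# Third-order tensor: B[o, i, j] = sum_h P[o, h] * W[h, i] * V[h, j]
B = np.einsum("oh,hi,hj->oij", P, W, V)

x = rng.normal(size=d_in)
y_layer = bilinear_mlp(x)
y_tensor = np.einsum("oij,i,j->o", B, x, x)  # same output, computed from B alone

# Interaction matrix for an output direction u (here: the first output unit).
u = np.zeros(d_out)
u[0] = 1.0
Q = np.einsum("o,oij->ij", u, B)
Q_sym = 0.5 * (Q + Q.T)  # only the symmetric part contributes to x^T Q x

# Eigendecomposition: u . y = sum_k eigvals[k] * (eigvecs[:, k] . x) ** 2
eigvals, eigvecs = np.linalg.eigh(Q_sym)
y_eig = np.sum(eigvals * (eigvecs.T @ x) ** 2)


def top_k_output(x, k):
    """Approximate u . y using the k eigenvectors with largest |eigenvalue|."""
    idx = np.argsort(-np.abs(eigvals))[:k]
    return np.sum(eigvals[idx] * (eigvecs[:, idx].T @ x) ** 2)


print(np.allclose(y_layer, y_tensor))
print(np.isclose(y_eig, y_layer[0]))
```

With trained weights, one would inspect the eigenvectors that have the largest-magnitude eigenvalues and use `top_k_output` to see how few of them are needed. Here the weights are random, so the spectrum carries no meaning beyond confirming the identity.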
Peer Reviews
Decision · ICLR 2025 Spotlight
ORIGINALITY: This work distinguishes itself from existing approaches to interpretable neural networks by using a unique characteristic of bilinear MLPs. While previous works have utilized bilinear MLP networks, this paper uniquely demonstrates how to exploit their inherent differences from other MLPs to achieve interpretable neural networks. SIGNIFICANCE: Interpretable AI is crucial for understanding how AI models encode information. This work offers a new framework for interpretable neural ne
This is a minor suggestion: This paper introduces the use of sparse autoencoders in Section 3.1. While the stated limitations offer justification for this approach, a more detailed explanation of the role of sparse autoencoders in Section 3.1 would enhance the clarity and comprehensiveness of the work.
Originality: While the ideas of using bilinear layers and eigenanalysis on the corresponding reduced tensor are borrowed from past work (in particular, “A technical note on bilinear layers for interpretability”), the precise demonstrations are new, to my knowledge. Clarity: The paper is well written and easy to follow (even for a reader like me, less familiar with this area of work). The Discussion section summarizes the contributions well, contextualizes the implications, and id
The illustrations in the submission are fairly reasonable demonstrations of the interpretability advantages of networks with bilinear layers. However, these ideas may not yet be mature, hence the lack of robust demonstrations across a broader range of tasks, models, and circuits. Showcasing such a range would significantly strengthen the influence of the work in the near future. Perhaps some of the existing problems can be solved with advances in discovering semantic “output di
Good exhibition of content in all sections, both setup and presentation of results. Effective choice of experiments with convincing examples and helpful visualization. The work is significant: it presents analysis using well-understood methods to produce precise, understandable, and actionable interpretations of models with competitive performance.
I did not notice much analysis using SVD; perhaps I missed it.
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Machine Learning and Data Classification
