Bilinear MLPs enable weight-based mechanistic interpretability

Michael T. Pearce; Thomas Dooms; Alice Rigg; Jose M. Oramas; Lee Sharkey

arXiv:2410.08417·cs.LG·June 26, 2025

TL;DR

This paper analyzes bilinear MLPs, a type of Gated Linear Unit (GLU) without element-wise nonlinearities, whose weights can be interpreted directly, and uses them to reveal how models compute across toy tasks, image classification, and language modeling.

Contribution

The paper demonstrates that bilinear MLPs allow for weight-based interpretability, providing a fully linear framework to analyze and understand neural network weights.
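As a sketch of why the framework is linear, the derivation below writes a bias-free bilinear layer as a quadratic form. The symbols $W$, $V$, $B$, $u$, and $Q$ are notation assumed here for illustration (the summary above does not fix any), with $w_a$ and $v_a$ denoting the $a$-th rows of $W$ and $V$ written as column vectors.

```latex
\begin{align}
g(x) &= (W x) \odot (V x), \\
g(x)_a &= \sum_{i,j} B_{aij}\, x_i x_j,
  \qquad B_{aij} = W_{ai} V_{aj}, \\
u^\top g(x) &= x^\top Q\, x,
  \qquad Q = \tfrac{1}{2} \sum_a u_a \left( w_a v_a^\top + v_a w_a^\top \right), \\
Q &= \sum_k \lambda_k\, q_k q_k^\top
  \;\Longrightarrow\;
  u^\top g(x) = \sum_k \lambda_k \left( q_k^\top x \right)^2 .
\end{align}
```

Because $Q$ is real and symmetric, its eigenvectors $q_k$ are orthonormal and its eigenvalues $\lambda_k$ are real, so each eigenvector contributes to the chosen output direction $u$ independently of the others.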

Findings

01

Bilinear MLPs achieve competitive performance without element-wise nonlinearities.

02

Spectral analysis reveals low-rank interpretable structures in weights.

03

Weight-based interpretability enables analysis of adversarial examples and overfitting.

Abstract

A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use…
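The tensor view in the abstract can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a bias-free bilinear layer of the form `(W @ x) * (V @ x)`, random weights, and a hypothetical output direction `u`; the names `bilinear_mlp` and `interaction_matrix` are illustrative only.

```python
import numpy as np

def bilinear_mlp(x, W, V):
    """Bilinear layer: element-wise product of two linear maps, no activation."""
    return (W @ x) * (V @ x)

def interaction_matrix(u, W, V):
    """Symmetric Q such that u @ bilinear_mlp(x, W, V) == x @ Q @ x."""
    # Contract the third-order tensor B[a, i, j] = W[a, i] * V[a, j] with u.
    Q = np.einsum("a,ai,aj->ij", u, W, V)
    # Only the symmetric part of Q contributes to a quadratic form.
    return 0.5 * (Q + Q.T)

rng = np.random.default_rng(0)
d_in, d_hidden = 10, 2                    # illustrative sizes (assumption)
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
u = rng.normal(size=d_hidden)             # hypothetical output direction
x = rng.normal(size=d_in)

Q = interaction_matrix(u, W, V)
direct = u @ bilinear_mlp(x, W, V)        # output read off along u
quadratic = x @ Q @ x                     # same value as a quadratic form

# Eigendecomposition: Q = sum_k lam[k] * outer(vecs[:, k], vecs[:, k]).
lam, vecs = np.linalg.eigh(Q)
spectral = np.sum(lam * (vecs.T @ x) ** 2)

# Q is a sum of d_hidden symmetrized rank-one terms, so rank(Q) <= 2 * d_hidden.
num_nonzero = int(np.sum(np.abs(lam) > 1e-9))
print(np.isclose(direct, quadratic), np.isclose(direct, spectral), num_nonzero)
```

The rank bound here is purely algebraic and comes from the random toy weights; the low-rank structure the abstract reports refers to trained models, which this sketch does not reproduce.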

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 4

Strengths

ORIGINALITY: This work distinguishes itself from existing approaches to interpretable neural networks by using a unique characteristic of bilinear MLPs. While previous works have utilized bilinear MLP networks, this paper uniquely demonstrates how to exploit their inherent differences from other MLPs to achieve interpretable neural networks. SIGNIFICANCE: Interpretable AI is crucial for understanding how AI models encode information. This work offers a new framework for interpretable neural ne

Weaknesses

This is a minor suggestion: This paper introduces the use of sparse autoencoders in Section 3.1. While the stated limitations offer justification for this approach, a more detailed explanation of the role of sparse autoencoders in Section 3.1 would enhance the clarity and comprehensiveness of the work.

Reviewer 02 · Rating 8 · Confidence 2

Strengths

Originality: While the ideas of using bilinear layers and of applying eigenanalysis to the corresponding reduced tensor are borrowed from past work (in particular, “A technical note on bilinear layers for interpretability”), the precise demonstrations are new, to my knowledge. Clarity: The paper is well written, making it easy to follow (even for a reader like me, less familiar with this area of work). The Discussion section summarizes the contributions well, contextualizes the implications, and id

Weaknesses

The illustrations in the submission are fairly reasonable demonstrations of the interpretability advantages of networks with bilinear layers. However, these ideas may not be mature enough, hence the lack of robust demonstrations across a broader range of tasks, models, and circuits. Being able to showcase such a range would significantly strengthen the influence of the work in the near future. Perhaps some of the existing problems can be solved with advances in discovering semantic “output di

Reviewer 03 · Rating 8 · Confidence 3

Strengths

Good exhibition of content in all sections, both setup and presentation of results. Effective choice of experiments with convincing examples and helpful visualization. The work is significant: it presents analysis using well-understood methods to produce precise, understandable, and actionable interpretations of models with competitive performance.

Weaknesses

I did not notice much analysis using SVD; perhaps I missed it.

Code & Models

Repositories

tdooms/bilinear-decomposition
jax · Official

Videos

Bilinear MLPs enable weight-based mechanistic interpretability · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Machine Learning and Data Classification