Prototype Transformer: Towards Language Model Architectures Interpretable by Design
Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz

TL;DR
ProtoT is a prototype-based transformer architecture that aims to make language models interpretable by design and to scale linearly with sequence length. Its prototypes offer a way to inspect the model's reasoning and to make targeted edits to its behavior, while performance remains competitive.
Contribution
The paper presents ProtoT, a prototype-based transformer architecture that is interpretable by design and scales linearly with sequence length, addressing the opacity and the quadratic cost in sequence length of standard self-attention models.
Findings
ProtoT captures nameable concepts during training.
ProtoT scales linearly with sequence length.
ProtoT performs well on text generation and downstream tasks.
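To give some intuition for the linear-scaling finding, the sketch below shows one way a causal prototype mixer could process a sequence in a single pass. It is an illustrative assumption, not the paper's actual formulation: the function name `prototype_mixer`, the softmax routing, and the `decay` parameter are all hypothetical. Each token is routed to a fixed set of prototypes, writes into an exponentially decayed per-prototype state, and then reads back from that state, so the cost grows with (sequence length × number of prototypes) rather than with the square of the sequence length.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prototype_mixer(tokens, prototypes, decay=0.9):
    """Hypothetical causal prototype mixer (illustrative; not the paper's equation).

    tokens:     list of T vectors of length d
    prototypes: list of K parameter vectors of length d
    decay:      assumed exponential discount applied to each prototype state
    Returns (outputs, routing); routing[t][k] is the weight of prototype k at step t.
    """
    dim = len(prototypes[0])
    states = [[0.0] * dim for _ in prototypes]  # one running state per prototype
    outputs, routing = [], []
    for x in tokens:  # single left-to-right pass: O(T * K * d)
        weights = softmax([dot(x, p) for p in prototypes])
        # Write: the token updates each prototype state, older content decays.
        for k, w in enumerate(weights):
            states[k] = [decay * s + w * xi for s, xi in zip(states[k], x)]
        # Read: the token gathers from the states with the same routing weights.
        y = [sum(w * states[k][i] for k, w in enumerate(weights)) for i in range(dim)]
        outputs.append(y)
        routing.append(weights)
    return outputs, routing


# Toy example: two prototypes aligned with the two input axes.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
outputs, routing = prototype_mixer(tokens, prototypes)
```

In this toy setup the `routing` weights are the part that could be inspected: a token that routes mostly to one prototype gives a direct, readable signal about which prototype it is communicating with. Whether ProtoT's routing takes this form is not stated in the text above.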
Abstract
While state-of-the-art language models (LMs) surpass the vast majority of humans in certain domains, their reasoning remains largely opaque, undermining trust in their output. Furthermore, while autoregressive LMs can output explicit reasoning, their true reasoning process is opaque, which introduces risks like deception and hallucination. In this work, we introduce the Prototype Transformer (ProtoT) -- an autoregressive LM architecture based on prototypes (parameter vectors), posed as an alternative to the standard self-attention-based transformers. ProtoT works by means of two-way communication between the input sequence and the prototypes, and we show that this leads to the prototypes automatically capturing nameable concepts (e.g. "woman") during training. They provide the potential to interpret the model's reasoning and allow for targeted edits of its behavior. Furthermore, by…
Peer Reviews
Decision: Submitted to ICLR 2026
The paper pursues an ambitious and timely goal: integrating interpretability directly into the LM architecture, rather than relying on post-hoc analyses. The introduction of prototype vectors as explicit representational channels is conceptually elegant, drawing inspiration from case-based reasoning and slot-attention architectures, and potentially bridging feature disentanglement and mechanistic interpretability.
- ProtoT essentially replaces self-attention with a learned prototype mixer that uses fixed latent vectors and exponential decay. While this is a clean design, similar ideas have been explored extensively in slot attention, Perceiver IO/AR, and prototype networks in both NLP and vision. The main mathematical formulation (Eq. 1) is a direct adaptation of cross-attention with time-discounting, without introducing new theoretical mechanisms for interpretability. The paper’s claim that prototypes “c
1. Introducing a prototype mechanism into black-box models is an interesting direction. It enhances interpretability and allows us to explicitly observe and selectively modify the concepts learned by the model.
2. The paper examines model robustness from multiple intervention perspectives and provides interpretability insights based on the routing mechanism, offering a more transparent view of the model’s internal reasoning process.
1. There is a noticeable performance gap with top-tier models. Except for RTE and WNLI, ProtoT still shows a significant performance gap compared to LLaMA, indicating substantial room for improvement in terms of model expressiveness and generalization ability.
2. Insufficient baseline comparisons: The experimental section lacks a systematic comparison with a broader range of mainstream methods, primarily comparing with relatively weaker models such as LLaMA.
3. Limited depth of interpretabili
- The aim of creating an inherently interpretable language model is very interesting and a difficult and under-explored problem. Prototype learning is an intuitive way to go about this, particularly given its success in computer vision.
- The authors perform extensive training experiments, including very interesting scalability analysis.
- Furthermore, the evaluation comparing ProtoTs to Llama/SSMs, which finds improved robustness and on-par benchmark performance despite worse perplexity, is also quit
- It is unclear to me how well the ProtoTs function as language models. While the perplexity is clearly worse, examples of text completions and general language modeling utility would be helpful to provide intuition as to how meaningful the difference is.
- How do the authors test the actual interpretability of the learned prototypes? An automated interpretability evaluation pipeline (for example, see SAEBench [1]) could be a way to do this.
- The downstream utility of ProtoTs, and more gene
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
