Concept Bottleneck Language Models For Protein Design
Aya Abdelsalam Ismail, Tuomas Oikarinen, Amy Wang, Julius Adebayo, Samuel Stanton, Taylor Joren, Joseph Kleinhenz, Allen Goodman, Héctor Corrada Bravo, Kyunghyun Cho, Nathan C. Frey

TL;DR
This paper introduces Concept Bottleneck Protein Language Models (CB-pLM), which enable interpretable, controllable protein generation without sacrificing performance, and scale from 24 million to 3 billion parameters.
Contribution
The paper presents the first scalable, generative concept bottleneck language models for proteins, combining interpretability, control, and high performance.
Findings
Achieved a 3x larger change in desired concept values than baselines when intervening on concepts.
Maintained pre-training perplexity and downstream task performance comparable to traditional masked protein language models.
Scaled models up to 3 billion parameters, the largest of their kind.
Abstract
We introduce Concept Bottleneck Protein Language Models (CB-pLM), a generative masked language model with a layer where each neuron corresponds to an interpretable concept. Our architecture offers three key benefits: i) Control: We can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in desired concept values compared to baselines. ii) Interpretability: A linear mapping between concept values and predicted tokens allows transparent analysis of the model's decision-making process. iii) Debugging: This transparency facilitates easy debugging of trained models. Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. While adaptable to any language model, we focus on masked…
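The control and interpretability claims in the abstract both rest on the linear mapping from concept values to token predictions. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the layer sizes, the random weights `W_c` and `W_out`, and the helper names `concept_values`, `token_logits`, and `intervene` are all hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
HIDDEN, N_CONCEPTS, VOCAB = 16, 3, 20

# Hypothetical parameters (random here, learned in a real model):
# W_c maps a hidden state to concept values, one bottleneck neuron per concept;
# W_out maps concept values linearly to token logits.
W_c = rng.normal(size=(HIDDEN, N_CONCEPTS))
W_out = rng.normal(size=(N_CONCEPTS, VOCAB))


def concept_values(hidden):
    """Bottleneck layer: each output neuron is read as one concept value."""
    return hidden @ W_c


def token_logits(concepts):
    """Linear decoder from concept values to token logits."""
    return concepts @ W_out


def intervene(concepts, index, value):
    """Return a copy of the concept vector with one concept overwritten."""
    edited = concepts.copy()
    edited[index] = value
    return edited


# Stand-in for an encoder output at one masked position.
hidden = rng.normal(size=HIDDEN)

c = concept_values(hidden)
c_edit = intervene(c, index=1, value=c[1] + 2.0)

# Because the decoder is linear, the logit change equals 2.0 * W_out[1]:
# row 1 of W_out shows how concept 1 pushes each token up or down.
logit_shift = token_logits(c_edit) - token_logits(c)
```

In this toy setting, reading a row of `W_out` shows which tokens a concept favors, and overwriting one entry of the concept vector shifts the logits by a known amount. A real model would learn these weights and would likely include components not shown here.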
Peer Reviews
Decision: ICLR 2025 Poster
I believe this CB-pLM is the first model which applies concept bottleneck models to protein design, the authors show with strong evidence that CB-pLM effectively shifts concept distributions better than existing approaches such as C-pLMs and CC-pLMs. The paper is well written, and does a good job at showcasing the in-silico results for conditioning.
A limitation of this approach is that biophysical concepts need to be explicitly defined for controlling design. The argument that a CB-pLM is more interpretable because of the training approach is less convincing, as the results appear to be evidence that the model learns the provided concepts correctly, and doesn’t expand to a new interpretation of known biological properties. Concepts used in the model are also straightforward to calculate and interpret the outputs of with bioinformatics tool
* Adapting concept-bottleneck generative models to masked language models is an interesting and novel idea.
* Using integrated gradients to find tokens to resample to is an interesting and novel idea.
* Writing is generally easy to follow.
* This work focuses on improving controllability, interpretability, and debuggability of language models (and protein sequence in particular as an example) which is an important topic with large potential impact.
Method (considering a broader application of the proposed method to general language/sequence modeling, as authors mentioned in the abstract)
M.1 The model proposed is only a conditional generative model given a partial sequence rather than a full generative model of an entire sequence. Classical masked language models (like BERT) do not model the joint distribution over sequences [1]. There are ways to derive joints from MLMs by making their conditionals consistent with one another (see, e.g
Developing and testing new models for protein modeling is an important problem. In particular, having methods that can be interpretable and that can allow better control for drug design is a high impact project.
The paper contains many typos, it is meandering and unfocused, and generally is very hard to read (with many repetitions and vaguely defined concepts). For example, what does “debuggable model” mean? This is stated as one of the main contributions of the proposed architecture, hence the reader expects it to be clearly defined and thoroughly assessed through empirical validation. In the current version of the manuscript this is not properly defined and hence not convincingly validated with exp
Taxonomy
Topics: Microbial Metabolic Engineering and Bioproduction · Machine Learning in Bioinformatics · Biomedical Text Mining and Ontologies
