Concept Bottleneck Language Models For Protein Design

Aya Abdelsalam Ismail; Tuomas Oikarinen; Amy Wang; Julius Adebayo; Samuel Stanton; Taylor Joren; Joseph Kleinhenz; Allen Goodman; Héctor Corrada Bravo; Kyunghyun Cho; Nathan C. Frey

arXiv:2411.06090·cs.LG·December 12, 2024·2 cites

TL;DR

This paper introduces Concept Bottleneck Protein Language Models (CB-pLM), which enable interpretable, controllable protein generation without sacrificing performance, and scale from 24 million to 3 billion parameters.

Contribution

The paper presents the first scalable, generative concept bottleneck language models for proteins, combining interpretability, control, and high performance.

Findings

01. Achieved a 3x larger change in desired concept values than baselines when intervening on concepts.

02. Maintained pre-training perplexity and downstream task performance comparable to traditional masked protein language models.

03. Scaled models up to 3 billion parameters, the largest of their kind.

Abstract

We introduce Concept Bottleneck Protein Language Models (CB-pLM), a generative masked language model with a layer where each neuron corresponds to an interpretable concept. Our architecture offers three key benefits: i) Control: We can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in desired concept values compared to baselines. ii) Interpretability: A linear mapping between concept values and predicted tokens allows transparent analysis of the model's decision-making process. iii) Debugging: This transparency facilitates easy debugging of trained models. Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. While adaptable to any language model, we focus on masked…
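The control and interpretability claims in the abstract both rest on the linear mapping from concept values to token predictions. The toy sketch below illustrates that idea only: the concept names, vocabulary, weights, and helper functions are all hypothetical and are not taken from the CB-pLM code. Because the decoder is linear, each concept's contribution to a token logit is simply weight times concept value, and overwriting one concept value shifts the predicted token distribution in a predictable way.

```python
# Illustrative sketch only: toy sizes and made-up weights, not the CB-pLM implementation.
from math import exp

CONCEPTS = ["hydrophobicity", "charge"]  # hypothetical concept names
TOKENS = ["A", "K", "L"]                  # toy amino-acid vocabulary

# Linear decoder: W[t][c] is the weight from concept c to token t (assumed values).
W = [
    [0.5, -0.2],   # A
    [-1.0, 1.5],   # K
    [1.2, -0.8],   # L
]


def token_logits(concept_values):
    """Token logits as a linear function of the concept values."""
    return [sum(w * c for w, c in zip(row, concept_values)) for row in W]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contributions(concept_values):
    """Per-concept contribution to each token logit: weight * concept value."""
    return [[w * c for w, c in zip(row, concept_values)] for row in W]


def intervene(concept_values, name, new_value):
    """Return a copy of the concept vector with one concept overwritten."""
    edited = list(concept_values)
    edited[CONCEPTS.index(name)] = new_value
    return edited


predicted = [0.2, 0.1]                         # concept values a model might predict
edited = intervene(predicted, "charge", 2.0)   # push the "charge" concept up

p_before = softmax(token_logits(predicted))
p_after = softmax(token_logits(edited))

for tok, before, after in zip(TOKENS, p_before, p_after):
    print(f"{tok}: {before:.3f} -> {after:.3f}")
```

In this toy setup, raising the "charge" concept increases the probability of the token with the largest positive weight on that concept, and `contributions` shows exactly how much each concept added to each logit.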

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

I believe CB-pLM is the first model to apply concept bottleneck models to protein design. The authors show with strong evidence that CB-pLM shifts concept distributions more effectively than existing approaches such as C-pLMs and CC-pLMs. The paper is well written and does a good job of showcasing the in-silico results for conditioning.

Weaknesses

A limitation of this approach is that biophysical concepts need to be explicitly defined for controlling design. The argument that a CB-pLM is more interpretable because of the training approach is less convincing, as the results appear to be evidence that the model learns the provided concepts correctly, and doesn’t expand to a new interpretation of known biological properties. Concepts used in the model are also straightforward to calculate and interpret the outputs of with bioinformatics tool

Reviewer 02 · Rating 6 · Confidence 2

Strengths

* Adapting concept-bottleneck generative models to masked language models is an interesting and novel idea.
* Using integrated gradients to find tokens to resample is an interesting and novel idea.
* Writing is generally easy to follow.
* This work focuses on improving the controllability, interpretability, and debuggability of language models (with protein sequences as a particular example), which is an important topic with large potential impact.

Weaknesses

Method (considering a broader application of the proposed method to general language/sequence modeling, as authors mentioned in the abstract) M.1 The model proposed is only a conditional generative model given a partial sequence rather than a full generative model of an entire sequence. Classical masked language models (like BERT) do not model the joint distribution over sequences [1]. There are ways to derive joints from MLMs by making their conditionals consistent with one another (see, e.g

Reviewer 03 · Rating 5 · Confidence 2

Strengths

Developing and testing new models for protein modeling is an important problem. In particular, having methods that can be interpretable and that can allow better control for drug design is a high impact project.

Weaknesses

The paper contains many typos, is meandering and unfocused, and is generally very hard to read (with many repetitions and vaguely defined concepts). For example, what does “debuggable model” mean? This is stated as one of the main contributions of the proposed architecture, hence the reader expects it to be clearly defined and thoroughly assessed through empirical validation. In the current version of the manuscript this is not properly defined and hence not convincingly validated with exp

Code & Models

Repositories

prescient-design/lobster · PyTorch · Official

Videos

Concept Bottleneck Language Models For Protein Design · SlidesLive

Taxonomy

Topics: Microbial Metabolic Engineering and Bioproduction · Machine Learning in Bioinformatics · Biomedical Text Mining and Ontologies

Methods: Focus