Learning the Language of Protein Structure
Benoit Gaujac, Jérémie Donà, Liviu Copoiu, Timothy Atkinson, Thomas Pierrot, Thomas D. Barrett

TL;DR
This paper introduces a vector-quantized autoencoder that converts continuous 3D protein structures into discrete tokens. The tokens support high-fidelity reconstruction and allow standard sequence models to generate new structures, which is useful for computational protein design.
Contribution
The paper presents a novel vector-quantized autoencoder that tokenizes protein structures into discrete representations, facilitating improved modeling and generation of protein structures.
Findings
High-fidelity reconstructions with backbone RMSD of approximately 1-5 Å
A GPT model trained on codebooks can generate diverse protein structures
The method bridges the gap between sequence and structure modeling
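The RMSD figure quoted in the findings is a distance between a reconstructed backbone and the original after optimal superposition. The following is a minimal sketch of that metric using the standard Kabsch alignment, assuming both structures are given as matched (N, 3) coordinate arrays in Å. The function name `kabsch_rmsd` and the toy coordinates are illustrative assumptions, not taken from the paper, and the authors' exact evaluation protocol may differ.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between point sets P and Q (both (N, 3)) after optimal rigid alignment."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both point sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Cross-covariance matrix and its SVD.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so the result is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the remaining deviation.
    diff = Pc @ R.T - Qc
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Toy example (hypothetical coordinates): Q is a rotated, shifted copy of P.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([1.0, -2.0, 3.0])
print(kabsch_rmsd(P, Q))  # close to zero, since Q differs from P only by a rigid motion
```

Because the alignment removes rotation and translation first, the metric only penalizes genuine differences in shape.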
Abstract
Representation learning and de novo generation of proteins are pivotal computational biology tasks. Whilst natural language processing (NLP) techniques have proven highly effective for protein sequence modelling, structure modelling presents a complex challenge, primarily due to its continuous and three-dimensional nature. Motivated by this discrepancy, we introduce an approach using a vector-quantized autoencoder that effectively tokenizes protein structures into discrete representations. This method transforms the continuous, complex space of protein structures into a manageable, discrete format with a codebook ranging from 4096 to 64000 tokens, achieving high-fidelity reconstructions with backbone root mean square deviations (RMSD) of approximately 1-5 Å. To demonstrate the efficacy of our learned representations, we show that a simple GPT model trained on our codebooks can…
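To make the tokenization step concrete, here is a minimal sketch of the generic vector-quantization lookup: each continuous per-residue latent vector is replaced by the index of its nearest codebook entry, and that index is the discrete token. Everything here (the function name `quantize`, the 2-dimensional toy codebook, the latent values) is an illustrative assumption. The reviews on this page indicate the paper may use a different ("FS") quantizer, so this shows the general idea rather than the authors' implementation.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (n, d) array of continuous encoder outputs (hypothetical).
    codebook: (k, d) array of code vectors.
    Returns (tokens, quantized): integer indices of shape (n,) and the
    corresponding code vectors of shape (n, d).
    """
    latents = np.asarray(latents, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every latent to every code: shape (n, k).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

# Toy codebook with 4 entries (the paper reports 4096 to 64000 tokens).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 0.8], [0.2, 0.9]])
tokens, quantized = quantize(latents, codebook)
print(tokens.tolist())  # [0, 3, 2]
```

The resulting integer sequence is what a GPT-style model would be trained on, and a decoder would map the quantized vectors back to 3D coordinates.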
Peer Reviews
Decision · Submitted to ICLR 2025
1. The problem is very well-motivated: Tokenizing 3D structures allows for multi-modal integration in downstream language models.
2. To the best of my knowledge, autoencoder-based tokenizers are pretty novel in the field of computational protein biology.
3. Experiments, though limited, are sufficient to demonstrate that their tokens can be used for both reconstruction and downstream language modeling.
1. An ablation study comparing the FS quantization with VQ would help motivate the architectural choice.
2. You can include an ablation study demonstrating the importance of invariance in your encoder architecture. Consider trying a non-invariant architecture and see how that impacts downstream language modeling.
3. The downsampling of the sequence makes it harder for the tokenizer to reconstruct the structure, but also makes it easier for the language model to learn. It would be great to demon
* Structure tokenization is an important area of research, setting the foundational work necessary for many downstream uses.
* Considering the clustered nature of biological data during training is a good practice that should be more commonly adopted in the field
* Investigating the optimal codebook size is useful both for understanding proteins and for building similar tools
A number of structure tokenization methods have been introduced, including all-atom [1,2] and backbone-only [3,4] ones. These should at minimum be cited, and preferably compared against as baselines. It's certainly not fair to discount the novelty of a piece of work because other works have since been preprinted (I believe a version of this work was preprinted earlier than some of the others listed below). However, there are many open questions that authors could dig into, such as:
* B
The paper makes a clear and strong case for discretizing protein backbone structures in order to use discrete sequence models to analyze and generate protein backbones. While the individual parts (discretizing protein structure, generating protein structures, using sequence modeling architectures on discretized protein structures) are present in previous work, this work is unique in its focus, thoroughness, and attention to detail on the quality of the discretization and reconstruction. The main
The two biggest areas for improvement are in explaining the paper's significance in light of the (missing) related work in this area and on the completeness of the ablations and metrics reported. I think this could be a really strong paper if most of these weaknesses are addressed.

## Related work
The authors should mention and describe how their work is different from [ProTokens](https://www.biorxiv.org/content/10.1101/2023.11.27.568722v4.abstract), which also aims to discretize protein stru
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Natural Language Processing Techniques · Machine Learning in Bioinformatics
Methods: Attention Is All You Need · Dropout · Dense Connections · Softmax · Layer Normalization · Cosine Annealing · Discriminative Fine-Tuning · Attention Dropout · Linear Layer
