Adaptive Protein Tokenization
Rohit Dilip, Ayush Varshney, David Van Valen

TL;DR
This paper introduces a global, adaptive tokenization method for protein structures. It improves performance on generative and representation tasks and supports applications such as protein shrinking and affinity maturation.
Contribution
The paper presents a global tokenization approach for protein structures that overcomes limitations of local methods, improving generative and representation performance and allowing task-specific adaptation.
Findings
Matches or outperforms existing local tokenizers on reconstruction and generative tasks.
Allows the information content of a tokenized sequence to be adapted at inference time, improving designability.
Supports zero-shot protein shrinking and affinity maturation.
Abstract
Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing…
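The coarse-to-fine idea in the abstract, where successive tokens add increasing levels of detail to a global representation, can be illustrated with a toy sketch. This is not the paper's tokenizer: it swaps in a fixed orthonormal cosine (DCT-II) basis as a stand-in for a learned model, uses continuous coefficients rather than discrete tokens, and encodes a synthetic helix rather than a real protein. All function names (`dct_basis`, `encode`, `decode`, `rmsd`) are hypothetical. The only point is that every "token" describes the whole chain, and decoding a longer prefix of the sequence yields a more detailed reconstruction.

```python
import math

def dct_basis(n):
    """Orthonormal DCT-II basis: row k is the k-th cosine mode over n positions."""
    basis = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return basis

def encode(coords, basis):
    """Project x, y, z onto each mode; 'token' k holds the three k-th coefficients."""
    n = len(coords)
    return [tuple(sum(b[i] * coords[i][a] for i in range(n)) for a in range(3))
            for b in basis]

def decode(tokens, basis, n):
    """Reconstruct n positions from a prefix of the token sequence."""
    out = [[0.0, 0.0, 0.0] for _ in range(n)]
    for tok, b in zip(tokens, basis):  # zip stops at the end of the prefix
        for i in range(n):
            for a in range(3):
                out[i][a] += tok[a] * b[i]
    return out

def rmsd(p, q):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    total = sum((pi[a] - qi[a]) ** 2 for pi, qi in zip(p, q) for a in range(3))
    return math.sqrt(total / len(p))

# Toy "backbone": a synthetic helix (assumption; not real protein data).
n = 32
coords = [(math.cos(0.6 * i), math.sin(0.6 * i), 0.15 * i) for i in range(n)]

basis = dct_basis(n)
tokens = encode(coords, basis)

# Reconstruction error when only the first k global "tokens" are decoded.
errors = [rmsd(coords, decode(tokens[:k], basis, n)) for k in range(1, n + 1)]
```

Because the basis is orthonormal, `errors` never increases as more tokens are kept, and it reaches zero (up to floating-point error) when all `n` are used. Truncating the sequence is therefore a crude analogue of adapting a tokenized sequence's information content to the task at hand.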
Taxonomy
Topics: Protein Structure and Dynamics · Machine Learning in Bioinformatics · Genomics and Chromatin Dynamics
