From Static Structures to Ensembles: Studying and Harnessing Protein Structure Tokenization
Zijing Liu, Bin Feng, He Cao, Yu Li

TL;DR
This paper investigates protein structure tokenization, revealing semantic redundancy in structural vocabularies and introducing a simple method to generate diverse conformational ensembles, thereby advancing protein flexibility modeling.
Contribution
It uncovers the semantic redundancy in structural tokens and proposes a lightweight 'synonym swap' method to simulate protein conformational diversity.
Findings
Structural tokens exhibit significant semantic redundancy.
Pre-trained sequence embeddings are crucial for effective structure prediction.
The synonym swap method accurately models protein flexibility.
Abstract
Protein structure tokenization converts 3D structures into discrete or vectorized representations, enabling the integration of structural and sequence data. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. In this work, we first demonstrate that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre-trained sequence embeddings to bridge the semantic gap between the sequence and structural "languages". The analysis of the structural vocabulary itself then reveals significant semantic redundancy, where multiple distinct tokens correspond to nearly identical local geometries, acting as "structural synonyms". This redundancy, rather than being a flaw, can be exploited with a simple "synonym swap" strategy to generate diverse conformational…
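The "synonym swap" idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy codebook, the Euclidean distance threshold `max_dist`, the per-token `swap_prob`, and the function names `build_synonyms` and `synonym_swap` are all assumptions made for illustration. The sketch treats two tokens as synonyms when their (hypothetical) geometry descriptors lie close together, then samples several token sequences by randomly replacing tokens with their synonyms.

```python
import math
import random


def build_synonyms(codebook, max_dist):
    """Map each token id to the ids of other tokens whose descriptor lies
    within max_dist (Euclidean). The threshold is an illustrative assumption."""
    synonyms = {i: [] for i in range(len(codebook))}
    for i, u in enumerate(codebook):
        for j, v in enumerate(codebook):
            if i != j and math.dist(u, v) <= max_dist:
                synonyms[i].append(j)
    return synonyms


def synonym_swap(tokens, synonyms, swap_prob, rng):
    """Return a copy of tokens in which each token that has synonyms is
    replaced by a randomly chosen synonym with probability swap_prob."""
    swapped = []
    for tok in tokens:
        options = synonyms.get(tok, [])
        if options and rng.random() < swap_prob:
            swapped.append(rng.choice(options))
        else:
            swapped.append(tok)
    return swapped


# Toy codebook of hypothetical 2-D local-geometry descriptors.
# Tokens 0/1 and 2/3 are near-duplicates; token 4 has no synonym.
codebook = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 1.04), (5.0, 5.0)]
synonyms = build_synonyms(codebook, max_dist=0.1)

tokens = [0, 2, 4, 1, 3]   # one tokenized structure (made-up example)
rng = random.Random(0)      # seeded for reproducibility
ensemble = [synonym_swap(tokens, synonyms, swap_prob=0.5, rng=rng)
            for _ in range(4)]

for variant in ensemble:
    print(variant)
```

In a real pipeline each sampled token sequence would presumably be passed through the tokenizer's decoder to obtain 3D coordinates; that step is omitted here because the decoder interface is not described in this summary.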
Taxonomy
Topics: Protein Structure and Dynamics · Bioinformatics and Genomic Networks · Machine Learning in Bioinformatics
