FoldToken2: Learning compact, invariant and generative protein structure language
Zhangyang Gao, Cheng Tan, Stan Z. Li

TL;DR
FoldToken2 introduces a discrete token representation of protein structures that is compact, invariant, and recoverable: the tokens can be decoded back into the original structure with high fidelity.
Contribution
FoldToken2 improves on FoldToken1 in three components (the invariant structure encoder, the vector-quantized compressor, and the equivariant structure decoder), enabling discrete representations of both single-chain and multi-chain proteins.
Findings
Outperforms FoldToken1 by 20% in TMScore on the protein structure reconstruction task
Outperforms FoldToken1 by 81% in RMSD on the same task
Effective on both single-chain and multi-chain structures
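For readers unfamiliar with the RMSD metric cited above, here is a minimal sketch of how reconstruction RMSD is commonly computed: the two coordinate sets are first superimposed with the Kabsch algorithm, then the root-mean-square deviation is taken. This summary does not state the paper's exact evaluation protocol, so the alignment step and the function name `kabsch_rmsd` are assumptions made for illustration.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition.

    Illustrative helper (not from the paper): assumes P and Q list the same
    atoms in the same order, e.g. one C-alpha atom per residue.
    """
    # Remove translation by centering both point sets.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation (Kabsch).
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q (rows are points, so multiply by R transposed).
    P_aligned = P0 @ R.T
    return float(np.sqrt(((P_aligned - Q0) ** 2).sum(axis=1).mean()))

# Usage: a rotated and shifted copy of a structure has RMSD close to zero.
rng = np.random.default_rng(0)
coords = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
moved = coords @ Rz.T + np.array([1.0, -2.0, 3.0])
rmsd_same = kabsch_rmsd(coords, moved)
```

Because the superposition removes rotation and translation, the score reflects only differences in shape, which is why it suits comparing a decoded structure against the original.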
Abstract
The equivariant nature of 3D coordinates has posed long-standing challenges in protein structure representation learning, alignment, and generation. Can we create a compact and invariant language that equivalently represents protein structures? Towards this goal, we propose FoldToken2 to convert equivariant structures into discrete tokens while maintaining the recoverability of the original structures. From FoldToken1 to FoldToken2, we improve three key components: (1) the invariant structure encoder, (2) the vector-quantized compressor, and (3) the equivariant structure decoder. We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms the previous FoldToken1 by 20% in TMScore and 81% in RMSD. FoldToken2 is probably the first method that works well for quantizing both single-chain and multi-chain protein structures. We believe that FoldToken2 will inspire further…
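To make the "vector-quantized compressor" named in the abstract concrete, here is a minimal sketch of generic nearest-neighbor vector quantization: each continuous latent vector is replaced by the index of its closest codebook entry, and that index serves as the discrete token. It is not FoldToken2's actual quantizer, whose details are not given in this summary. The codebook size, latent dimension, and the function name `quantize` are hypothetical.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (N, D) array of continuous per-residue embeddings (hypothetical).
    codebook: (K, D) array of code vectors (hypothetical).
    Returns the (N,) token ids and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance between every latent and every code: (N, K).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    token_ids = d2.argmin(axis=1)
    return token_ids, codebook[token_ids]

# Usage with a toy, deterministic codebook of 16 codes in 4 dimensions.
codebook = np.arange(16 * 4, dtype=float).reshape(16, 4)
# Latents are slightly perturbed copies of codes 3, 7, 7 and 0.
latents = codebook[[3, 7, 7, 0]] + 0.1
token_ids, quantized = quantize(latents, codebook)
print(token_ids)  # -> [3 7 7 0]
```

In a full pipeline of this kind, an encoder would produce the latents and a decoder would map the quantized vectors back to 3D coordinates. Only the lookup step is shown here.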
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Biomedical Text Mining and Ontologies
