FoldToken2: Learning compact, invariant and generative protein structure language

Zhangyang Gao; Cheng Tan; Stan Z. Li

arXiv:2407.00050 · q-bio.BM · July 2, 2024

Open Access

TL;DR

FoldToken2 introduces a discrete, token-based representation of protein structures that is compact, invariant, and reconstructable with high fidelity.

Contribution

Relative to FoldToken1, FoldToken2 improves the invariant structure encoder, the vector-quantized compressor, and the structure decoder, enabling effective representation of both single-chain and multi-chain proteins.

Findings

01. Outperforms FoldToken1 by 20% in TMScore

02. Achieves an 81% reduction in RMSD relative to FoldToken1

03. Effective on both single-chain and multi-chain structures
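
For readers unfamiliar with the RMSD metric cited in the findings, the following is a minimal sketch of how a root-mean-square deviation and a relative reduction can be computed. It assumes the two structures are already superimposed (evaluations typically superimpose structures first, which is omitted here), and it assumes that a percentage "reduction" means a relative decrease against a baseline. The function names `rmsd` and `relative_reduction` and the toy numbers are hypothetical and do not come from the paper.

```python
import math


def rmsd(reference, predicted):
    """Root-mean-square deviation between two equally long lists of 3D points.

    Assumption: the structures are already superimposed; no alignment is done here.
    """
    if len(reference) != len(predicted):
        raise ValueError("structures must have the same number of points")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Toy data (not from the paper): every point is displaced by 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
toy_rmsd = rmsd(reference, predicted)

# Toy baseline of 2.0 versus an improved value of 0.5.
toy_reduction = relative_reduction(2.0, 0.5)
```

Because lower RMSD is better, a larger relative reduction indicates a closer reconstruction.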

Abstract

The equivariant nature of 3D coordinates has posed long-standing challenges in protein structure representation learning, alignment, and generation. Can we create a compact and invariant language that equivalently represents protein structures? Towards this goal, we propose FoldToken2 to transfer equivariant structures into discrete tokens while maintaining the recoverability of the original structures. From FoldToken1 to FoldToken2, we improve three key components: (1) invariant structure encoder, (2) vector-quantized compressor, and (3) equivariant structure decoder. We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms the previous FoldToken1 by 20% in TMScore and 81% in RMSD. FoldToken2 is probably the first method that works well on both single-chain and multi-chain protein structure quantization. We believe that FoldToken2 will inspire further…
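
To make the idea of turning coordinates into invariant discrete tokens concrete, here is a minimal sketch, not the FoldToken2 architecture. It assumes a hand-written feature (three distances around each interior point, which do not change under rotation or translation) in place of a learned invariant encoder, and a fixed two-entry codebook in place of a learned vector-quantized compressor. All names (`local_features`, `quantize`, `tokenize`, `CODEBOOK`) and the toy coordinates are hypothetical.

```python
import math


def local_features(coords):
    """Per interior point i: (d(i-1, i), d(i, i+1), d(i-1, i+1)).

    Distances are unchanged by rotation and translation, so the features are invariant.
    """
    return [
        (
            math.dist(coords[i - 1], coords[i]),
            math.dist(coords[i], coords[i + 1]),
            math.dist(coords[i - 1], coords[i + 1]),
        )
        for i in range(1, len(coords) - 1)
    ]


def quantize(vector, codebook):
    """Index of the nearest codebook entry under squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])),
    )


def tokenize(coords, codebook):
    """Map a 3D point chain to a list of discrete token ids."""
    return [quantize(f, codebook) for f in local_features(coords)]


def rotate_z_and_shift(coords, angle, shift):
    """Rigid motion: rotate about the z axis, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


# Hypothetical codebook: entry 0 is a "bent" local geometry, entry 1 a "straight" one.
CODEBOOK = [(3.8, 3.8, 5.4), (3.8, 3.8, 7.6)]

# Toy chain with two right-angle turns followed by a straight segment.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (7.6, 3.8, 0.0), (11.4, 3.8, 0.0)]

tokens = tokenize(chain, CODEBOOK)
moved_tokens = tokenize(rotate_z_and_shift(chain, 0.7, (5.0, -2.0, 9.0)), CODEBOOK)
```

In this toy setup the rotated and shifted chain yields the same token sequence as the original, which is the property the abstract refers to as an invariant language. Recovering coordinates from tokens (the decoder) is not covered by this sketch.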

Taxonomy

Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Biomedical Text Mining and Ontologies