ENSEMBITS: an alphabet of protein conformational ensembles
Kaiwen Shi, Carlos Oliver

TL;DR
Ensembits is a tokenizer for protein conformational ensembles that captures dynamic motions. It outperforms related methods on RMSF prediction, matches or exceeds static structure tokenizers on EC, GO, and binding site prediction, and enables dynamic information to be integrated into protein language models.
Contribution
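Introduces Ensembits, the first tokenizer for protein conformational ensembles. It addresses three challenges in encoding dynamics: deriving informative geometric descriptors across conformations, permutation-invariant encoding of variable-size ensembles, and sparsity in dynamics data.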
Findings
Outperforms related methods on RMSF prediction.
Matches or exceeds static tokenizers on EC, GO, and binding site prediction.
Predicts dynamics from a single structure, which mitigates the sparsity of dynamics data.
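For readers unfamiliar with the RMSF (root-mean-square fluctuation) metric named in the findings, the following is a minimal sketch of how per-residue RMSF is commonly computed from an ensemble. It assumes the conformations are already superimposed on a common reference and uses one coordinate per residue. The function name `rmsf` and the toy coordinates are illustrative assumptions, not taken from the paper.

```python
import math

def rmsf(ensemble):
    """Per-residue RMSF over an ensemble of pre-aligned conformations.

    ensemble: list of conformations, each a list of (x, y, z) tuples,
    one tuple per residue. Alignment is assumed to be done beforehand.
    """
    n_conf = len(ensemble)
    n_res = len(ensemble[0])
    values = []
    for i in range(n_res):
        # Mean position of residue i across all conformations.
        mean = [sum(conf[i][k] for conf in ensemble) / n_conf for k in range(3)]
        # Mean squared deviation of residue i from its mean position.
        msd = sum(
            sum((conf[i][k] - mean[k]) ** 2 for k in range(3))
            for conf in ensemble
        ) / n_conf
        values.append(math.sqrt(msd))
    return values

# Toy ensemble (made-up numbers): residue 0 is rigid, residue 1 moves along x.
toy = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(rmsf(toy))  # [0.0, 1.0]
```

Because the result depends only on means over conformations, it does not change when the conformations are reordered, which is the same permutation invariance an ensemble encoder needs.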
Abstract
Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs only capture the local geometry of static structures and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits addresses challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariant encoding of variable-size ensembles, and conquering sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test on per-residue motion…
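The abstract mentions a Residual VQ-VAE. As a rough illustration of generic residual vector quantization (the lookup step only), the sketch below quantizes a vector against a sequence of codebooks, where each level encodes what the previous levels left over. The codebooks, the vector, and the function names are hypothetical; the paper's encoder, training procedure, and frame distillation objective are not shown here.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((c - v) ** 2 for c, v in zip(codebook[j], vec)),
    )

def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes the remaining residual."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    codes = []
    for codebook in codebooks:
        j = nearest(codebook, residual)
        codes.append(j)
        # Add the chosen entry to the reconstruction, subtract it from the residual.
        recon = [r + c for r, c in zip(recon, codebook[j])]
        residual = [r - c for r, c in zip(residual, codebook[j])]
    return codes, recon

# Hypothetical two-level codebooks: a coarse level and a fine correction level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes, recon = residual_quantize([1.25, 0.75], codebooks)
print(codes, recon)  # [1, 1] [1.25, 0.75]
```

The list of indices in `codes` plays the role of the discrete token; stacking levels lets a small codebook per level represent fine detail that a single codebook of the same size could not.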
