Bio2Token: All-atom tokenization of any biomolecular structure with Mamba
Andrew Liu, Axel Elaldi, Nathan Russell, Olivia Viessmann

TL;DR
Bio2Token learns atom-level tokenizations of complete proteins, RNA, and small molecule structures with a quantized auto-encoder, reconstructing them to well below 1 Angstrom and scaling to systems with almost 100,000 atoms.
Contribution
It presents a quantized auto-encoder for atom-level tokenization of entire biomolecules. Built on a simple Mamba state space model, it scales to nearly 100,000 atoms and is efficient compared to an SE(3)-invariant IPA architecture.
Findings
Achieves reconstruction accuracies well below 1 Angstrom for proteins, RNA, and small molecules.
Scales to systems with nearly 100,000 atoms.
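Uses a simple Mamba state space model architecture that is efficient compared to an SE(3)-invariant IPA architecture and reaches competitive accuracies.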
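To make the sub-Angstrom figure concrete, here is a minimal sketch of one common way to score a reconstruction: the root-mean-square deviation (RMSD) between original and reconstructed atom coordinates. This summary does not state which metric or alignment procedure the authors use, so the function name `rmsd`, the toy coordinates, and the absence of a superposition step are all assumptions made for illustration.

```python
import numpy as np

def rmsd(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate arrays, in the units of the input.

    Assumption: both structures are already in the same reference frame.
    No rigid-body superposition (e.g. Kabsch alignment) is applied here.
    """
    if original.shape != reconstructed.shape:
        raise ValueError("coordinate arrays must have the same shape")
    if original.ndim != 2 or original.shape[1] != 3:
        raise ValueError("expected arrays of shape (N, 3)")
    # Squared Euclidean distance per atom, then mean over atoms.
    squared_dist = np.sum((original - reconstructed) ** 2, axis=1)
    return float(np.sqrt(squared_dist.mean()))

# Hypothetical toy data: three atoms, coordinates in Angstrom.
coords = np.array([[0.0, 0.0, 0.0],
                   [1.5, 0.0, 0.0],
                   [1.5, 1.5, 0.0]])
# Displace every atom by the same (0.3, 0.0, 0.4) vector, i.e. 0.5 Angstrom each.
shifted = coords + np.array([0.3, 0.0, 0.4])

print(rmsd(coords, shifted))  # close to 0.5, below the 1 Angstrom mark
```

In practice, structural RMSD is usually reported after optimally superimposing the two structures, which this sketch deliberately leaves out.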
Abstract
Efficient encoding and representation of large 3D molecular structures with high fidelity is critical for biomolecular design applications. Despite this, many representation learning approaches restrict themselves to modeling smaller systems or use coarse-grained approximations of the systems, for example modeling proteins at the resolution of amino acid residues rather than at the level of individual atoms. To address this, we develop quantized auto-encoders that learn atom-level tokenizations of complete proteins, RNA and small molecule structures with reconstruction accuracies well below 1 Angstrom. We demonstrate that a simple Mamba state space model architecture is efficient compared to an SE(3)-invariant IPA architecture, reaches competitive accuracies and can scale to systems with almost 100,000 atoms. The learned structure tokens of bio2token may serve as the input for all-atom…
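The quantization step at the heart of a quantized auto-encoder can be sketched as a nearest-neighbour lookup in a codebook: each continuous latent vector is replaced by the index of its closest codebook entry, and that index is the token. The abstract does not say which quantizer Bio2Token uses, so what follows is a generic vector-quantization illustration, not the authors' method. The function `quantize`, the two-dimensional latents, and the four-entry codebook are hypothetical, and the encoder and decoder networks are omitted.

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Map each latent vector to its nearest codebook entry.

    latents:  (N, D) array, one continuous vector per atom (assumed layout).
    codebook: (K, D) array of K discrete code vectors.
    Returns (tokens, quantized): integer indices of shape (N,) and the
    corresponding code vectors of shape (N, D).
    """
    # Pairwise squared distances between latents and codes, shape (N, K).
    diff = latents[:, None, :] - codebook[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    tokens = dist2.argmin(axis=1)
    return tokens, codebook[tokens]

# Hypothetical codebook: the four corners of the unit square.
codebook = np.array([[0.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 0.0],
                     [1.0, 1.0]])

# Hypothetical encoder outputs for three atoms.
latents = np.array([[0.10, 0.90],
                    [0.80, 0.20],
                    [0.95, 1.10]])

tokens, quantized = quantize(latents, codebook)
print(tokens.tolist())  # [1, 2, 3]
```

A decoder would then map the sequence of tokens (or their code vectors) back to 3D coordinates, and the reconstruction error measures how much information the discretization loses.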
Taxonomy
Topics: RNA and protein synthesis mechanisms · Origins and Evolution of Life
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
