Atoms as Language: VQ-Atom: Semantic Discretization for Molecular Representation Learning
Takayuki Kimura

TL;DR
VQ-Atom is a semantic discretization method that converts atom-level graph representations into chemically meaningful tokens, improving molecular language modeling for drug discovery tasks.
Contribution
The paper presents VQ-Atom, a framework that uses graph neural network embeddings and vector quantization to turn continuous atom-level graph representations into chemically meaningful discrete tokens for language-based molecular learning.
Findings
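VQ-Atom improves protein-ligand interaction prediction under a protein-cold split, without relying on 3D structural information.
Discretizing atoms by their local chemical environment yields a token vocabulary suitable for Transformer-based pretraining.
The method outperforms conventional tokenization approaches.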
Abstract
Molecular representation learning has become a central approach in AI-driven drug discovery, yet existing molecular tokenizations such as SMILES remain largely syntactic and do not naturally align with chemically meaningful substructures. In this work, we introduce VQ-Atom, a semantic discretization framework that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments. Using graph neural network embeddings and vector quantization, atoms are assigned to codebook entries representing chemically meaningful atomic contexts. These discrete tokens define a molecular language suitable for Transformer-based pretraining. We evaluate VQ-Atom in protein-ligand interaction prediction under a protein-cold split setting without relying on 3D structural information. Experimental results show that VQ-Atom consistently improves predictive…
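The quantization step described in the abstract, assigning each atom embedding to a codebook entry, can be illustrated with a nearest-neighbour lookup. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `nearest_code` and `tokenize_atoms`, the squared Euclidean distance, the 2-D toy vectors, and the fixed codebook are all hypothetical. In the paper's setting the embeddings would come from a graph neural network and the codebook would be learned, neither of which is shown here.

```python
import math


def nearest_code(embedding, codebook):
    """Return the index of the codebook vector closest to `embedding`.

    Closeness is measured by squared Euclidean distance (an assumption;
    the source does not state which distance is used).
    """
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((e - c) ** 2 for e, c in zip(embedding, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize_atoms(atom_embeddings, codebook):
    """Map each atom embedding to a discrete token id."""
    return [nearest_code(e, codebook) for e in atom_embeddings]


# Hypothetical toy data: a 3-entry codebook and four atom embeddings.
# Real embeddings would be produced by a GNN over the molecular graph.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
atom_embeddings = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.05, 0.0]]

tokens = tokenize_atoms(atom_embeddings, codebook)
print(tokens)  # [0, 1, 2, 0]
```

The resulting list of token ids is what a Transformer would consume in place of SMILES characters. Atoms with similar embeddings (the first and fourth above) share a token, which is how a codebook entry can stand for a recurring local chemical environment.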
