Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

Kaiyuan Gao; Yusong Wang; Haoxiang Guan; Zun Wang; Qizhi Pei; John E. Hopcroft; Kun He; Lijun Wu

arXiv:2412.01564·cs.LG·December 3, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces Mol-StrucTok, a novel method for tokenizing 3D molecular structures using spherical coordinates and vector quantization, enabling efficient and stable 3D molecule generation and property prediction.

Contribution

It presents a new 3D molecule tokenization scheme combining spherical coordinate notation with VQ-VAE, facilitating improved 3D molecular generation and property prediction.
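
The vector-quantization half of this scheme can be illustrated with a minimal sketch. It shows only the nearest-neighbor codebook lookup that turns a continuous vector into a discrete token index. The codebook values, the descriptor values, and the function name `quantize` are hypothetical; in a VQ-VAE the codebook is learned and the lookup is applied to encoder outputs rather than to raw coordinates.

```python
import math

def quantize(vec, codebook):
    """Return (index, code) of the codebook entry nearest to vec (Euclidean)."""
    best = min(range(len(codebook)), key=lambda k: math.dist(vec, codebook[k]))
    return best, codebook[best]

# Hypothetical 4-entry codebook over (r, theta, phi)-like descriptors.
# Hand-picked for illustration; a trained VQ-VAE would learn these entries.
codebook = [
    [1.0, 0.5, 0.0],
    [1.5, 1.6, 1.0],
    [1.1, 1.9, -2.0],
    [1.4, 1.0, 3.0],
]

# Hypothetical continuous descriptors, one per atom.
descriptors = [[1.45, 1.55, 0.9], [1.05, 0.6, 0.1], [1.1, 2.0, -1.8]]

# Each continuous descriptor becomes one discrete token id.
tokens = [quantize(d, codebook)[0] for d in descriptors]
print(tokens)  # [1, 0, 2]
```

The resulting integer ids are what a language model can consume; the quality of the reconstruction then depends on how well the learned codebook covers the descriptor space.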

Findings

01. Faster molecule generation with competitive stability.

02. Enhanced property prediction accuracy on the QM9 dataset.

03. Versatile tokenization compatible with various molecular representations.

Abstract

The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES has been well-established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty in designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their…
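
As a rough illustration of the first obstacle, the sketch below computes the spherical coordinates of one atom in a local frame built from three reference atoms and checks that they do not change under a rigid motion. The frame construction, the atom positions, and all function names are assumptions made for this example; the abstract does not specify how Mol-StrucTok chooses its reference atoms.

```python
import math

def sub(u, v):
    return [u[i] - v[i] for i in range(3)]

def dot(u, v):
    return sum(u[i] * v[i] for i in range(3))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(u):
    n = math.sqrt(dot(u, u))
    return [x / n for x in u]

def local_spherical(p, a, b, c):
    """Spherical (r, theta, phi) of p in a right-handed frame built from a, b, c.

    Assumed frame: origin at a, x-axis along b - a, z-axis normal to the
    plane of a, b, c. Fails if a, b, c are collinear or if p coincides with a.
    """
    ex = unit(sub(b, a))
    ez = unit(cross(ex, sub(c, a)))
    ey = cross(ez, ex)
    d = sub(p, a)
    x, y, z = dot(d, ex), dot(d, ey), dot(d, ez)
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle
    phi = math.atan2(y, x)                         # azimuthal angle
    return r, theta, phi

def rigid_transform(p, az, ax, shift):
    """Rotate p about the z-axis by az, then about the x-axis by ax, then translate."""
    x, y, z = p
    x, y = math.cos(az) * x - math.sin(az) * y, math.sin(az) * x + math.cos(az) * y
    y, z = math.cos(ax) * y - math.sin(ax) * z, math.sin(ax) * y + math.cos(ax) * z
    return [x + shift[0], y + shift[1], z + shift[2]]

# Hypothetical positions (arbitrary units): three reference atoms and a target.
a, b, c = [0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0]
p = [2.5, 1.0, 1.2]

before = local_spherical(p, a, b, c)
moved = [rigid_transform(q, 0.7, -1.1, [3.0, -2.0, 0.5]) for q in (p, a, b, c)]
after = local_spherical(*moved)

# The local coordinates agree up to floating-point error.
print(all(abs(u - v) < 1e-9 for u, v in zip(before, after)))
```

Because the frame moves with the molecule, rotating or translating every atom leaves `(r, theta, phi)` unchanged, whereas the raw Cartesian coordinates of `p` change.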

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 5

Strengths

1. This paper is well-written and easy to follow, with clear and informative tables and figures.
2. The proposed method performs well, especially on the conditional generation task.
3. The ablation study is thorough and provides useful insights.

Weaknesses

1. The proposed method is quite similar to existing methods, such as FoldSeek and FoldToken. Specifically, similar to the SE(3)-invariant spherical coordinates here, FoldSeek also uses distances and angles computed based on reference nodes as SE(3)-invariant representations. Furthermore, both methods employ VQ-VAE to learn discrete tokens. These overlapping components limit the novelty of this work.
2. About the datasets: the proposed method is only evaluated on QM9 dataset, which i

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The combination of spherical line notation with vector quantization enables language models to process complex 3D data, which is challenging to discretize. This approach stands out from traditional graph-based or continuous-coordinate models by providing a discrete representation for language models without losing SE(3)-invariant information. Particularly, the augmented tokens incorporate both generation and understanding descriptors, including local spherical coordinates, bond lengths, and angl

Weaknesses

Major

The authors should clarify the rationale behind selecting exactly four neighbors for the atomic descriptor and explicitly address how the descriptor $\mathbf{z}_i$ is defined for atoms with fewer than four neighbors. This is essential, as molecules with varying coordination environments will likely have different numbers of neighbors, impacting the generality of the descriptor across datasets.

Minor

1. The paper’s notations are somewhat inconsistent and could benefit from simplifi

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The authors conduct an extensive set of experiments. They measure validity and uniqueness of generated molecules with different bond assignment methods, perform PoseBusters tests, evaluate quantum mechanical properties, and measure MAE for QM9 property prediction. They achieve state-of-the-art results in most experiments.
2. They also perform additional analysis regarding the inference speed of their method and the effect of the generation temperature on balancing quality and diversity.

Weaknesses

1. This is a hand-crafted tokenization scheme and should be compared to other tokenizers (e.g. BPE-based tokenizers), not just diffusion models and MPNN-based methods.
2. It may also be helpful to compare with structures expressed in other coordinate systems. I'd imagine that without SE(3) invariance there would be a wider range of possible tokenized sequences, making it harder for the GPT-2 model to learn.

Taxonomy

Topics: Nanofabrication and Lithography Techniques · Diatoms and Algae Research

Methods: Dense Connections · Layer Normalization · Linear Layer · Discriminative Fine-Tuning · Weight Decay · Attention Dropout · Residual Connection · Adam · Attention Is All You Need