InertialAR: Autoregressive 3D Molecule Generation with Inertial Frames
Haorui Li, Weitao Du, Yuqiang Li, Hongyu Guo, Shengchao Liu

TL;DR
InertialAR is a novel autoregressive model for 3D molecule generation that achieves state-of-the-art results by using invariant tokenization, geometric-aware attention, and hierarchical prediction of atom types and coordinates.
Contribution
It introduces a canonical tokenization aligned with inertial frames and a geometric rotary positional encoding for invariant and efficient 3D molecule modeling.
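To make the idea of a rotary positional encoding over 3D coordinates concrete, here is a minimal NumPy sketch of a generic rotary encoding applied per axis. It is not the paper's GeoRoPE: the function name `rope_3d`, the per-axis split of the feature dimension, and the frequency schedule with `base=10000.0` are all assumptions chosen for illustration.

```python
import numpy as np

def rope_3d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`.

    vec: (d,) feature vector with d divisible by 6 (hypothetical layout:
         one third of the feature pairs per spatial axis).
    pos: (3,) atom coordinates, assumed to be in a canonical frame already.
    """
    vec = np.asarray(vec, dtype=float)
    pos = np.asarray(pos, dtype=float)
    d = vec.shape[0]
    assert d % 6 == 0, "feature dimension must be divisible by 6"
    n_pairs = d // 6  # number of rotated pairs per axis
    freqs = base ** (-np.arange(n_pairs) / n_pairs)
    # One angle per feature pair: x-block, then y-block, then z-block.
    angles = np.concatenate([p * freqs for p in pos])
    even, odd = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = even * np.cos(angles) - odd * np.sin(angles)
    out[1::2] = even * np.sin(angles) + odd * np.cos(angles)
    return out
```

Because each pair is rotated by an angle that is linear in the coordinate, the dot product of an encoded query and an encoded key depends only on the difference of their positions, so shifting both atoms by the same vector leaves the attention score unchanged. The encoding is not rotation invariant by itself; in this sketch that property would have to come from expressing coordinates in a canonical frame first.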
Findings
Achieves state-of-the-art on 7 of 10 metrics for unconditional molecule generation.
Outperforms baselines in controllable generation for targeted chemical functionalities.
Demonstrates strong performance on QM9, GEOM-DRUG, and the larger B3LYP dataset.
Abstract
Transformer-based autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) tokenizing molecules into a canonical 1D sequence of tokens that is invariant to both SE(3) transformations and atom index permutations, and (2) designing an architecture capable of modeling hybrid atom-based tokens that couple discrete atom types with continuous 3D coordinates. To address these challenges, we introduce InertialAR. InertialAR devises a canonical tokenization that aligns molecules to their inertial frames and reorders atoms to ensure SE(3) and permutation invariance. Moreover, InertialAR equips the attention mechanism with geometric awareness via geometric rotary positional encoding (GeoRoPE). In addition, it utilizes a hierarchical…
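As an illustration of the inertial-frame alignment described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. The function name `inertial_frame_coords`, the use of atomic masses as weights, and the ascending eigenvalue order returned by `numpy.linalg.eigh` are assumptions made for the example; axis signs and atom reordering are deliberately left unresolved here.

```python
import numpy as np

def inertial_frame_coords(coords, masses):
    """Center a molecule at its center of mass and project onto its principal axes.

    coords: (n, 3) Cartesian coordinates.
    masses: (n,) positive weights (assumption: atomic masses).
    Returns (n, 3) coordinates in the principal-axis frame, defined only up to
    a sign flip of each axis.
    """
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    center = (masses[:, None] * coords).sum(axis=0) / masses.sum()
    x = coords - center
    # Inertia tensor: I = sum_i m_i (|x_i|^2 * Id - x_i x_i^T)
    r2 = (x ** 2).sum(axis=1)
    inertia = (masses * r2).sum() * np.eye(3) - np.einsum("i,ij,ik->jk", masses, x, x)
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors
    # as columns; using that order for the axes is an assumption of this sketch.
    _, axes = np.linalg.eigh(inertia)
    return x @ axes
```

Under these assumptions, rotating and translating the input changes the output only by per-axis sign flips, provided the three principal moments are distinct. A separate sign convention and a deterministic atom ordering would still be needed to obtain a unique token sequence.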
Peer Reviews
Decision: Submitted to ICLR 2026
- This paper proposes a novel 3D molecule generation framework. Some contributions, such as canonical atom ordering and preserving SE(3) invariance through inertial-frame-based coordinate projection, are very useful.
- The experimental results are generally good and promising.
- The writing of this paper is good and clear.
- Some details need clarification. What is the ordering of eigenvalues in line 186? How is the ordering by the refined identifiers done in line 213?
- A major novel contribution of this paper is the use of geometric rotary positional encoding (GeoRoPE) together with a transformer architecture as the backbone network. However, no ablation study of this architecture is conducted, so its impact on performance is unclear. Could we just use a 3D graph neural network or gr
**Technical Quality**: The GeoRoPE architecture is a creative and effective way to inject geometric information into the attention mechanism, combining relative positions (RoPE-3D) and pairwise distances (Nyström) into a single attention score.
**Significance & Performance**: The model demonstrates exceptional performance, not just on standard benchmarks but also on a large-scale dataset (B3LYP) and a highly practical controllable generation task, showing SOTA results across all metrics for the
**Originality**: The use of a molecule's inertial frame as a canonical reference is a common solution to the SE(3) invariance problem.
**Robustness Not Addressed**: The paper does not discuss the stability of the inertial frame canonicalization. For symmetric molecules (degenerate eigenvalues) or flexible molecules (where small conformational changes could flip the axes), the token sequence could become unstable, which is a significant problem for an AR model.
**Missing Ablation Studies**: T
1. The two-step canonicalization (aligning each molecule to an inertial frame with a deterministic sign convention, then applying a deterministic atom reordering) removes SE(3) and permutation ambiguities without specialized equivariant networks.
2. It achieves SOTA or near-SOTA validity/stability on QM9 and GEOM-DRUG and shows big gains on the large B3LYP benchmark
1. To pick axis signs, the authors choose a “fourth node” (the atom farthest from the origin) and require it to lie in the first quadrant of the xy plane; this rule unambiguously fixes signs but could flip when the farthest atom changes under small perturbations, i.e., the frame is not continuous [2]. The same situation arises when principal moments tie: small geometric changes can swap frames and thus the token order.
2. The authors do not situate this design within prior work on PCA-based pose
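
The sign rule summarized in the review above, and the discontinuity it raises, can be sketched in a few lines of NumPy. This is a reading of the reviewer's description rather than the paper's code: the function name `fix_axis_signs` is hypothetical, and setting the z sign so that the overall transform stays a proper rotation is an assumption of this sketch.

```python
import numpy as np

def fix_axis_signs(x):
    """Flip axes so the atom farthest from the origin has x >= 0 and y >= 0.

    x: (n, 3) coordinates already centered and expressed along principal axes.
    """
    x = np.asarray(x, dtype=float)
    far = np.argmax(np.linalg.norm(x, axis=1))  # the "fourth node"
    sx = 1.0 if x[far, 0] >= 0 else -1.0
    sy = 1.0 if x[far, 1] >= 0 else -1.0
    sz = sx * sy  # assumption: keep determinant +1 (no reflection)
    return x * np.array([sx, sy, sz])

# Two atoms at almost the same distance from the origin, in different quadrants.
a = np.array([2.0, 1.0, 0.5])
b = np.array([-2.0, 1.0, 0.5])
before = fix_axis_signs(np.stack([a, 0.99 * b]))  # atom a is farthest
after = fix_axis_signs(np.stack([a, 1.01 * b]))   # atom b is farthest
```

In this toy case a 2% change in the distance of one atom moves the other atom's canonical x and z coordinates to the opposite sign, which is the kind of jump the reviewer describes as a non-continuous frame.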
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Domain Adaptation and Few-Shot Learning
