Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics
Nima Shoghi, Yuxuan Liu, Yuning Shen, Rob Brekelmans, Pan Li, Quanquan Gu

TL;DR
STAR-MD introduces a scalable, SE(3)-equivariant diffusion model with joint spatio-temporal attention, enabling long-horizon, microsecond-scale protein trajectory generation with high fidelity and structural validity.
Contribution
The paper presents STAR-MD, a novel causal diffusion transformer that efficiently models complex spatio-temporal dependencies for long-horizon protein dynamics simulation.
Findings
Achieves state-of-the-art performance on the ATLAS benchmark.
Successfully generates stable microsecond protein trajectories.
Outperforms previous methods in conformational coverage and structural validity.
Abstract
Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance…
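One way to picture "causal" joint spatio-temporal attention is as a block-causal mask over a flattened sequence of (frame, residue) tokens: every residue token may attend to all residues in its own frame and in earlier frames, but to none in later frames. The sketch below illustrates that idea in plain Python. The token layout, the function names `block_causal_mask` and `masked_attention`, and the single-head formulation are assumptions made for illustration; the abstract does not specify STAR-MD's actual masking or attention implementation.

```python
import math

def block_causal_mask(num_frames, num_residues):
    """Return a (T*N) x (T*N) boolean mask; True means the query may attend to the key.

    Assumed layout (hypothetical): token index = frame * num_residues + residue.
    """
    frame_of_token = [t for t in range(num_frames) for _ in range(num_residues)]
    return [[fq >= fk for fk in frame_of_token] for fq in frame_of_token]

def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention restricted to the allowed keys."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for qi, q in enumerate(queries):
        # Keys in the same frame or in earlier frames only.
        allowed = [ki for ki in range(len(keys)) if mask[qi][ki]]
        scores = [scale * sum(a * b for a, b in zip(q, keys[ki])) for ki in allowed]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[ki][d] for w, ki in zip(weights, allowed)) / total
            for d in range(len(values[0]))
        ])
    return outputs
```

With a mask of this shape, changing anything in a later frame leaves the outputs for earlier frames untouched, which is the property an autoregressive rollout relies on when it reuses cached keys and values from past frames.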
Peer Reviews
Decision: ICLR 2026 Poster
1. Clear technical contribution: Combines SE(3)-equivariant frame modeling and continuous-time conditioning in a diffusion framework, which is conceptually elegant and well-motivated. 2. Strong empirical results: STAR-MD substantially outperforms prior coordinate-based models in conformational coverage, dynamic fidelity, and stability over long horizons. 3. Method clarity (core model): The paper’s architectural description and use of per-residue SE(3) frames are well explained. 4. Long-horizon evalu
1. Comparative clarity. It’s difficult to realize from the text that the baselines use very different geometric representations: AlphaFolding-4D is all-atom, MDGen and ConfRover are backbone-level, and STAR-MD is Cα-only. Because this difference likely affects both stability and metric scale, the paper should state it explicitly and discuss how Cα-level evaluation influences comparisons. 2. Alignment and Δt normalization. The paper does not describe how trajectories are aligned prior to RMSD/PC
- **Clarity**: Overall the paper is easy to follow (however, I believe the motivation for the approach could be explained better, see weaknesses below). - **Originality:** The paper makes a novel and original contribution in the area of protein dynamics by explicitly modeling long context windows and using joint spatio-temporal attention with conditioning noise to achieve long-term temporal generative modeling of molecular dynamics trajectories. However, the different components can be found els
- **Presentation of motivation:** In the main text, the authors argue without much explanation that long context and complex spatio-temporal modeling are necessary to accurately model molecular dynamics trajectories. Naively, it will be difficult for most readers to understand why that is, given that the original atomistic molecular dynamics data is essentially Markovian, corresponding to a simple integration of equations of motion (possibly with a thermostat/barostat). The reasoning for this is
1. The primary strength is the joint S$\times$T attention on singles-only features. This is a very clever architectural trade-off. It avoids the $\mathcal{O}(N^3)$ or $\mathcal{O}(N^2 L)$ complexity of competitors (AlphaFolding, ConfRover), which is the main barrier to long-horizon simulation. The KV cache analysis (Fig. 5) proves this is a massive practical win (e.g., 6.6MB vs 1.3GB per layer). 2. The "contextual noise perturbation" is the second key contribution. Autoregressive models are not
1. The paper's greatest strength is also its biggest unproven assumption. The model trades physical explicitness (i.e., operating on pair features) for computational speed. It bets that its S$\times$T attention is powerful enough to implicitly learn all the complex, long-range pairwise interactions (like allostery) just from single-residue features. While this seems to work for the ATLAS proteins, it is a major leap of faith that this will hold for more complex systems defined by subtle, coopera
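The scaling argument raised in the reviews above (caching per-residue features versus per-pair features across L frames) can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: the function names, the two-byte value size, and the example dimensions are hypothetical, and it does not attempt to reproduce the per-layer figures the reviewer quotes from the paper's Fig. 5.

```python
def kv_cache_bytes_singles(num_frames, num_residues, channels, bytes_per_value=2):
    """Keys + values cached per residue per frame: 2 tensors of shape (L, N, C)."""
    return 2 * num_frames * num_residues * channels * bytes_per_value

def kv_cache_bytes_pairs(num_frames, num_residues, channels, bytes_per_value=2):
    """Keys + values cached per residue pair per frame: 2 tensors of shape (L, N, N, C)."""
    return 2 * num_frames * num_residues ** 2 * channels * bytes_per_value

# Hypothetical sizes chosen only for illustration (not taken from the paper).
L_frames, N_residues, C_channels = 10, 100, 64
singles = kv_cache_bytes_singles(L_frames, N_residues, C_channels)
pairs = kv_cache_bytes_pairs(L_frames, N_residues, C_channels)
print(f"singles-only cache: {singles / 1e6:.3f} MB per layer")
print(f"pair-feature cache: {pairs / 1e6:.3f} MB per layer")
print(f"ratio: {pairs // singles}x")
```

Under these assumptions, and with equal channel width, the pair-feature cache is larger by exactly a factor of N, so the gap widens with protein size regardless of how many frames are kept in context.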
Taxonomy
Topics: Protein Structure and Dynamics · Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis
