UNAAGI: Atom-Level Diffusion for Generating Non-Canonical Amino Acid Substitutions
Han Tang, Wouter Boomsma

TL;DR
UNAAGI is a diffusion-based model that predicts amino acid substitutions at the atomic level, including non-canonical amino acids, improving mutation effect prediction and enabling broader protein engineering applications.
Contribution
Introduces UNAAGI, a novel E(3)-equivariant diffusion model for residue reconstruction that includes non-canonical amino acids, expanding the scope of structure-based mutation prediction.
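To make "E(3)-equivariant" concrete, here is a minimal sketch, not UNAAGI's architecture. It implements a generic EGNN-style coordinate update and checks numerically that rotating, reflecting, or translating the input atoms transforms the output in the same way. The function names and `toy_weight` (a stand-in for a learned edge network) are hypothetical and chosen only for illustration.

```python
import math


def egnn_coordinate_update(coords, weight_fn):
    """One EGNN-style update: x_i += mean over j of (x_i - x_j) * w(|x_i - x_j|^2)."""
    n = len(coords)
    updated = []
    for i in range(n):
        shift = [0.0, 0.0, 0.0]
        for j in range(n):
            if i == j:
                continue
            diff = [coords[i][k] - coords[j][k] for k in range(3)]
            sq_dist = sum(d * d for d in diff)
            w = weight_fn(sq_dist)  # depends only on a rotation-invariant quantity
            for k in range(3):
                shift[k] += w * diff[k]
        updated.append([coords[i][k] + shift[k] / (n - 1) for k in range(3)])
    return updated


def rotate_z(coords, angle):
    """Rotate all points about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y, s * x + c * y, z] for x, y, z in coords]


def reflect_x(coords):
    """Mirror all points through the yz-plane (E(3) includes reflections)."""
    return [[-x, y, z] for x, y, z in coords]


def translate(coords, offset):
    return [[p[k] + offset[k] for k in range(3)] for p in coords]


def transform(coords, angle, offset):
    """Apply reflection, then rotation, then translation."""
    return translate(rotate_z(reflect_x(coords), angle), offset)


def max_abs_diff(a, b):
    return max(abs(p[k] - q[k]) for p, q in zip(a, b) for k in range(3))


def toy_weight(sq_dist):
    # Hypothetical stand-in for a learned edge MLP (assumption, not from the paper).
    return 1.0 / (1.0 + sq_dist)


# Four made-up atom positions (illustrative values only).
atoms = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.3], [0.7, 2.1, -0.5]]
angle, offset = 0.8, [3.0, -1.0, 2.0]

transform_then_update = egnn_coordinate_update(transform(atoms, angle, offset), toy_weight)
update_then_transform = transform(egnn_coordinate_update(atoms, toy_weight), angle, offset)
equivariance_error = max_abs_diff(transform_then_update, update_then_transform)
print(f"max deviation between the two orders: {equivariance_error:.2e}")
```

The two orders agree up to floating-point error because the update uses only relative position vectors scaled by functions of distances, which is the general property an E(3)-equivariant layer is built to guarantee.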
Findings
Outperforms existing methods on NCAA substitution benchmarks.
Enables exploration of non-canonical amino acids in protein design.
Suggests a unified approach for protein engineering and drug design.
Abstract
Proposing beneficial amino acid substitutions, whether for mutational effect prediction or protein engineering, remains a central challenge in structural biology. Recent inverse folding models, trained to reconstruct sequences from structure, have had considerable impact in identifying functional mutations. However, current approaches are constrained to designing sequences composed exclusively of natural amino acids (NAAs). The larger set of non-canonical amino acids (NCAAs), which offer greater chemical diversity and are frequently used in in vivo protein engineering, remains largely inaccessible to current variant effect prediction methods. To address this gap, we introduce UNAAGI, a diffusion-based generative model that reconstructs residue identities from atomic-level structure using an E(3)-equivariant framework. By modeling side chains in full atomic detail rather than…
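The abstract does not give the noise schedule or parameterization, so the following is only a sketch of the standard DDPM-style forward process applied to atom coordinates, assuming a linear beta schedule. It shows how side-chain atom positions would be progressively corrupted before a denoising network learns to reverse the process. All names, step counts, and coordinates are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    values, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        values.append(product)
    return values


def noise_coordinates(coords, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [[signal * c + sigma * rng.gauss(0.0, 1.0) for c in atom] for atom in coords]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)

# Three made-up side-chain atom positions in angstroms (illustrative only).
side_chain = [[1.2, 0.4, -0.3], [2.1, 1.5, 0.2], [3.4, 1.1, 0.9]]

noisy_early = noise_coordinates(side_chain, alpha_bars[0], rng)   # nearly unchanged
noisy_late = noise_coordinates(side_chain, alpha_bars[-1], rng)   # close to pure noise
print(noisy_early[0], noisy_late[0])
```

At the first step the coordinates barely move, while at the last step almost no signal from the original structure remains. A generative model of this kind is trained to undo that corruption conditioned on the surrounding protein context.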
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper addresses an important and largely unexplored problem of extending variant effect prediction to non-canonical amino acids.
2. The atomic-level generation approach is conceptually sound.
3. The connection between structure-based drug design and protein engineering is insightful. Recognizing that both domains involve modeling non-covalent interactions in protein contexts and could share methodological tools is a valuable observation that may inspire future work.
4. The virtual node pa
1. **The method fails to demonstrate meaningful NCAA prediction capability despite being its core contribution**. The model only successfully samples a small fraction of the 20 NCAAs in the benchmark and admits it "tends to interpolate between canonical-like structures" rather than generating chemically distinct non-canonical amino acids. This undermines the entire premise of the work.
2. **The experimental scale is too small to evaluate the approach**. Training on only 1,000 PDB structures wi
1. Shifting from discrete symbol (token) prediction to continuous, atom-by-atom side-chain generation is an interesting direction. Traditional inverse folding models (such as ProteinMPNN and ESM-IF1) are fundamentally limited by their fixed output vocabulary. UNAAGI cleverly circumvents this constraint by modeling the underlying atomic coordinates directly.
2. Choosing an E(3)-equivariant Graph Neural Network architecture is a methodologically sound and principled choice. Since molecular data is in
1. The experimental evaluation strategy has significant weaknesses, which undermine the credibility of UNAAGI's performance conclusions on natural amino acids. Selecting a subset by protein length lacks sufficient justification and may introduce systematic selection bias, as small proteins are more likely to be single-domain globular proteins whose properties are mainly determined by local interactions. UNAAGI is a model that relies on loca
1. **Novel Generative Task**: The inclusion of non-canonical amino acid substitution as a generative task is a significant and novel contribution. This method expands the scope of computational protein design beyond the traditional 20 canonical residues, opening promising new avenues for exploring protein functionality, stability, and therapeutic applications using synthetic or engineered residues.
1. **Questionable Methodological Novelty**: The proposed UNAAGI model appears to be a combination of existing methodologies, and the manuscript does not sufficiently elucidate the key, specific technical innovations that distinguish this model from its predecessors in the context of the presented tasks.
2. **Insufficient Experimental Baselines**: The experimental evaluation is limited by a narrow selection of baseline methods. It omits comparisons with several highly relevant state-of-the-
Taxonomy
Topics: RNA and protein synthesis mechanisms · Protein Structure and Dynamics · Machine Learning in Materials Science
