Protein generation with embedding learning for motif diversification
Kevin Michalewicz, Chen Jin, Philip Alexander Teare, Tom Diethe, Mauricio Barahona, Barbara Bravi, Asher Mullokandov

TL;DR
This paper introduces PGEL, a novel protein generation framework that uses embedding learning to achieve greater structural diversity while maintaining biological function, outperforming existing diffusion-based methods.
Contribution
PGEL is a new embedding-based approach that enhances motif diversity in protein design by perturbing high-dimensional embeddings within a diffusion model.
Findings
PGEL achieves higher structural diversity than partial diffusion.
PGEL maintains motif viability and designability.
PGEL demonstrates improved self-consistency across cases.
Abstract
A fundamental challenge in protein design is the trade-off between generating structural diversity and preserving a motif's biological function. Current state-of-the-art methods, such as partial diffusion in RFdiffusion, often fail to resolve this trade-off: small perturbations yield motifs nearly identical to the native structure, whereas larger perturbations violate the geometric constraints necessary for biological function. We introduce Protein Generation with Embedding Learning (PGEL), a general framework that learns high-dimensional embeddings encoding sequence and structural features of a target motif in the representation space of a diffusion model's frozen denoiser, and then enhances motif diversity by introducing controlled perturbations in the embedding space. PGEL is thus able to loosen geometric constraints while satisfying typical design metrics, leading to more diverse yet…
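The two stages described in the abstract (fit an embedding against a frozen denoiser, then perturb that embedding) can be illustrated with a deliberately tiny toy. This is only a sketch under strong assumptions and not the authors' implementation: the "denoiser" is a fixed linear map rather than RFdiffusion, the "motif" is a list of six numbers, and every name here (`frozen_denoiser`, `learn_embedding`, `perturb_embedding`, `TARGET`, `WEIGHTS`) is hypothetical.

```python
import math
import random

# Toy stand-ins (all hypothetical): a 6-number "motif" and a fixed linear map
# playing the role of the frozen denoiser's conditioning pathway.
TARGET = [1.0, -0.5, 2.0, 0.0, 1.5, -1.0]
WEIGHTS = [[1.0 if i == j else 0.1 for j in range(6)] for i in range(6)]


def frozen_denoiser(noisy, embedding, weights=WEIGHTS):
    """Predict clean values from a noisy input and a conditioning embedding.

    The weights are never updated; only the embedding is learned.
    """
    conditioned = [sum(w * e for w, e in zip(row, embedding)) for row in weights]
    return [0.5 * n + 0.5 * c for n, c in zip(noisy, conditioned)]


def learn_embedding(target, weights=WEIGHTS, steps=400, lr=0.5,
                    noise_scale=0.5, seed=0):
    """Fit an embedding by gradient descent on a denoising loss."""
    rng = random.Random(seed)
    n = len(target)
    embedding = [0.0] * len(weights[0])
    for _ in range(steps):
        # Corrupt the target, then ask the frozen denoiser to recover it.
        noisy = [t + rng.gauss(0.0, noise_scale) for t in target]
        pred = frozen_denoiser(noisy, embedding, weights)
        residual = [p - t for p, t in zip(pred, target)]
        # Loss = mean(residual^2); d pred_i / d e_j = 0.5 * W[i][j],
        # so grad_j = (1/n) * sum_i W[i][j] * residual_i.
        grad = [sum(weights[i][j] * residual[i] for i in range(n)) / n
                for j in range(len(embedding))]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


def reconstruction_error(embedding, target, weights=WEIGHTS):
    """Euclidean error of the denoiser output when fed the clean target."""
    pred = frozen_denoiser(target, embedding, weights)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))


def perturb_embedding(embedding, sigma, rng):
    """Add isotropic Gaussian noise of scale sigma to the learned embedding."""
    return [e + sigma * rng.gauss(0.0, 1.0) for e in embedding]


if __name__ == "__main__":
    learned = learn_embedding(TARGET)
    rng = random.Random(1)
    for sigma in (0.0, 0.1, 0.5):
        shifted = perturb_embedding(learned, sigma, rng)
        print(sigma, round(reconstruction_error(shifted, TARGET), 3))
```

In this toy, `sigma` plays the role of the controlled perturbation: a value of zero leaves the learned embedding unchanged, and larger values move the conditioning signal further from the one fitted to the target. How perturbation strength relates to motif diversity and viability in the actual framework is something only the paper's experiments can establish.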
Peer Reviews
Decision: Submitted to ICLR 2026
* Adapting textual inversion from image generation to protein diffusion models is creative and represents a genuine methodological contribution. The idea of learning embeddings in a frozen model's representation space rather than directly perturbing coordinates is interesting.
* The use of hierarchical clustering with TM-score thresholds and the requirement for distinguishability from native structures provide a reasonable assessment of structural diversity.
* The authors transparently discuss
1. **The authors do not demonstrate that embedding learning is necessary.** The paper's core premise, that learning motif embeddings is essential, is never validated. No comparison to simpler baselines like random embeddings, embeddings from related structures, or zero embeddings during generation (only learning). This is the most critical missing experiment, as it questions whether the convoluted optimization procedure that requires a full structural denoising for a single gradient step, provid
The authors tackle an important problem of low diversity in protein design. The approach is flexible and can be used with different models.
### Key Weaknesses
The method is too computationally expensive for practical use (requiring 2 hours of fine-tuning per protein for motifs up to 50 amino acids). Results and conclusions are drawn from only three selective examples. The authors flip the standard motif scaffolding problem on its head, which significantly complicates understanding of the paper. Furthermore, the comparison is limited to only one model (Partial Diffusion), despite the existence of numerous models that could perform t
It's a creative and simple idea, well adapted from textual inversion to protein diffusion, that operates in embedding space rather than directly perturbing atomic coordinates. Motif diversification is a challenging and important problem in protein design, both for de novo design and for optimizing functional proteins. Across three test cases, PGEL clearly demonstrates its superiority over using partial diffusion alone. As the authors note, the framework can be easily transferred to other protein dif
I’d like to mention a few points. First, a systematic analysis of the masking parameters and their effect on structural diversity would have been a valuable addition to the work. Second, ranking or categorizing motifs by type, for example structured single-motif, structured multi-motif, and unstructured (loop) motifs, would have been both informative and interesting. Third, the method was only compared to partial diffusion in RFdiffusion, without benchmarking against other existing pr
Taxonomy
Topics: Gene Regulatory Network Analysis · Protein Structure and Dynamics · Genomics and Chromatin Dynamics
