Lightweight MSA Design Advances Protein Folding From Evolutionary Embeddings

Hanqun Cao; Xinyi Zhou; Zijun Gao; Chenyu Wang; Xin Gao; Zhi Zhang; Cesar de la Fuente-Nunez; Chunbin Gu; Ge Liu; Pheng-Ann Heng

arXiv:2507.07032·cs.LG·September 29, 2025

3 Reviews

TL;DR

PLAME is a lightweight MSA design framework that improves protein structure prediction for low-homology proteins by leveraging pretrained language models and novel selection and quality metrics, enhancing existing folding tools.

Contribution

Introduces PLAME, a novel MSA generation method using evolutionary embeddings and a conservation-diversity loss, with strategies for candidate filtering and quality assessment, improving folding accuracy.

Findings

01

PLAME achieves state-of-the-art structure accuracy on low-homology benchmarks.

02

The selection strategy significantly improves the quality of MSAs.

03

PLAME enables ESMFold to reach near AlphaFold2 accuracy with faster inference.

Abstract

Protein structure prediction often hinges on multiple sequence alignments (MSAs), which underperform on low-homology and orphan proteins. We introduce PLAME, a lightweight MSA design framework that leverages evolutionary embeddings from pretrained protein language models to generate MSAs that better support downstream folding. PLAME couples these embeddings with a conservation-diversity loss that balances agreement on conserved positions with coverage of plausible sequence variation. Beyond generation, we develop (i) an MSA selection strategy to filter high-quality candidates and (ii) a sequence-quality metric that is complementary to depth-based measures and predictive of folding gains. On AlphaFold2 low-homology/orphan benchmarks, PLAME delivers state-of-the-art improvements in structure accuracy (e.g., lDDT/TM-score), with consistent gains when paired with AlphaFold3. Ablations…
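To make the conservation versus diversity trade-off concrete, the following is a minimal sketch of two generic MSA statistics: per-column conservation (one minus normalized Shannon entropy) and mean pairwise Hamming distance between sequences. This is not PLAME's loss, which the excerpt above does not specify; the function names, the 21-symbol alphabet (20 amino acids plus gap), and the toy alignment are assumptions made purely for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Assumption: 20 amino acids plus the gap symbol "-" (hypothetical choice).
ALPHABET_SIZE = 21


def column_entropy(msa, col):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(seq[col] for seq in msa)
    n = len(msa)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mean_conservation(msa):
    """Average of (1 - normalized entropy) over columns; 1.0 means every column is invariant."""
    max_entropy = math.log2(ALPHABET_SIZE)
    length = len(msa[0])
    return sum(1 - column_entropy(msa, j) / max_entropy for j in range(length)) / length


def mean_pairwise_distance(msa):
    """Average fraction of mismatched positions over all sequence pairs (a simple diversity proxy)."""
    pairs = list(combinations(msa, 2))
    if not pairs:
        return 0.0
    return sum(
        sum(a != b for a, b in zip(s, t)) / len(s) for s, t in pairs
    ) / len(pairs)


if __name__ == "__main__":
    toy_msa = ["ACDE", "ACDF", "ACGE"]  # hypothetical aligned sequences of equal length
    print("conservation:", round(mean_conservation(toy_msa), 3))
    print("diversity:", round(mean_pairwise_distance(toy_msa), 3))
```

An alignment of identical sequences scores maximal conservation and zero diversity, while unrelated sequences score the opposite; an objective of the kind the abstract describes would need to reward both properties at once.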

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper addresses an important problem of improving MSAs for low homology cases.
2. The proposed conservation–diversity objective is both intuitive and theoretically justified.

Weaknesses

1. The notation is not clear. (1) The meaning of the different dimensions in $\mathbf{H}_\mathrm{enc}$ is not given (except $N$, which was introduced in L135). (2) Eq (5) does not specify which one was encoded from $\mathbf{H}_r$. (3) Which two axes are permuted in $\mathbf{X}_\mathrm{dec}^\top$? (4) MSAs were denoted by $\mathbf{M}$ in L135, while in L238 they were denoted by $M=\{m_1, \cdots, m_n\}$.
2. The rationale of the proposed MSA selection method is not suffici

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. This study addresses MSA design, a crucial aspect of protein structure prediction. The research demonstrates thorough motivation, theoretical analysis, and comprehensive experiments, with the paper presenting a well-founded and complete work.
2. The experimental results show superior performance on multiple tasks and baseline models.

Weaknesses

1. The proposed combined loss function is theoretically well motivated; however, the paper still lacks an ablation study on different loss functions, which is needed because the loss is one of the proposed methods.
2. Similarly, I wonder about the effectiveness of the MSA selection module. It would be better to include more ablation studies.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Addresses an important limitation: lack of MSAs for low-homology proteins.
2. Introduces a clear conservation–diversity objective and a simple selection module (HiFiAD).
3. Evaluated on multiple structure predictors and standard benchmarks.
4. Offers practical computational efficiency compared with traditional MSA search.
5. Provides some transparency through ablations and limitations discussion.

Weaknesses

1. Methodological risk: unverified AI-generated data (major). PLAME uses embeddings from a pre-trained model (ESM-2) as the only source of evolutionary signal. These are AI-generated, unvalidated representations but are treated as if they contained true biological information. This undermines methodological soundness and risks propagating training-set biases.
2. Experimental results show that PLAME does not always outperform baselines; in several targets, accuracy decreases. This s
