No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation
Furkan Eris

TL;DR
Proust is a causal protein language model that pairs MLM-level fitness prediction with generation: it is competitive with masked language models on ProteinGym substitutions and sets a new state of the art on indels.
Contribution
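The paper introduces Proust, a 309M-parameter causal PLM whose architectural changes (grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions) enable efficient training and strong zero-shot performance on protein fitness benchmarks.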
Findings
Achieves Spearman ρ=0.390 on ProteinGym substitutions
Sets new state-of-the-art on indels, outperforming larger models
Approaches structure-aware methods using sequence alone
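The findings above are reported as Spearman rank correlation between model scores and measured fitness. As a reminder of what that metric computes, here is a minimal pure-Python sketch. The function names and the example numbers are illustrative assumptions and are not taken from the paper or from ProteinGym.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(scores, fitness):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(scores), average_ranks(fitness))


# Made-up model scores and assay measurements for five variants.
model_scores = [-2.1, -0.3, -1.7, 0.4, -0.9]
measured_fitness = [0.10, 0.80, 0.35, 0.95, 0.30]
print(spearman_rho(model_scores, measured_fitness))
```

Because only ranks matter, the metric rewards a model for ordering variants correctly even when its raw scores are on a different scale from the assay.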
Abstract
Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce Proust, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman ρ = 0.390 on ProteinGym substitutions, competitive with MLMs requiring 50–200× the compute. On indels, Proust sets a new state-of-the-art, outperforming models up to 20× larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These powerful representations position Proust…
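One common way to obtain zero-shot fitness scores from a causal model is to compare the sequence log-likelihood of a variant with that of the wild type. The sketch below illustrates that general idea only. The abstract does not state Proust's exact scoring rule, and `ToyCausalModel`, its `next_token_logprobs` method, and the probabilities it returns are hypothetical stand-ins for a trained model.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class ToyCausalModel:
    """Hypothetical stand-in for a trained causal PLM (not Proust itself).

    It slightly prefers repeating the previous residue, which is enough
    to demonstrate the scoring arithmetic.
    """

    def next_token_logprobs(self, prefix):
        if not prefix:
            # Uniform distribution over the 20 residues at the first position.
            return {aa: math.log(1 / 20) for aa in AMINO_ACIDS}
        prev = prefix[-1]
        # 0.24 for repeating the previous residue, 0.04 for each of the other 19.
        return {aa: math.log(0.24 if aa == prev else 0.04) for aa in AMINO_ACIDS}


def sequence_log_likelihood(model, sequence):
    """Sum log p(x_i | x_<i) left to right, as an autoregressive model factorizes it."""
    total = 0.0
    for i, residue in enumerate(sequence):
        total += model.next_token_logprobs(sequence[:i])[residue]
    return total


def zero_shot_score(model, wild_type, variant):
    """Log-likelihood ratio of variant versus wild type; higher suggests fitter."""
    return sequence_log_likelihood(model, variant) - sequence_log_likelihood(model, wild_type)


model = ToyCausalModel()
wild_type = "AAAA"
print(zero_shot_score(model, wild_type, "AAGA"))  # substitution
print(zero_shot_score(model, wild_type, "AAA"))   # deletion
print(zero_shot_score(model, wild_type, "AAAGA")) # insertion
```

Because the score depends on whole-sequence likelihoods, it applies to insertions and deletions without any position-by-position alignment to the wild type. Raw sums favor shorter sequences, since every added residue contributes a negative log-probability, so whether to normalize by length is a design choice this sketch leaves open.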
Taxonomy
Topics: Machine Learning in Bioinformatics · Artificial Intelligence in Healthcare and Education · Genomics and Rare Diseases
