No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation

Furkan Eris

arXiv:2602.01845·cs.LG·February 3, 2026

TL;DR

Proust is a causal protein language model that keeps generative capability while matching masked language models on fitness prediction: it reaches Spearman ρ = 0.390 on ProteinGym substitutions and sets a new state of the art on indels.

Contribution

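The paper introduces Proust, a 309M-parameter causal PLM whose architectural changes (grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions) enable efficient training and strong results on protein fitness benchmarks.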

Findings

01 Achieves Spearman ρ = 0.390 on ProteinGym substitutions

02 Sets new state-of-the-art on indels, outperforming larger models

03 Approaches structure-aware methods using sequence alone
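
The Spearman ρ in the first finding is a rank correlation between model scores and measured fitness. A minimal sketch of the metric in plain Python follows; the scores and fitness values are made-up toy numbers (not ProteinGym data), and the function names are illustrative rather than taken from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: hypothetical model scores vs. hypothetical measured fitness.
model_scores = [-2.1, -0.3, -1.2, -4.0, -0.9]
measured_fitness = [0.2, 0.9, 0.8, 0.1, 0.5]
print(spearman(model_scores, measured_fitness))  # 0.9 for this toy data
```

Because only ranks matter, the metric rewards a model for ordering variants correctly even when its raw scores are on a different scale from the assay.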

Abstract

Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce Proust, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman ρ = 0.390 on ProteinGym substitutions, competitive with MLMs requiring 50–200× the compute. On indels, Proust sets a new state-of-the-art, outperforming models up to 20× larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These powerful representations position Proust…
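
The abstract as shown here does not spell out the scoring rule. One common zero-shot approach with causal models, assumed here and not confirmed by the source, is to score a variant by its sequence log-likelihood relative to the wild type. The sketch below uses a hypothetical toy next-residue model (`toy_next_token_logprobs`) as a stand-in for a real causal PLM; all names and sequences are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_next_token_logprobs(prefix):
    """Hypothetical stand-in for a causal PLM: log P(next residue | prefix).

    Toy rule (illustration only): repeating the previous residue is
    three times as likely as any other residue.
    """
    weights = {aa: 1.0 for aa in AMINO_ACIDS}
    if prefix:
        weights[prefix[-1]] = 3.0
    total = sum(weights.values())
    return {aa: math.log(w / total) for aa, w in weights.items()}


def sequence_log_likelihood(seq, next_token_logprobs=toy_next_token_logprobs):
    """Sum left-to-right log-probabilities of each residue given its prefix."""
    return sum(next_token_logprobs(seq[:i])[seq[i]] for i in range(len(seq)))


def zero_shot_score(wild_type, variant, next_token_logprobs=toy_next_token_logprobs):
    """Log-likelihood ratio of variant vs. wild type (assumed scoring rule)."""
    return (sequence_log_likelihood(variant, next_token_logprobs)
            - sequence_log_likelihood(wild_type, next_token_logprobs))


wild_type = "MKKA"
print(zero_shot_score(wild_type, "MKLA"))  # substitution K3L
print(zero_shot_score(wild_type, "MKA"))   # deletion of one K
```

Because the score is a whole-sequence likelihood, the same function handles substitutions and indels without any alignment between variant and wild type. Raw log-likelihoods tend to favor shorter sequences, so length normalization is one option to consider when comparing indel variants.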

Code & Models

Models

🤗 nappenstance/proust_v0 (model)

Taxonomy

Topics: Machine Learning in Bioinformatics · Artificial Intelligence in Healthcare and Education · Genomics and Rare Diseases