Sparse Autoencoders for Low-$N$ Protein Function Prediction and Design

Darin Tsui; Kunal Talreja; Amirali Aghazadeh

arXiv:2508.18567·cs.LG·August 27, 2025


TL;DR

This paper shows that sparse autoencoders (SAEs) trained on fine-tuned ESM2 embeddings can predict protein function and guide design from small assay-labeled datasets, matching or outperforming their ESM2 baselines while capturing biologically meaningful features.

Contribution

The paper systematically evaluates SAEs trained on fine-tuned ESM2 embeddings for low-$N$ protein function prediction and design, showing that they match or outperform their ESM2 baselines in data-scarce regimes.

Findings

01 SAEs outperform or match ESM2 baselines with as few as 24 sequences.

02 SAEs encode compact, biologically meaningful representations.

03 Steering latent spaces yields high-fitness protein variants in 83% of cases.

Abstract

Predicting protein function from amino acid sequence remains a central challenge in data-scarce (low-$N$) regimes, limiting machine learning-guided protein design when only small amounts of assay-labeled sequence-function data are available. Protein language models (pLMs) have advanced the field by providing evolutionary-informed embeddings, and sparse autoencoders (SAEs) have enabled decomposition of these embeddings into interpretable latent variables that capture structural and functional features. However, the effectiveness of SAEs for low-$N$ function prediction and protein design has not been systematically studied. Herein, we evaluate SAEs trained on fine-tuned ESM2 embeddings across diverse fitness extrapolation and protein engineering tasks. We show that SAEs, with as few as 24 sequences, consistently outperform or compete with their ESM2 baselines in fitness prediction,…
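To make the idea of decomposing an embedding into sparse latent variables concrete, here is a minimal sketch of a sparse autoencoder forward pass with a simple latent-steering step. It is illustrative only: the excerpt does not describe the paper's SAE architecture, sparsity mechanism, or training procedure, so the top-k sparsity rule, the dimensions, the random weights, and the names `init_sae`, `encode`, `decode`, and `steer` are all assumptions, and random vectors stand in for ESM2 embeddings.

```python
import numpy as np


def init_sae(d_model, d_latent, seed=0):
    """Randomly initialised SAE parameters (untrained; for illustration only)."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(0.0, 0.1, size=(d_model, d_latent))
    W_dec = W_enc.T.copy()
    # Normalise each decoder row so every latent maps to a unit-norm direction.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": W_enc,
        "b_enc": np.zeros(d_latent),
        "W_dec": W_dec,
        "b_dec": np.zeros(d_model),
    }


def encode(params, x, k):
    """Map embeddings to non-negative latents, keeping at most k active per row.

    Top-k sparsity is an assumption here, not necessarily the paper's choice.
    """
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    z = np.maximum(pre, 0.0)
    if k < z.shape[-1]:
        # Indices of the (d_latent - k) smallest activations in each row.
        drop = np.argpartition(z, -k, axis=-1)[..., :-k]
        np.put_along_axis(z, drop, 0.0, axis=-1)
    return z


def decode(params, z):
    """Reconstruct embeddings as a sparse combination of decoder directions."""
    return z @ params["W_dec"] + params["b_dec"]


def steer(z, latent_idx, delta):
    """Return a copy of z with one latent shifted by delta (hypothetical steering)."""
    z_new = z.copy()
    z_new[..., latent_idx] += delta
    return z_new


# Random vectors standing in for per-sequence embeddings (assumed sizes).
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16))

params = init_sae(d_model=16, d_latent=64)
z = encode(params, x, k=8)
x_hat = decode(params, z)

# Shifting a single latent moves the reconstruction along that latent's decoder row.
x_steered = decode(params, steer(z, latent_idx=3, delta=2.0))
```

Because the decoder is linear, steering one latent by `delta` changes the reconstruction by exactly `delta` times that latent's decoder row, which is why individual latents can be read as directions in embedding space.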


Code & Models

Hugging Face model: ktalreja/LowNSAE
