Efficient generative modeling of protein sequences using simple autoregressive models
Jeanne Trinquier, Guido Uguzzoni, Andrea Pagnani, Francesco Zamponi, Martin Weigt

TL;DR
This paper introduces simple autoregressive models for protein sequence generation that are highly accurate and computationally efficient, enabling better exploration of protein sequence space compared to more complex models.
Contribution
The authors propose and validate simple autoregressive models that match the performance of complex models at a fraction of the computational cost, with added mathematical advantages.
Findings
The models perform similarly to more complex approaches but are 100-1000 times faster.
They can estimate both the probability of a given sequence and the size of the functional sequence space.
For response regulators, the estimated functional sequence space contains approximately 10^68 possible sequences.
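The sequence-space estimate can be read through the standard link between a model's entropy and the effective number of sequences it covers. The notation below is illustrative and not taken from this summary; it assumes the usual autoregressive factorization and that the effective size is defined as the exponential of the entropy measured in nats:

```latex
P(a_1,\dots,a_L) = \prod_{i=1}^{L} P(a_i \mid a_1,\dots,a_{i-1}), \qquad
S = -\sum_{a_1,\dots,a_L} P(a_1,\dots,a_L)\,\ln P(a_1,\dots,a_L), \qquad
\Omega \approx e^{S}
```

Under that convention, a size of about 10^68 sequences corresponds to an entropy of roughly 68 ln 10 ≈ 157 nats.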
Abstract
Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence…
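To make the abstract's claims concrete, here is a minimal sketch of why an autoregressive model gives exact sequence probabilities, direct sampling, and a simple entropy estimate. It is not the authors' implementation: the softmax parameterization with site fields `h` and couplings `J` to earlier positions is one simple choice assumed here, the parameters are random rather than fitted to an alignment, and all function names are hypothetical.

```python
import numpy as np

Q = 21  # assumed alphabet size: 20 amino acids plus a gap symbol


def init_params(L, q=Q, seed=0, scale=0.1):
    """Random (unfitted) parameters, for illustration only."""
    rng = np.random.default_rng(seed)
    h = rng.normal(0.0, scale, size=(L, q))        # site fields h[i, a]
    J = rng.normal(0.0, scale, size=(L, L, q, q))  # couplings J[i, j, a, b]; only j < i is used
    return h, J


def log_softmax(x):
    x = x - x.max()  # shift for numerical stability
    return x - np.log(np.exp(x).sum())


def conditional_log_probs(h, J, seq, i):
    """log P(a_i = a | a_1..a_{i-1}) for every symbol a."""
    logits = h[i].copy()
    for j in range(i):
        logits += J[i, j, :, seq[j]]
    return log_softmax(logits)


def log_prob(h, J, seq):
    """Exact log-probability: a sum of L normalized conditionals."""
    return sum(conditional_log_probs(h, J, seq, i)[seq[i]] for i in range(len(seq)))


def sample(h, J, rng):
    """Draw one sequence position by position; no Markov chain is needed."""
    L, q = h.shape
    seq = np.zeros(L, dtype=int)
    for i in range(L):
        p = np.exp(conditional_log_probs(h, J, seq, i))
        seq[i] = rng.choice(q, p=p)
    return seq


def entropy_estimate(h, J, n_samples, rng):
    """Monte Carlo estimate of the entropy (in nats): -mean log P over model samples."""
    return -np.mean([log_prob(h, J, sample(h, J, rng)) for _ in range(n_samples)])
```

Because every conditional is normalized over only `q` symbols, the probability of a full sequence needs no intractable partition function, which is the kind of structural simplicity the abstract refers to. Exponentiating the entropy estimate gives an effective sequence-space size under the assumption that size is defined as exp(entropy).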
