A Diffusion Model to Shrink Proteins While Maintaining Their Function
Ethan Baron, Alan N. Amin, Ruben Weitzman, Debora Marks, Andrew Gordon Wilson

TL;DR
This paper introduces SCISOR, a discrete diffusion model that effectively shortens protein sequences by deleting letters while preserving function, outperforming previous models in generating realistic, functional proteins.
Contribution
SCISOR is the first discrete diffusion model designed for protein sequence deletion, enabling efficient, realistic shrinking of proteins while maintaining their biological function.
Findings
SCISOR achieves state-of-the-art predictions of deletion effects.
Deletions suggested by SCISOR preserve functional motifs more effectively.
SCISOR generates more realistic proteins than previous models.
Abstract
Many proteins useful in modern medicine or bioengineering are challenging to make in the lab, fuse with other proteins in cells, or deliver to tissues in the body, because their sequences are too long. Shortening these sequences typically involves costly, time-consuming experimental campaigns. Ideally, we could instead use modern models of massive databases of sequences from nature to learn how to propose shrunken proteins that resemble sequences found in nature. Unfortunately, these models struggle to efficiently search the combinatorial space of all deletions, and are not trained with inductive biases to learn how to delete. To address this gap, we propose SCISOR, a novel discrete diffusion model that deletes letters from sequences to generate protein samples that resemble those found in nature. To do so, SCISOR trains a de-noiser to reverse a forward noising process that adds random…
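To give a sense of the "combinatorial space of all deletions" mentioned in the abstract, here is a minimal Python sketch. It is not part of SCISOR. The helper names `num_deletion_sets` and `shrink` and the toy sequence are hypothetical and chosen only for illustration. The sketch counts how many ways there are to choose which positions to delete, and applies one such choice to a short sequence.

```python
from math import comb


def num_deletion_sets(length: int, num_deletions: int) -> int:
    """Count the ways to choose `num_deletions` positions out of `length`.

    Different position sets can yield the same shortened sequence
    (for example, deleting either of two adjacent identical letters),
    so this is an upper bound on the number of distinct results.
    """
    return comb(length, num_deletions)


def shrink(sequence: str, positions: set[int]) -> str:
    """Return `sequence` with the given 0-based positions deleted."""
    return "".join(ch for i, ch in enumerate(sequence) if i not in positions)


if __name__ == "__main__":
    toy = "MKTAYIAKQR"  # hypothetical toy sequence, not from the paper
    print(shrink(toy, {0, 3}))
    print(num_deletion_sets(len(toy), 2))
    # Removing 10% of a 300-letter sequence already gives an
    # astronomically large number of candidate deletion sets.
    print(num_deletion_sets(300, 30))
```

Because the count grows so quickly with sequence length, scoring every candidate with a separate model call is impractical, which is the search problem the abstract describes.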
Taxonomy
Topics: Machine Learning in Bioinformatics · Fractal and DNA sequence analysis · Protein Structure and Dynamics
