# Size and structure of the sequence space of repeat proteins

**Authors:** Jacopo Marchi, Ezequiel A. Galpern, Rocio Espada, Diego U. Ferreiro, Aleksandra M. Walczak, Thierry Mora

arXiv: 1905.04493 · 2020-11-20

## TL;DR

This paper estimates the size and explores the complex, rugged structure of the sequence space of repeat proteins, revealing hierarchical subtypes and correlations that influence diversity and have implications for protein design.

## Contribution

It introduces a maximum entropy modeling approach to quantify the sequence space of repeat proteins and uncovers the hierarchical, rugged landscape structure of their sequence diversity.

## Key findings

- A significant impact of amino acid correlations on sequence diversity
- Identification of a rugged, hierarchical landscape with multiple local minima
- Presence of subtypes within protein families based on sequence clustering

## Abstract

The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family, the total number of sequences in that family, can be estimated using models of maximum entropy trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ~30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes with a hierarchical structure, reminiscent of frustrated energy landscapes of spin glasses in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
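To make the notion of "size of the coding space" concrete, here is a minimal Python sketch of the simplest case: an independent-site estimate that uses only per-position amino acid conservation. It is an illustration under stated assumptions, not the authors' pipeline. It takes the effective number of sequences to be the exponential of the Shannon entropy, and the function names and the toy alignment are hypothetical. Because it ignores the correlations between positions that the abstract reports as significant, it can only give an upper bound relative to a maximum entropy model that also constrains pairwise statistics.

```python
import math
from collections import Counter


def site_entropies(alignment):
    """Shannon entropy (in nats) of each column of an alignment.

    `alignment` is a list of equal-length strings (hypothetical toy input).
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    entropies = []
    for i in range(length):
        # Empirical amino acid frequencies at position i.
        counts = Counter(seq[i] for seq in alignment)
        entropy = -sum((c / n_seqs) * math.log(c / n_seqs) for c in counts.values())
        entropies.append(entropy)
    return entropies


def effective_size_independent(alignment):
    """Effective number of sequences, exp(total entropy), assuming
    positions are independent (no correlations between sites)."""
    return math.exp(sum(site_entropies(alignment)))


# Toy alignment: position 0 is fully conserved, positions 1 and 2
# each use two amino acids with equal frequency.
toy_alignment = ["ACD", "ACE", "AGD", "AGE"]

size = effective_size_independent(toy_alignment)
random_size = 20 ** len(toy_alignment[0])  # all sequences of this length

print(f"independent-site effective size: {size:.2f}")
print(f"completely random sequences:     {random_size}")
```

In this toy case the conserved position contributes no entropy and each of the other two contributes ln 2, so the effective size is far below the 20^3 sequences of the same length. Any model that adds correlation constraints on top of the same single-site frequencies can only lower this number further.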

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.04493/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1905.04493/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/1905.04493/full.md

---
Source: https://tomesphere.com/paper/1905.04493