On the Natural Structure of Amino Acid Patterns in Families of Protein Sequences
Pablo Turjanski, Diego U. Ferreiro

TL;DR
This paper introduces a precise method for identifying and classifying amino acid patterns in protein sequences, revealing that short repetitions significantly distinguish natural protein families from random sequences.
Contribution
It provides a mathematically precise definition of repetition, an efficient algorithm to detect it, and a scoring system with no adjustable parameters, improving the understanding of protein family structure.
Findings
Short repetitions distinguish natural families from random sequences by over 10 standard deviations.
Patterns shorter than 5 residues are effectively random.
A small subset of patterns can define sequence familiarity robustly.
Abstract
All known terrestrial proteins are coded as continuous strings of ~20 amino acids. The patterns formed by the repetitions of elements in groups of finite sequences describe the natural architectures of protein families. We present a method to search for patterns and groupings of patterns in protein sequences using a mathematically precise definition for 'repetition', an efficient algorithmic implementation and a robust scoring system with no adjustable parameters. We show that the sequence patterns can be well-separated into disjoint classes according to their recurrence in nested structures. The statistics of pattern occurrences indicate that short repetitions are enough to account for the differences between natural families and randomized groups by more than 10 standard deviations, while patterns shorter than 5 residues are effectively random. A small subset of patterns is…
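To make the comparison between natural families and randomized groups concrete, here is a minimal Python sketch. It is not the authors' method: it swaps in plain fixed-length k-mer counting for the paper's parameter-free definition of repetition, and the names `repeated_kmers`, `shuffle_family`, `repeat_stats` and the `toy_family` sequences are hypothetical, chosen only for illustration. It counts the distinct k-mers that occur at least twice in a group of sequences and expresses that count in standard deviations from the mean over composition-preserving shuffles.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def repeated_kmers(seqs, k):
    """Return {kmer: count} for k-mers seen at least twice across all sequences.

    Overlapping occurrences are counted. Fixed k is an assumption of this
    sketch, not part of the paper's definition of repetition.
    """
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= 2}


def shuffle_family(seqs, rng):
    """Shuffle residues within each sequence, keeping length and composition."""
    shuffled = []
    for s in seqs:
        residues = list(s)
        rng.shuffle(residues)
        shuffled.append("".join(residues))
    return shuffled


def repeat_stats(seqs, k, n_shuffles=200, seed=0):
    """Compare the number of distinct repeated k-mers with a shuffled null.

    Returns (observed, null_mean, null_std, z). z is None when the null
    distribution has zero spread, since the z-score is undefined there.
    """
    rng = random.Random(seed)
    observed = len(repeated_kmers(seqs, k))
    null = [
        len(repeated_kmers(shuffle_family(seqs, rng), k))
        for _ in range(n_shuffles)
    ]
    mu, sigma = mean(null), pstdev(null)
    z = (observed - mu) / sigma if sigma > 0 else None
    return observed, mu, sigma, z


# Hypothetical toy "family": four made-up sequences sharing one motif.
toy_family = [
    "MKGDTPLHLAAQNGDTPLHLAAEV",
    "SRGDTPLHLAAKYWGDTPLHLAAC",
    "GDTPLHLAAFINEQGDTPLHLAAM",
    "TVGDTPLHLAARSGDTPLHLAAWY",
]

if __name__ == "__main__":
    observed, mu, sigma, z = repeat_stats(toy_family, k=5)
    print("observed repeated 5-mers:", observed)
    print("shuffled mean and std:", mu, sigma)
    print("z-score:", z)
```

Shuffling within each sequence keeps length and amino acid composition fixed, so any excess of repeated k-mers in the unshuffled group reflects residue order rather than composition. The toy data are too small to say anything about the thresholds reported in the abstract; the sketch only shows the shape of the comparison.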
Taxonomy
Topics: Genomics and Phylogenetic Studies · Advanced Proteomics Techniques and Applications · Machine Learning in Bioinformatics
