Protein Lipograms
Jason Laurie, Amit K Chattopadhyay, Darren R Flower

TL;DR
This paper introduces the protein lipogram, a linguistic approach that characterizes protein sequences by the amino acid types they omit. The resulting patterns correlate with metabolic cost and differentiate the proteomes of archaea, bacteria, eukaryotes, and viruses.
Contribution
It establishes a new lipogram-based framework for protein sequence analysis and demonstrates its potential in classifying proteomes across different domains of life.
Findings
Protein lipograms exhibit power-law properties.
A correlation exists between lipogram patterns and metabolic cost.
Lipogram analysis can differentiate proteomes of archaea, bacteria, eukaryotes, and viruses.
Abstract
Linguistic analysis of protein sequences is an underexploited technique. Here, we capitalize on the concept of the lipogram to characterize sequences at the proteome level. A lipogram is a literary composition that omits one or more letters. A protein lipogram likewise omits one or more types of amino acid. In this article, we establish a usable terminology for the decomposition of a sequence collection in terms of the lipogram. Next, we characterize UniRef50 using a lipogram decomposition. At the global level, protein lipograms exhibit power-law properties. A clear correlation with metabolic cost is seen. Finally, we use the lipogram construction to differentiate proteomes between the four branches of the tree-of-life: archaea, bacteria, eukaryotes and viruses. We conclude from this pilot study that the lipogram demonstrates considerable potential as an additional tool for sequence…
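The definition in the abstract (a protein lipogram omits one or more amino acid types) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `missing_residues` and `lipogram_decomposition`, the restriction to the 20 standard amino acids, the choice to ignore non-standard letters such as X, B, Z and U, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: only the 20 standard amino acids count toward the alphabet.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def missing_residues(sequence):
    """Return the standard amino acid types that never occur in the sequence.

    Letters outside the standard alphabet are ignored (an assumption).
    """
    present = set(sequence.upper()) & STANDARD_AMINO_ACIDS
    return frozenset(STANDARD_AMINO_ACIDS - present)


def lipogram_decomposition(sequences):
    """Count sequences by how many amino acid types each one omits."""
    counts = Counter()
    for seq in sequences:
        counts[len(missing_residues(seq))] += 1
    return counts


# Toy sequences, invented for illustration only.
toy = ["ACDEFGHIKLMNPQRSTVWY", "MKKLLA", "GGGG"]
decomposition = lipogram_decomposition(toy)
print(sorted(decomposition.items()))
```

Grouping by the number of omitted types is only one possible decomposition; grouping by the exact set returned by `missing_residues` would keep track of which residues are absent, which is the information a comparison with metabolic cost would need.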
