A Categorical Compositional Distributional Modelling for the Language of Life
Yanying Wu, Quanlong Wang

TL;DR
This paper explores applying the Categorical Compositional Distributional (DisCoCat) model to proteins, representing their structure and function in a vector space framework to gain new biological insights.
Contribution
It introduces a novel approach to model protein functions using the DisCoCat framework, treating proteins as sentences and domains as words, linking biological functions with linguistic structures.
Findings
Proteins can be modeled as sentences with domains as words.
Protein functions are represented in vector spaces within the model.
The approach offers a new formalization of protein functional representation.
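To make the findings above concrete, here is a minimal sketch of how a DisCoCat-style composition could look for a three-domain protein. It borrows the standard DisCoCat treatment of a transitive sentence, where two argument words are vectors in a space N and the relational word is a tensor in N ⊗ S ⊗ N. Whether the paper assigns types to domains in this way is an assumption. The domain names, dimensions, and random values are all hypothetical and do not come from the paper.

```python
import numpy as np

# Hypothetical toy setup: names, dimensions, and values are illustrative
# assumptions, not taken from the paper.
N_DIM = 3  # dimension of the domain-function space N (assumed)
S_DIM = 2  # dimension of the whole-protein function space S (assumed)

rng = np.random.default_rng(0)

# Two "argument-like" domains, each a plain vector in N.
domain_a = rng.random(N_DIM)
domain_c = rng.random(N_DIM)

# One "relational" domain, typed like a transitive verb:
# an order-3 tensor in N (x) S (x) N.
domain_b = rng.random((N_DIM, S_DIM, N_DIM))


def compose(left, relational, right):
    """Contract the two N indices of the relational tensor with the
    flanking domain vectors, leaving a vector in S."""
    return np.einsum("i,isj,j->s", left, relational, right)


# "Meaning" of the toy protein A-B-C as a vector in S.
protein_meaning = compose(domain_a, domain_b, domain_c)
print(protein_meaning.shape)
```

The contraction is linear in each domain vector, so scaling one domain's vector scales the protein vector by the same factor. Richer domain arrangements would need different tensor shapes and contraction patterns.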
Abstract
The Categorical Compositional Distributional (DisCoCat) model is a powerful mathematical model for composing the meaning of sentences in natural languages. Since we can think of biological sequences as the "language of life", it is tempting to apply the DisCoCat model to the language of life to see if we can obtain new insights into it and a better understanding of it. In this work, we take an initial step in that direction. In particular, we choose to focus on proteins, as their linguistic features are the most prominent compared with those of other macromolecules such as DNA or RNA. Concretely, we treat each protein as a sentence and its constituent domains as words. The meaning of a word or a sentence is just its biological function, and the arrangement of domains in a protein corresponds to the syntax. Putting all of this into the DisCoCat framework, we can "compute" the…
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Fractal and DNA sequence analysis
