A word recurrence based algorithm to extract genomic dictionaries

Vincenzo Bonnici; Giuditta Franco; Vincenzo Manca

arXiv:2009.10449·q-bio.GN·September 23, 2020

Open Access

TL;DR

This paper introduces an information-theoretic algorithm that extracts small, significant dictionaries of variable-length genomic words; applied to human chromosomes, it reveals an inter-chromosomal similarity in factor distributions.

Contribution

The paper combines conceptual analysis with empirical studies into a methodology for extracting variable-length dictionaries from genomic sequences, based on the information content of their factors.

Findings

01 Identifies significant genomic word dictionaries of variable length

02 Reveals inter-chromosomal similarities in human genomes

03 Demonstrates effectiveness of the information theory approach

Abstract

Genomes may be analyzed from an information viewpoint as very long strings, containing functional elements of variable length, which have been assembled by evolution. In this work an innovative information theory based algorithm is proposed, to extract significant (relatively small) dictionaries of genomic words. Namely, conceptual analyses are here combined with empirical studies, to open up a methodology for the extraction of variable length dictionaries from genomic sequences, based on the information content of some factors. Its application to human chromosomes highlights an original inter-chromosomal similarity in terms of factor distributions.
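
As a rough illustration of what "word recurrence" and "information content" can mean for a genomic string, the sketch below counts overlapping words (factors) of a fixed length k, lists the words that recur, and computes the Shannon entropy of the observed word frequencies. It is not the algorithm proposed in the paper, whose details are not given in this abstract; the function names, the fixed k, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def word_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping length-k word (factor) in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def empirical_entropy(counts: Counter) -> float:
    """Shannon entropy, in bits, of the observed word-frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def repeated_words(counts: Counter) -> set:
    """Words that occur more than once, i.e. the recurrent ones."""
    return {word for word, c in counts.items() if c > 1}


# Hypothetical toy sequence, chosen only to keep the example small.
toy = "ACGTACGTTTACG"
counts = word_counts(toy, 3)
repeats = repeated_words(counts)
entropy = empirical_entropy(counts)
```

On a real chromosome the same counting would be repeated over a range of word lengths, which is where a variable-length dictionary would have to come from; how the paper selects words and lengths is not specified here.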

Taxonomy

Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Fractal and DNA sequence analysis