Clustering genomic words in human DNA using peaks and trends of distributions

Ana Helena Tavares; Jakob Raymaekers; Peter J. Rousseeuw; Paula Brito; Vera Afreixo

arXiv:1808.04278·stat.AP·January 13, 2021·Adv. Data Anal. Classif.

TL;DR

This paper introduces a novel clustering method for genomic words in human DNA based on their inter-word lag distribution peaks and trends, effectively grouping words with similar distribution patterns.
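To make the notion of an inter-word lag distribution concrete, here is a minimal sketch. It assumes that a lag is the distance between the start positions of consecutive occurrences of a word and that overlapping occurrences are counted; the function name `lag_distribution` and the toy sequence are illustrative and not taken from the paper.

```python
from collections import Counter


def lag_distribution(sequence: str, word: str) -> dict[int, float]:
    """Relative frequency of distances between consecutive occurrences of `word`.

    Assumption: lags are measured between start positions, and overlapping
    occurrences are counted. The paper's exact convention may differ.
    """
    # Collect the start position of every occurrence, allowing overlaps.
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)

    # A lag is the difference between two successive start positions.
    lags = [b - a for a, b in zip(positions, positions[1:])]
    counts = Counter(lags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lag: n / total for lag, n in sorted(counts.items())}


# Toy sequence (hypothetical data): "ACG" starts at positions 0, 3, 8, 11, 14.
toy = "ACGACGTTACGACGACG"
dist = lag_distribution(toy, "ACG")
print(dist)  # {3: 0.75, 5: 0.25}
```

On a real chromosome the same counting would yield a long histogram over lags, which is the object the clustering procedure works on.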

Contribution

The work presents a new clustering approach that decomposes distributions into baseline and peak components using robust fitting, tailored for spiked genomic data.

Findings

01

A simulation study shows effective grouping of distributions with similar peak behavior and/or baseline features

02

Application to the human genome compares the distribution patterns of genomic words of lengths 3 and 5

03

Potential for identifying words with similar distance patterns

Abstract

In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the 'trend'), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
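The baseline-plus-peaks decomposition described in the abstract can be illustrated with a rough sketch. The abstract does not specify the robust fitting method, so this example substitutes a running median as a simple outlier-resistant trend estimate and keeps only large positive residuals as the sparse peak vector. The function `decompose`, its parameters `window`, `k`, and `min_height`, and the synthetic data are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np


def decompose(freq, window=11, k=5.0, min_height=1e-3):
    """Split a lag distribution into a smooth baseline and a sparse peak vector.

    Stand-in technique: running median for the trend, MAD-based threshold
    for the peaks. All parameters here are hypothetical.
    """
    freq = np.asarray(freq, dtype=float)
    half = window // 2
    # Running median with edge padding: isolated spikes barely move it.
    padded = np.pad(freq, half, mode="edge")
    baseline = np.array(
        [np.median(padded[i:i + window]) for i in range(freq.size)]
    )
    residual = freq - baseline
    # Robust residual scale from the median absolute deviation (MAD).
    mad = 1.4826 * np.median(np.abs(residual - np.median(residual)))
    # The floor `min_height` guards against a zero MAD on very smooth data.
    threshold = max(k * mad, min_height)
    # Sparse peak vector: zero everywhere except at clear upward spikes.
    peaks = np.where(residual > threshold, residual, 0.0)
    return baseline, peaks


# Synthetic example (not real genomic data, and not normalized to sum to 1):
# an exponentially decaying trend with two spikes added at indices 29 and 89.
lags = np.arange(1, 201)
trend = 0.02 * np.exp(-0.02 * lags)
spiked = trend.copy()
spiked[[29, 89]] += 0.01

baseline, peaks = decompose(spiked)
print(np.flatnonzero(peaks))  # [29 89]
```

In a setup like this, each word would be summarized by its `baseline` and `peaks` vectors, and a clustering step could then compare words on either component or on both.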
