Clustering genomic words in human DNA using peaks and trends of distributions
Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Paula Brito, Vera Afreixo

TL;DR
This paper introduces a clustering method for genomic words in human DNA that groups words by the peaks and trends of their inter-word lag distributions.
Contribution
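The work presents a clustering approach that decomposes each inter-word lag distribution into a baseline (the trend) and a sparse peak component. The baseline is estimated with an outlier-robust fit, which suits the strongly spiked histograms found in genomic data.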
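As a rough illustration of the decomposition idea, the Python sketch below splits a histogram into a smooth baseline and a sparse peak vector. It is not the authors' estimator: a running median stands in for the outlier-robust fit, and the function name `decompose`, the `window` and `k` parameters, and the noise-free toy data are all assumptions made for this example.

```python
import numpy as np

def decompose(freq, window=11, k=3.0):
    """Split a histogram into (baseline, peaks).

    Hypothetical helper: a running median replaces the paper's robust fit.
    """
    freq = np.asarray(freq, dtype=float)
    half = window // 2
    # Repeat the edge values so every position has a full window.
    padded = np.pad(freq, half, mode="edge")
    # Running median: a few spikes inside a window barely move it.
    baseline = np.array(
        [np.median(padded[i:i + window]) for i in range(freq.size)]
    )
    resid = freq - baseline
    # Robust residual scale (MAD, rescaled to match a Gaussian sd).
    # On noise-free data this is 0, so any positive residual is kept.
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    # Keep only residuals well above the baseline; everything else is 0.
    peaks = np.where(resid > k * scale, resid, 0.0)
    return baseline, peaks

# Toy data: a decaying trend with two artificial spikes.
lags = np.arange(1, 101)
freq = np.exp(-lags / 30.0)
freq[[9, 49]] += 0.5
freq /= freq.sum()

baseline, peaks = decompose(freq)
print(np.nonzero(peaks)[0])  # indices of the detected peaks
```

On this toy input the nonzero entries of `peaks` fall at the two spiked positions, while `baseline` follows the decaying trend. Real lag histograms are noisy, so `window` and `k` would need tuning.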
Findings
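A simulation study shows that the procedure groups distributions with similar peak behavior and/or baseline features
Application to the human genome compares the distribution patterns of genomic words of lengths 3 and 5
The experiments indicate the method's potential for identifying words with similar distance patterns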
Abstract
In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the 'trend'), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
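For readers unfamiliar with the term, the Python sketch below shows one way an inter-word lag distribution could be computed. It assumes the lag is the distance between the start positions of consecutive occurrences of the same word, with overlapping occurrences counted. The function name `lag_distribution` and the short toy sequence are hypothetical and are not taken from the paper.

```python
from collections import Counter

def lag_distribution(sequence, word):
    """Relative frequency of distances between consecutive occurrences of `word`."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are found too.
        start = sequence.find(word, start + 1)
    # Lag = difference between neighbouring start positions.
    lags = [b - a for a, b in zip(positions, positions[1:])]
    counts = Counter(lags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lag: n / total for lag, n in sorted(counts.items())}

# Toy sequence: "ACG" starts at positions 0, 3, 8 and 11.
dist = lag_distribution("ACGACGTTACGACG", "ACG")
print(dist)
```

In the toy sequence the lags are 3, 5 and 3, so lag 3 carries two thirds of the mass and lag 5 one third. Applied to every word of a given length, this yields one histogram per word, which is the kind of input the clustering procedure works on.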
