Probabilistic Models of k-mer Frequencies (Extended Abstract)
Askar Gafurov, Tomáš Vinař, Broňa Brejová

TL;DR
This paper reviews probabilistic models for k-mer frequencies in DNA sequencing, highlighting their ability to infer genomic properties and compare samples, with a focus on modeling dependencies and errors.
Contribution
It provides a comprehensive overview of existing probabilistic models for k-mer abundance, emphasizing their applications in estimating genomic features and sample comparisons.
Findings
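The models capture how k-mer abundance depends on genome size, repeat content, heterozygosity levels, and sequencing error rate.
They enable estimation of these genomic properties from k-mer abundance histograms observed in real data.
The paper also briefly discusses comparing k-mer abundances between related sequencing samples and summarizing the results.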
Abstract
In this article, we review existing probabilistic models of the abundance of fixed-length strings (k-mers) in DNA sequencing data. These models capture the dependence of the abundance on various phenomena, such as the size and repeat content of the genome, heterozygosity levels, and sequencing error rate. This in turn allows these properties to be estimated from k-mer abundance histograms observed in real data. We also briefly discuss the issue of comparing k-mer abundance between related sequencing samples and meaningfully summarizing the results.
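To make the notion of a k-mer abundance histogram concrete, the following minimal Python sketch counts k-mers in a set of reads, builds the histogram, and derives a rough genome size estimate by dividing the total number of k-mer occurrences by the abundance at the histogram peak. This is a simplified illustration and not one of the models reviewed in the paper: the function names, the `min_abundance` cutoff for discarding likely sequencing errors, and the peak-based estimator are all assumptions made for this example, and the sketch ignores repeats, heterozygosity, and reverse complements.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance a to the number of distinct k-mers seen exactly a times."""
    return Counter(counts.values())


def estimate_genome_size(hist, min_abundance=2):
    """Naive estimate: total k-mer occurrences divided by the peak abundance.

    Assumption: k-mers with abundance below min_abundance are treated as
    likely sequencing errors and ignored. The peak is the abundance shared
    by the most distinct k-mers (ties go to the higher abundance).
    """
    solid = {a: n for a, n in hist.items() if a >= min_abundance}
    if not solid:
        raise ValueError("no k-mers at or above min_abundance")
    peak = max(solid, key=lambda a: (solid[a], a))
    total = sum(a * n for a, n in solid.items())
    return total / peak


if __name__ == "__main__":
    # Toy data: three identical reads plus one read with a substitution error.
    reads = ["ACGTA", "ACGTA", "ACGTA", "ACGGA"]
    hist = abundance_histogram(kmer_counts(reads, k=3))
    print(sorted(hist.items()))
    print(estimate_genome_size(hist))
```

The probabilistic models reviewed in the paper replace the crude peak heuristic with explicit distributions for k-mer abundance, which is what allows them to account for repeats, heterozygosity, and errors rather than simply thresholding them away.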
