Probabilistic Models of k-mer Frequencies (Extended Abstract)

Askar Gafurov; Tomáš Vinař; Broňa Brejová

arXiv:2112.15107·q-bio.QM·January 3, 2022·CiE

TL;DR

This paper reviews probabilistic models of k-mer abundance in DNA sequencing data. These models allow genome properties such as size, repeat content, heterozygosity, and sequencing error rate to be estimated from k-mer abundance histograms. The paper also briefly discusses comparing related samples.

Contribution

It reviews existing probabilistic models of k-mer abundance, with emphasis on estimating genomic properties and comparing related sequencing samples.

Findings

01

The models capture how k-mer abundance depends on genome size, repeat content, heterozygosity, and sequencing error rate.

02

They enable estimation of genomic properties from k-mer histograms.

03

The paper briefly discusses comparing k-mer abundances between related sequencing samples and summarizing the results.

Abstract

In this article, we review existing probabilistic models of the abundance of fixed-length strings (k-mers) in DNA sequencing data. These models capture the dependence of the abundance on various phenomena, such as the size and repeat content of the genome, heterozygosity levels, and sequencing error rate. This in turn allows these properties to be estimated from k-mer abundance histograms observed in real data. We also briefly discuss the issue of comparing k-mer abundance between related sequencing samples and meaningfully summarizing the results.
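
As a rough illustration of what "estimating a property from a k-mer abundance histogram" can mean, the Python sketch below counts k-mers in a set of reads, builds the abundance histogram, and derives a crude genome size estimate. It is not one of the models reviewed in the paper. The function names and the `min_abundance` cutoff are hypothetical, and the estimate assumes uniform coverage, no repeats, and no heterozygosity, with low-abundance k-mers treated as sequencing errors.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance j to the number of distinct k-mers seen exactly j times."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_abundance=2):
    """Crude estimate: total k-mer occurrences divided by the modal abundance.

    Assumption (ours, for illustration): k-mers below `min_abundance`
    are sequencing errors, and the remaining ones are unique genomic
    k-mers covered at roughly the same depth.
    """
    solid = {j: n for j, n in histogram.items() if j >= min_abundance}
    if not solid:
        raise ValueError("no k-mers at or above min_abundance")
    # Modal abundance; ties are broken toward the smaller abundance.
    peak = max(solid, key=lambda j: (solid[j], -j))
    total = sum(j * n for j, n in solid.items())
    return total / peak


if __name__ == "__main__":
    # Three identical reads plus one read carrying a single substitution.
    reads = ["ACGTA"] * 3 + ["ACCTA"]
    hist = abundance_histogram(kmer_counts(reads, k=3))
    print(sorted(hist.items()))        # [(1, 3), (3, 3)]
    print(estimate_genome_size(hist))  # 3.0
```

In the toy input, the three error k-mers from the mutated read appear once each and are filtered out, while the three true k-mers appear three times each, so the estimate equals the number of distinct genomic k-mers. Real histograms mix these components, which is why a full probabilistic model accounts for errors, repeats, and heterozygosity together.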
