Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

Binghang Liu; Yujian Shi; Jianying Yuan; Xuesong Hu; Hao Zhang; Nan Li; Zhenyu Li; Yanxiang Chen; Desheng Mu; Wei Fan

arXiv:1308.2012·q-bio.GN·February 28, 2020

TL;DR

This paper introduces an assembly-independent framework using k-mer frequency analysis to accurately estimate genome size, repeat content, and heterozygosity from sequencing data, aiding genome project planning and analysis.

Contribution

It presents novel techniques for modeling k-mer distributions and improves estimation accuracy over existing methods, addressing challenges of incomplete and complex genomes.

Findings

01. Enhanced estimation accuracy with new k-mer techniques

02. Effective handling of sequencing errors and biases

03. Applicable to diverse genomic and sequencing conditions

Abstract

Background: With the rapid development of next-generation sequencing technologies, increasing numbers of genomes are being sequenced and assembled de novo. However, most remain fragmented, incomplete drafts, so it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, which poses problems for current assemblers that rely on short reads. It is therefore necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics.

Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating genomic characteristics such as genome size, repeat structure and heterozygous rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and…
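To make the idea of estimating genome size from k-mer frequency concrete, here is a minimal Python sketch of the widely used peak-based estimate: count k-mers in the reads, build the frequency histogram, take the most common depth as the coverage peak, and divide the total k-mer count by that depth. This is not the paper's framework (it does not model k-mer individuals, float precision, heterozygosity, or repeats); the function names, the `min_depth` error cutoff, and the toy reads are all assumptions made for illustration, and reverse-complement (canonical) k-mers are ignored for simplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_histogram(kmer_counts):
    """Map each depth (times a k-mer was seen) to the number of distinct k-mers at that depth."""
    return Counter(kmer_counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Peak-based estimate: total k-mer occurrences divided by the peak depth.

    k-mers seen fewer than `min_depth` times are treated as likely sequencing
    errors and dropped (a hypothetical, deliberately crude cutoff).
    """
    kept = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # The peak is the depth shared by the largest number of distinct k-mers.
    peak_depth = max(kept, key=lambda depth: kept[depth])
    total_occurrences = sum(depth * n for depth, n in kept.items())
    return total_occurrences / peak_depth, peak_depth


# Toy data: one 16-base "genome" read 5 times, plus one short erroneous read.
toy_genome = "ACGTTGCAAGCTTAGC"
toy_reads = [toy_genome] * 5 + ["ACGA"]
toy_hist = frequency_histogram(count_kmers(toy_reads, k=4))
size_estimate, peak = estimate_genome_size(toy_hist)
```

On the toy data the estimate is expressed in distinct k-mers rather than bases (a genome of length G contains G - k + 1 k-mers), and the single low-frequency k-mer from the erroneous read is excluded by the cutoff. Real data need a more careful treatment of errors, repeats, and heterozygous peaks, which is the gap the framework described in the abstract is meant to address.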
