TL;DR
This paper introduces an assembly-independent framework using k-mer frequency analysis to accurately estimate genome size, repeat content, and heterozygosity from sequencing data, aiding genome project planning and analysis.
Contribution
It presents novel techniques for modeling k-mer distributions and improves estimation accuracy over existing methods, addressing challenges of incomplete and complex genomes.
Findings
Enhanced estimation accuracy with new k-mer techniques
Effective handling of sequencing errors and biases
Applicable to diverse genomic and sequencing conditions
Abstract
Background: With the rapid development of next-generation sequencing technologies, increasing numbers of genomes are being sequenced and assembled de novo. However, most remain fragmented, incomplete drafts, so it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, posing problems for current assemblers that use short reads. Therefore, it is necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics. Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating genomic characteristics such as genome size, repeat structure and heterozygous rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and…
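To make the idea of estimating genome size from k-mer frequencies concrete, here is a minimal sketch of the widely used peak-based estimate: genome size is approximately the total number of k-mers divided by the depth at the main peak of the k-mer frequency histogram. This is not the paper's framework, which the abstract describes as modeling the full distribution. The function names, the `min_depth` cutoff used to discard likely error k-mers, and the use of the histogram mode as the peak are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map each depth (times a k-mer was seen) to the number of distinct k-mers at that depth."""
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=2):
    """Peak-based estimate: total k-mers / peak depth.

    hist maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are dropped as likely sequencing errors
    (a crude assumption; the cutoff is hypothetical).
    """
    usable = {d: n for d, n in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Take the depth with the most distinct k-mers as the main peak.
    peak_depth = max(usable, key=lambda d: usable[d])
    # Total k-mer occurrences retained after the error cutoff.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Synthetic histogram: an error spike at depth 1, a main peak at depth 10,
    # and a few repeat k-mers at depth 20.
    hist = {1: 1000, 9: 50, 10: 100, 11: 50, 20: 5}
    print(estimate_genome_size(hist))
```

A hard cutoff and a single peak are the weak points of this sketch: heterozygous genomes typically show a second peak near half the main depth, and repeats add mass at multiples of it, which is the kind of structure a fuller model of the distribution has to account for.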
