# Gene sequence analysis model construction based on k-mer statistics

**Authors:** Dongjie Gao

PMC · DOI: 10.1371/journal.pone.0306480 · PLOS ONE · 2024-09-12

## TL;DR

This paper introduces a new gene sequence alignment model using k-mer statistics to improve analysis efficiency and performance.

## Contribution

A novel gene sequence alignment model and system based on k-mer statistics is proposed.

## Key findings

- The model's statistical power increases with sequence coverage and cutting length.
- The model's statistical power decreases as the K value and module length increase.
- The system's maximum storage capacity was 71 GB, its maximum disk capacity was 135 GB, and its running time was under 2.0 seconds.

## Abstract

With the rapid development of biotechnology, gene sequencing methods have steadily improved, and the structure of gene sequences has become more complex. Traditional sequence alignment methods struggle with such complex alignment work. To improve the efficiency of gene sequence analysis, this study builds a gene sequence alignment analysis model on the D2 series of k-mer statistics. According to the structure of the foreground sequence, the sequence to be aligned is cut at different lengths and divided into multiple subsequences. The maximum dissimilarity among the alignment results for the selected subsequences is then taken as the statistical result. The study also designs an application system for running the model's sequence alignment analysis. The experimental results showed that the statistical power of the sequence alignment analysis model was directly proportional to the sequence coverage and cutting length, and inversely proportional to the K value and module length. When the model was applied to the system designed in this paper, the maximum storage capacity of the system was 71 GB, the maximum disk capacity was 135 GB, and the running time was less than 2.0 s. The k-mer statistic sequence alignment model and system proposed in this study therefore have considerable application value in gene alignment analysis.

## Full-text entities

- **Genes:** Trav6-3 (T cell receptor alpha variable 6-3) [NCBI Gene 328483] {aka Gm13948, Gm193, Gm4, TCR}
- **Diseases:** GBM (MESH:D005910), systemic lupus erythematosus (MESH:D008180), lung cancer disease (MESH:D008175), cancer (MESH:D009369), canine leptospirosis (MESH:D007922)
- **Chemicals:** acid (MESH:D000143), thymine (MESH:D013941), ribonucleotide (MESH:D012265), U (MESH:D014501), uracil (MESH:D014498)
- **Species:** Canis lupus familiaris (dog, subspecies) [taxon 9615], Homo sapiens (human, species) [taxon 9606], Petunia (petunia, genus) [taxon 4101], Human immunodeficiency virus 1 (no rank) [taxon 11676], Mus musculus (house mouse, species) [taxon 10090]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11392344/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11392344/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/PMC11392344/full.md

---
Source: https://tomesphere.com/paper/PMC11392344