# A Feature Engineering Method for Whole-Genome DNA Sequence with Nucleotide Resolution

**Authors:** Ting Wang, Yunpeng Cui, Tan Sun, Huan Li, Chao Wang, Ying Hou, Mo Wang, Li Chen, Jinming Wu

PMC · DOI: 10.3390/ijms26052281 · International Journal of Molecular Sciences · 2025-03-04

## TL;DR

This paper introduces FE-WDNA, a new method for analyzing whole-genome DNA sequences at the nucleotide level to improve plant trait predictions.

## Contribution

The novelty lies in fine-tuning HyenaDNA on whole-genome soybean data to build feature vectors at nucleotide resolution, which the authors report outperform SNP-based methods in agronomic trait prediction.

## Key findings

- FE-WDNA captures long-range dependencies among nucleotides, improving trait prediction accuracy.
- The method outperforms traditional SNP-based approaches in agronomic trait prediction.
- It is adaptable to other plant species and applicable to various breeding tasks.
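To make the contrast with SNP-based approaches concrete, here is a minimal sketch of the two input representations. It is not the authors' code: the toy sequences, the `VOCAB` mapping, and the helper names `variant_sites`, `snp_features`, and `nucleotide_tokens` are all hypothetical, chosen only to show that an SNP-style reduction keeps the variant columns while a nucleotide-resolution representation keeps every position.

```python
# Toy aligned sequences for three hypothetical samples (not real soybean data).
samples = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTTCGTAC",
    "s3": "ACGAACGTAC",
}


def variant_sites(seqs):
    """Return positions where the samples do not all share the same base."""
    length = len(next(iter(seqs.values())))
    return [i for i in range(length) if len({s[i] for s in seqs.values()}) > 1]


def snp_features(seqs):
    """Keep only the variant columns, as an SNP-style reduction would."""
    sites = variant_sites(seqs)
    return {name: [s[i] for i in sites] for name, s in seqs.items()}


# Assumed single-nucleotide vocabulary; unknown bases map to "N".
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def nucleotide_tokens(seq):
    """Encode every position of the sequence, variant or not."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


sites = variant_sites(samples)
snp = snp_features(samples)
tokens = {name: nucleotide_tokens(s) for name, s in samples.items()}

print("variant sites:", sites)
print("SNP-style features:", snp)
print("tokens per sample:", {name: len(t) for name, t in tokens.items()})
```

In this toy case the SNP-style view keeps two columns out of ten, so any signal carried by the invariant context around those sites is discarded before feature extraction.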

## Abstract

Feature engineering for whole-genome DNA sequences plays a critical role in predicting plant phenotypic traits. However, due to limitations in the models’ analytical capabilities and computational resources, the existing methods are predominantly confined to SNP-based approaches, which typically extract genetic variation sites for dimensionality reduction before feature extraction. These methods not only suffer from incomplete locus coverage and insufficient genetic information but also overlook the relationships between nucleotides, thereby restricting the accuracy of phenotypic trait prediction. Inspired by the parallels between gene sequences and natural language, the emergence of large language models (LLMs) offers novel approaches for addressing the challenge of constructing genome-wide feature representations with nucleotide granularity. This study proposes FE-WDNA, a whole-genome DNA sequence feature engineering method that fine-tunes HyenaDNA on whole-genome data from 1000 soybean samples. We thus provide deep insights into the contextual and long-range dependencies among nucleotide sites to derive comprehensive genome-wide feature vectors. We further evaluated the application of FE-WDNA in agronomic trait prediction, examining factors such as the context window length of the DNA input, feature vector dimensions, and trait prediction methods, achieving significant improvements compared to the existing SNP-based approaches. FE-WDNA provides a mode of high-quality DNA sequence feature engineering at nucleotide resolution, which can be transferred to other plants and directly applied to various computational breeding tasks.
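The abstract names two tunable factors, the context window length and the feature vector dimension. The sketch below shows one way such a pipeline could be wired together, under stated assumptions: the sequence is cut into non-overlapping windows, each window is embedded, and the window embeddings are mean-pooled into one genome-wide vector. The summary does not say how FE-WDNA chunks or pools, so both choices are assumptions, and `toy_embed` is a deliberately trivial placeholder (base frequencies) standing in for a fine-tuned HyenaDNA encoder, not a call into any real library.

```python
from typing import Callable, List


def split_into_windows(seq: str, context_len: int) -> List[str]:
    """Cut a sequence into consecutive, non-overlapping windows of at most context_len bases."""
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    return [seq[i:i + context_len] for i in range(0, len(seq), context_len)]


def toy_embed(window: str) -> List[float]:
    """Placeholder embedding: frequencies of A, C, G, T in the window.

    A real pipeline would call a fine-tuned sequence model here instead.
    """
    n = len(window)
    return [window.count(base) / n for base in "ACGT"]


def genome_feature_vector(
    seq: str,
    context_len: int,
    embed: Callable[[str], List[float]] = toy_embed,
) -> List[float]:
    """Embed each window, then mean-pool into one fixed-size vector (pooling is an assumption)."""
    windows = split_into_windows(seq.upper(), context_len)
    if not windows:
        raise ValueError("sequence is empty")
    vectors = [embed(w) for w in windows]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical 12-base sequence with a context window of 4 bases.
features = genome_feature_vector("ACGTACGTAAAA", context_len=4)
print(features)
```

Here the feature dimension is fixed by the embedding function and the window length by `context_len`, so both can be varied independently when comparing downstream trait predictors.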

## Full-text entities

- **Species:** Glycine max (soybean, species) [taxon 3847]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11899767/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11899767/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/PMC11899767/full.md

---
Source: https://tomesphere.com/paper/PMC11899767