# EasyGeSe – a resource for benchmarking genomic prediction methods

**Authors:** Carles Quesada-Traver, Daniel Ariza-Suarez, Bruno Studer, Steven Yates

PMC · DOI: 10.1186/s12864-025-12129-0 · BMC Genomics · 2025-10-24

## TL;DR

EasyGeSe is a tool that provides diverse datasets for benchmarking genomic prediction methods across species and traits.

## Contribution

EasyGeSe introduces a curated, standardized resource for benchmarking genomic prediction methods with multiple species and traits.

## Key findings

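- Predictive performance varied significantly by species and trait (p < 0.001), with Pearson’s correlation coefficient (r) ranging from −0.08 to 0.96 and a mean of 0.62.
- The non-parametric methods random forest, LightGBM and XGBoost showed modest but statistically significant (p < 1e−10) gains in accuracy (+0.014, +0.021 and +0.025 in r, respectively) in comparisons among parametric, semi-parametric and non-parametric models.
- The non-parametric methods also fitted models typically an order of magnitude faster and used approximately 30% less RAM than Bayesian alternatives, although these measurements do not include the cost of hyperparameter tuning.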

## Abstract

Genomic prediction is a widely used method to predict phenotypes from genotypic data. Advances in both biological and computer science have enabled the generation of vast amounts of data and the development of new algorithms, specifically in the field of machine learning. However, systematic benchmarking of new genomic prediction methods, which is essential for objective evaluation and comparison, remains limited.

Here, we present EasyGeSe, a tool that provides access to a curated collection of datasets for testing genomic prediction methods. This resource encompasses data from multiple species, including barley, common bean, lentil, loblolly pine, eastern oyster, maize, pig, rice, soybean and wheat, representing a broad biological diversity. We filtered and arranged these data in convenient formats, provided functions in R and Python for easy loading and benchmarked several modelling strategies for genomic prediction. Predictive performance, measured by Pearson’s correlation coefficient (r), varied significantly by species and trait (p < 0.001), ranging from −0.08 to 0.96, with a mean of 0.62. Comparisons among parametric, semi-parametric and non-parametric models revealed modest but statistically significant (p < 1e−10) gains in accuracy for the non-parametric methods random forest (+0.014), LightGBM (+0.021) and XGBoost (+0.025). These methods also offered major computational advantages, with model fitting times typically an order of magnitude faster and RAM usage approximately 30% lower than Bayesian alternatives. However, these measurements do not account for the computational costs of hyperparameter tuning.

By standardizing input data and evaluation procedures, this resource simplifies benchmarking and enables fair, reproducible comparisons of genomic prediction methods. It also broadens access to genomic prediction data, encouraging data scientists and interdisciplinary researchers to test novel modelling strategies.
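To make the evaluation metric concrete, the following is a minimal sketch of how Pearson's r between observed and predicted phenotypes might be averaged over cross-validation folds. It is not the EasyGeSe API and not the authors' pipeline: the function names (`pearson_r`, `kfold_indices`, `cross_validated_r`), the choice of five folds and the `fit_predict` callable interface are all assumptions made for illustration, since this summary does not describe the exact cross-validation scheme.

```python
import math
import random


def pearson_r(observed, predicted):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return float("nan")  # r is undefined when either vector is constant
    return cov / math.sqrt(var_o * var_p)


def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k near-equal test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_r(fit_predict, genotypes, phenotypes, k=5, seed=0):
    """Mean Pearson's r over k folds.

    fit_predict is a hypothetical callable (train_X, train_y, test_X) -> predictions,
    so any model can be wrapped and scored the same way.
    """
    scores = []
    for test_idx in kfold_indices(len(phenotypes), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(phenotypes)) if i not in held_out]
        preds = fit_predict(
            [genotypes[i] for i in train_idx],
            [phenotypes[i] for i in train_idx],
            [genotypes[i] for i in test_idx],
        )
        scores.append(pearson_r([phenotypes[i] for i in test_idx], preds))
    return sum(scores) / len(scores)
```

Under these assumptions, comparing two methods amounts to calling `cross_validated_r` with the same genotypes, phenotypes and seed for each wrapped model, so that both are scored on identical folds.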

The online version contains supplementary material available at 10.1186/s12864-025-12129-0.

## Full-text entities

- **Species:** Glycine max (soybean, species) [taxon 3847], Oryza sativa (Asian cultivated rice, species) [taxon 4530], Pinus taeda (loblolly pine, species) [taxon 3352], Crassostrea virginica (eastern oyster, species) [taxon 6565], Sus scrofa (pig, species) [taxon 9823], Lens culinaris (lentil, species) [taxon 3864]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12551357/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12551357/full.md

## References

3 references — full list in the complete paper: https://tomesphere.com/paper/PMC12551357/full.md

---
Source: https://tomesphere.com/paper/PMC12551357