# Estimation of (co)variance components for very large datasets and complex single-step genomic models

**Authors:** Matias Bermann, Andres Legarra, Ignacio Aguilar, Alejandra Alvarez-Munera, Ignacy Misztal, Daniela Lourenco

PMC · DOI: 10.1186/s12711-025-01006-9 · Genetics, Selection, Evolution : GSE · 2025-10-30

## TL;DR

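This paper introduces Monte Carlo single-step genomic REML (MC-ss-GREML), a method for estimating (co)variance components in very large datasets with genomic information, at a fraction of the computing time and memory required by exact single-step REML.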

## Contribution

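The novel contribution is an extension of Monte Carlo REML (MC-REML) to single-step genomic (ssGBLUP) models, so that variance components can be estimated from large datasets without subsetting the data, removing genomic information, or simplifying the statistical model.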

## Key findings

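- In a three-trait beef cattle growth model with maternal effects (14 parameters; 100,000 animals in the pedigree, about 33,000 with records, 10,000 genotyped), MC-ss-GREML estimates did not differ from those of ss-GREML with exact calculation of the traces (exact ss-GREML).
- In that test, MC-ss-GREML took 14% of the computing time and used 1% of the memory of exact ss-GREML.
- For a birth weight model with 7 million animals in the pedigree, of which 330,000 were genotyped, the estimation converged in 11 rounds and took 121 h, with a peak memory usage of 53 GB.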

## Abstract

Variance components of linear mixed models should be estimated with all the data and information available for a specific statistical model to avoid bias. Due to computational limitations, the estimation for large datasets or complex models is often carried out by subsetting the data, removing genomic information, or simplifying the statistical model. Monte Carlo REML (MC-REML) is a method developed to lift computational limitations, but so far there has been no extension for genomic information under the single-step genomic methods. In this study, we extended MC-REML to include large genomic information.

We developed a method to estimate variance components named Monte Carlo single-step genomic REML (MC-ss-GREML). The core of the method includes repeatedly simulating breeding values under a ssGBLUP model and solving the mixed model equations to approximate traces involving prediction error variances. The REML optimization strategies include Expectation Maximization and Average Information. We tested the accuracy of MC-ss-GREML with a three-trait growth model in beef cattle with maternal effects, with 14 parameters to estimate. The dataset had 100,000 animals in the pedigree, of which about 33,000 had records, and 10,000 were genotyped. There were no differences in estimates between MC-ss-GREML and ss-GREML with the exact calculation of the traces (exact ss-GREML). MC-ss-GREML took 14% of the computing time and used 1% of the memory compared to the exact ss-GREML. We tested the computing performance of MC-ss-GREML by estimating variance components in a birth weight model, with much larger data that included 7 million animals in the pedigree, of which 330,000 were genotyped. The estimation converged in 11 rounds and took 121 h, with a peak memory usage of 53 GB.
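
The trace approximation described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation. It assumes a single-trait model with one random effect, uses an arbitrary positive-definite matrix `K` as a stand-in for the single-step relationship matrix, and relies on dense NumPy algebra that would not scale to the datasets in the paper. All function and variable names are hypothetical. The idea shown is that, when breeding values and records are simulated under the current variance components, the difference between the simulated and predicted breeding values has the prediction error (co)variance, so a trace such as tr(K⁻¹ PEV) can be approximated by averaging a quadratic form over samples instead of inverting the coefficient matrix.

```python
import numpy as np

def mme_coefficients(X, Z, K_inv, var_u, var_e):
    """Left-hand side of Henderson's mixed model equations, scaled by var_e."""
    lam = var_e / var_u
    return np.block([[X.T @ X, X.T @ Z],
                     [Z.T @ X, Z.T @ Z + lam * K_inv]])

def exact_trace(X, Z, K, var_u, var_e):
    """tr(K^-1 PEV) from the explicit inverse (feasible only for small problems)."""
    p = X.shape[1]
    K_inv = np.linalg.inv(K)
    C_inv = np.linalg.inv(mme_coefficients(X, Z, K_inv, var_u, var_e))
    pev = var_e * C_inv[p:, p:]          # prediction error (co)variances of u
    return np.trace(K_inv @ pev)

def mc_trace(X, Z, K, var_u, var_e, n_samples, rng):
    """Monte Carlo approximation of tr(K^-1 PEV) by simulate-and-solve."""
    n, q = Z.shape
    p = X.shape[1]
    K_inv = np.linalg.inv(K)
    C = mme_coefficients(X, Z, K_inv, var_u, var_e)
    L = np.linalg.cholesky(K)            # K = L L'
    # One column per Monte Carlo sample.
    U = np.sqrt(var_u) * (L @ rng.standard_normal((q, n_samples)))
    E = np.sqrt(var_e) * rng.standard_normal((n, n_samples))
    Y = Z @ U + E                        # fixed effects set to zero in the simulation
    rhs = np.vstack([X.T @ Y, Z.T @ Y])
    U_hat = np.linalg.solve(C, rhs)[p:]  # predicted u for each simulated dataset
    D = U - U_hat                        # prediction errors, Var(D) = PEV
    return np.mean(np.sum(D * (K_inv @ D), axis=0))

def make_toy_problem(rng, n=200, q=30):
    """Hypothetical toy data: overall mean, one record per row, q animals."""
    X = np.ones((n, 1))
    Z = np.zeros((n, q))
    Z[np.arange(n), rng.integers(0, q, size=n)] = 1.0
    M = rng.standard_normal((q, 50))
    K = M @ M.T / 50 + 0.05 * np.eye(q)  # arbitrary positive-definite matrix
    return X, Z, K

rng = np.random.default_rng(0)
X, Z, K = make_toy_problem(rng)
print("exact:", exact_trace(X, Z, K, 0.4, 0.6))
print("Monte Carlo:", mc_trace(X, Z, K, 0.4, 0.6, 2000, rng))
```

In an EM-REML round for this simple model, the trace enters the update of the genetic variance as (û'K⁻¹û + tr(K⁻¹ PEV)) / q, where û is the solution for the real data. The sketch uses explicit inverses of `K` and a dense solve only to keep the example short and to allow a comparison with the exact value.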

The new method, MC-ss-GREML, can estimate variance components with large datasets including many genotyped individuals, at affordable time and memory costs.

## Full-text entities

- **Species:** Bos taurus (bovine, species) [taxon 9913]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12577077/full.md

## References

1 reference; full list in the complete paper: https://tomesphere.com/paper/PMC12577077/full.md

---
Source: https://tomesphere.com/paper/PMC12577077