Randomization across breeding cohorts improves the accuracy of conventional and genomic selection

Arlyn Ackerman; Jessica Rutkoski

PMC · DOI:10.1002/tpg2.70218·March 16, 2026

Randomization across breeding cohorts improves the accuracy of conventional and genomic selection

Arlyn Ackerman, Jessica Rutkoski

PDF

Open Access

TL;DR

Randomizing breeding groups across trials improves selection accuracy, especially when genomic data is limited.

Contribution

The study introduces a novel comparison of randomization strategies in breeding trials using simulations.

Findings

01

Complete randomization improved BLUP accuracy by 11.7% under low genetic correlation.

02

Genomic data reduced the impact of randomization design on selection accuracy.

03

Sparse genomic testing showed a 5.5% accuracy improvement under challenging conditions.

Abstract

Breeding programs conventionally evaluate cohorts in separate trials; however, environmental differences across testing areas can be confounded with genetic differences between cohorts, potentially reducing the accuracy of breeding value estimation. We test whether the conventional approach of restricting randomization of cohorts to within trials reduces genomic and conventional selection accuracy when compared to the complete randomization of all cohorts across a trial, using in silico simulation with marker data from University of Illinois winter wheat breeding lines. We evaluated selection accuracy for conventional best linear unbiased prediction (BLUP), genomic BLUP (GBLUP), and genomic‐enabled sparse testing across a comprehensive simulation space spanning narrow‐sense heritabilities of 0.2–0.8, genetic correlations between testing areas from 0.2 to 1.0, and three replication…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals1

CR

Diseases3

GRM CR BLUP

Figures9

Click any figure to enlarge with its caption.

Depiction of the testing designs and hierarchy of genetic correlations simulated. A whole environment is represented with a grey rectangle, while areas within environments are represented by the two colored rectangles within each grey rectangle. Five environments were simulated with varying levels of genetic correlation between environments, rj1j2. Under restricted randomization (RR), the areas within environment represent two separate trials with preliminary cohorts, S1 and S2, randomized separately from advanced cohorts, S3 and S4. Under complete randomization (CR), areas within environment represent blocks of a single trial with preliminary and advanced cohorts randomized together. Five levels of genetic correlation between areas were evaluated (rk1k2 = 1.0, 0.8, 0.6, 0.4, or 0.2). Lower rk1k2 values indicate stronger genotype‐by‐area interaction, which may occur due to spatial separation or differential management of the areas.

Corresponding design configurations for each replication level. Design configurations are shown as heatmaps for a single simulation run. Cohorts are color‐coded with S1, S2, S3, and S4 corresponding to dark blue, light blue, dark orange, and light orange, respectively. Replication level, replication per cohort, total test plots per location, and total check plots per environment are displayed next to each design. Comparable pairs of design configurations are shown side by side, with restricted randomization (RR) shown on the left and complete randomization (CR) shown on the right. With RR, preliminary cohorts S1 and S2 are randomized together to simulate a preliminary yield trial (PYT) while advanced cohorts S3 and S4 are randomized together to simulate an advanced yield trial (AYT). Each pair shares constant test plot numbers, n, and maintain eight checks per block across all designs. Black boxes indicate a single area and pairwise genetic correlations (rk1k2) were simulated between these areas. Note that in the high replication scenario, RR and CR had two and three different areas, respectively. In contrast, both RR and CR had equal numbers of areas in the intermediate and low replication scenarios.

Principal component analysis based on the genomic relationship matrix. Principal component one (PC1) and principal component two (PC2) computed from the additive genomic relationship matrix of the 4102 wheat lines used in this study are plotted on the x and y‐axis, respectively. The first two principal components explain 5.8% and 5% of the variation. Lines are color‐coded indicating to which cohort they belong. Cohort S1, S3, S3, and S4 are indicated with dark blue circles, light blue diamonds, red squares, and yellow triangles, respectively. Lines clustering with the preliminary cohorts (S1 and S2) are circled in blue, and lines clustering with the advanced cohorts (S3 and S4) are circled in orange.

Conventional selection accuracy across experimental factors. Box‐and‐whisker plots show distributions of best linear unbiased prediction (BLUP) accuracies across 500 simulations in restricted randomization (RR, blue) or complete randomization (CR, orange) as rbv decreases along the y‐axis. Panels are arranged by narrow‐sense heritability within environment hj2, which ranged from 0.8 to 0.2, and replication level, which was either low, intermediate, or high. Genetic correlation between areas, rk1k2, is shown decreasing across the x‐axis, representing increasing genotype‐by‐area interaction.

Difference‐in‐differences (DiD) analysis of conventional selection accuracies. Accuracy of best linear unbiased prediction (BLUP), rby, is shown along the y‐axis for restricted randomization (RR, blue; control) and complete randomization (CR, orange; intervention) as experimental conditions progressively deteriorate from an optimal baseline, moving from left to right on the x‐axes. Narrow‐sense heritability within environment, hj2, panel A; Deterioration in replication level, panel B, and genetic correlation between areas rk1k2, panel C is examined individually while keeping the other parameters at their optimal levels. Light lines denote individual simulation runs (n = 500 per scenario), while bold lines show the overall mean trend of the given design. The vertical dotted line (“baseline”) marks the optimal factor combination (hj2 = 0.8, high replication, rk1k2 = 1), at which RR and CR perform equivalently (rbv ≈ 0.97), supporting the parallel‐trends condition for causal interpretation. Shaded regions between design lines illustrate the magnitude of difference in rbv , with changes in the width of these regions relative to baseline visualizing the DiD effect.

Genomic selection accuracy under low replication across experimental factors. Box‐and‐whisker plots show the distributions of genomic best linear unbiased prediction (GBLUP) accuracies across 50 simulations in restricted randomization (RR, blue) or complete randomization (CR, orange) with rbv along the y‐axis. Panels are arranged by narrow‐sense heritability within environment, hj2, ranging from 0.8 to 0.2. Genetic correlation between areas, rk1k2, is shown decreasing across the x‐axis, representing increasing genotype‐by‐area interaction.

Difference‐in‐differences analysis of genomic selection under low replication. Accuracy of genomic best linear unbiased prediction (GBLUP), or rby , is shown along the y‐axis for restricted randomization (RR, orange) and complete randomization (CR, blue) designs. Light lines represent individual simulation runs (n = 50 per scenario), while bold lines show the overall mean trends of either design. Black error bars at the left side of the plot display heritability‐specific means ± standard deviation, with corresponding within environment narrow sense heritability, hj2, values labeled on the left. The vertical dotted line marks the optimal baseline conditions where both designs achieve near‐equivalent performance, supporting parallel trends assumptions for causal inference. Shaded regions between design trendlines illustrate the magnitude of difference in rbv as the genetic correlation between areas, rk1k2, progressively deteriorates along the x‐axis past the intervention point, with minimal change in the width of these regions from baseline, illustrating the non‐significant difference‐in‐differences (DiD) effect.

Genomic selection accuracy of sparse testing designs under low replication across experimental factors. Box‐and‐whisker plots show distribution of genomic best linear unbiased prediction (GBLUP) accuracies, rbv , across 50 simulations in restricted randomization (RR, blue) or complete randomization (CR, orange) with rbv along the y‐axis, where, for a given run: S3 and S4 cohorts are evaluated in all environments, S1 and S2 cohorts are evaluated in only one of five environments, and rbv is calculated based on the predictive ability of GBLUP for S1 and S2 cohorts in the four untested environments. Panels are arranged by narrow‐sense heritability within environment, hj2, which ranged from 0.8 to 0.2. Genetic correlation between areas, rk1k2, decreases across the x‐axis, representing increasing genotype‐by‐area interaction.

Difference‐in‐differences analysis of genomic selection accuracy for sparse testing designs under low replication. Accuracy of genomic best linear unbiased prediction (GBLUP), rby , is shown along the y‐axis for restricted randomization (RR, orange) and complete randomization (CR, blue); where, for a given run: S3 and S4 cohorts are evaluated in all environments; S1 and S2 cohorts are evaluated in only one of five environments, and rbv is calculated based on the selection accuracy of GBLUP for S1 and S2 cohorts in the four untested environments. Light lines represent individual simulation runs (n = 50 per scenario) for CR and RR designs, while bold lines show overall mean trends. Black error bars at the left side of the plot display heritability‐specific means ± standard deviation, with corresponding within‐environment heritability ,hj2, values labeled on the left. The vertical dotted line marks the optimal baseline conditions where both designs achieve near‐equivalent performance, supporting parallel trends assumptions for causal inference. Shaded regions between design trendlines illustrate the magnitude of difference in rbv as genetic correlation between areas, rk1k2, progressively deteriorates along the x‐axis past the intervention point, with the widening of this region from baseline visualizing the significant difference‐in‐differences (DiD) effect.

Tables4

TABLE 1. Description of breeding cohorts.

Cohorts	Number of lines	Designation
Stage 1	1204	Preliminary
Stage 2	1519	Preliminary
Stage 3	1151	Advanced
Stage 4	228	Advanced

TABLE 2. Description of trials simulated at each location.

Randomization scheme	Design parameter set	Replication	Trial	Cohorts included	Trial design type	Blocks	Replicates per line within cohort
Randomization scheme	Design parameter set	Replication	Trial	Cohorts included	Trial design type	Blocks	S1	S2	S3	S4
Restricted	1	High	Prelim	S1, S2	RCBD	2	2	2	0	0
			Adv	S3, S4	IBD	3	0	0	2	3
	2	Intermediate	Prelim	S1, S2	RCBD	2	2	2	0	0
			Adv	S3, S4	RCBD	2	0	0	2	2
	3	Low	Prelim	S1, S2	CRD	1	1	1	0	0
			Adv	S3, S4	RCBD	2	0	0	2	2
Complete	4	High	Multi‐cohort	S1, S2, S3, S4	IBD	3	2	2	2	3
	5	Intermediate	Multi‐cohort	S1, S2, S3, S4	RCBD	2	2	2	2	2
	6	Low	Multi‐cohort	S1, S2, S3, S4	p‐rep	2	1	1	2	2

TABLE 3. Multi‐factor analysis of variance (ANOVA) tables examining the factors affecting selection accuracy in our simulations with all lines evaluated in all six environments.

Selection method	Source	Degrees of freedom	Mean squares	F‐value	Significance
Conventional selection	$h_{j}^{2}$	3	335.0	83,902.1	<2e‐16
	$r_{k_{1} k_{2}}$	4	28.2	7053.7	<2e‐16
	Randomization scheme	1	100.8	25,237.8	<2e‐16
	Replication level	2	49.2	12,323.8	<2e‐16
	$h_{j}^{2}$ × Randomization scheme	2	1.4	356.9	<2e‐16
	$r_{k_{1} k_{2}}$ × Randomization scheme	3	5.0	1258.9	<2e‐16
	Replication level × randomization scheme	2	1.3	320.2	<2e‐16
	Error	59,979	0.004	–	–
Genomic selection	$h_{j}^{2}$	3	1.0664	2292.26	< 2e‐16
	$r_{k_{1} k_{2}}$	4	0.4182	898.963	< 2e‐16
	Randomization scheme	1	0.0048	10.333	0.00133
	$h_{j}^{2}$ × Randomization scheme	3	0.003	0.687	0.55995
	$r_{k_{1} k_{2}}$ × Randomization scheme	4	0.0010	2.231	0.06339
	Error	1983	0.0005	–	–
Genomic selection with sparse testing	$h_{j}^{2}$	3	2.1638	826.093	< 2e‐16
	$r_{k_{1} k_{2}}$	4	0.4391	167.635	< 2e‐16
	Randomization scheme	1	0.1108	42.307	9.11e‐11
	$h_{j}^{2}$ × Randomization scheme	3	0.0021	0.801	0.493
	$r_{k_{1} k_{2}}$ × Randomization scheme	4	0.0186	7.087	1.12e‐05
	Error	2983	0.0026	–	–

TABLE 4. Difference‐in‐differences (DiD) coefficients and their corresponding p‐values showing which simulation parameters significantly impacted the difference in performance between randomization schemes for a given selection method.

Selection method	Source	DiD coefficient	p‐value
Conventional selection	$h_{j}^{2}$	−0.002	0.726
	Replication level	0.005	1.64e‐05
	$r_{k_{1} k_{2}}$	0.082	<2e‐16
Genomic selection	$r_{k_{1} k_{2}}$	0.002	0.653
Genomic selection with sparse testing	$r_{k_{1} k_{2}}$	0.018	0.07

Funding4

—U.S. Wheat and Barley Scab Initiative
—Siemer Milling Company
—Department of Crop Sciences at the University of Illinois Urbana‐Champaign
—USDA, National Institute of Food and Agriculture

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenetic and phenotypic traits in livestock · Genetic Associations and Epidemiology · Genetic Mapping and Diversity in Plants and Animals

Full text

INTRODUCTION

1

Yield trials are a cornerstone of breeding programs, essential for achieving genetic gain through selection, but often requiring significant resources. Yield trials are conducted to generate phenotypic data on breeding materials, such as lines, families, and hybrids, to support selection decisions and train predictive statistical models. However, phenotypes are a culmination of both genetic and non‐genetic factors, and in the case of yield, most of the phenotypic variation is non‐genetic. Plant breeders who make selection decisions based on total genetic values or breeding values therefore must employ statistical designs appropriate for yield trials to facilitate the separation of genetic from non‐genetic effects. To ensure accurate estimates of genetic or breeding values, replication within and across environments is crucial, as is blocking and proper research plot technique. We define a single yield trial as one that takes place in a unique location‐year combination and is developed using a single randomization process and entry list. In breeding programs, multiple yield trials are often conducted across environments as multi‐environment yield trials and ultimately analyzed together to obtain more accurate estimates of genetic or breeding values.

To adequately sample environments, yield testing is conducted across years, in which only the best 15%–30% of breeding materials will be advanced for testing the following year. To accommodate the selection and advancement of breeding materials through yield trials, breeding programs utilize formalized multi‐environment testing processes. A consistent process is applied to each cohort of breeding materials that enters multi‐location testing. A cohort is defined as a set of genotypes within a specific selection stage of the advancement process (Covarrubias‐Pazaran et al., 2022). In yield trials, cohorts are often referred to by their stage of testing or by their selection cycles (Cullis et al., 2020). For example, genotypes in their first year of yield testing may be referred to as stage one (S1). Genotypes in their second, third, and fourth years of yield testing may be referred to as stage two (S2), stage three (S3), and stage four (S4), respectively (A. Smith et al., 2021). Genotypes belonging to the same cohort are typically generated from the same crossing block; therefore, they are expected to be more closely related to each other compared to genotypes from different cohorts. Typically, cohorts are evaluated in separate yield trials; however, two or more cohorts can sometimes be evaluated in the same trial. One common testing approach is to evaluate S1 genotypes together in what is known as an observation yield trial, S2 genotypes together in what is known as a preliminary yield trial, and finally, S3 and S4 genotypes together in what is known as an advanced yield trial. Another common practice is to evaluate S1 and S2 genotypes together within a preliminary yield trial, followed by an advanced yield trial. Regardless, separating cohorts into different trials effectively restricts randomization to within groups of genotypes that are more closely related.

For each individual yield trial, design factors such as the replication per entry and the proportion of genotypes selected for further testing generally remain consistent from year to year within the breeding program. The replication of test entries scales with advancement through the trial stages: entries in observation trials are typically unreplicated, those in preliminary yield trials may be unreplicated or replicated twice, and advanced yield trials generally employ two or three replicates (Bernardo, 2010). Selection for advancement to the next stage is performed within a cohort, and evaluating cohorts in separate trials is intended to improve this within‐cohort selection accuracy.

In addition to generating data used to make advancement decisions, yield trials often serve as a source of phenotypic data for training genomic prediction models to facilitate genomic selection (GS). As reviewed by Lorenz and Nice (2017), GS has revolutionized selection in breeding programs. In the context of GS, a genomic prediction model is trained with genotypic and phenotypic data and then used for selection (Jannink et al., 2010). This enables the selection of parents among genotypes that have not yet been tested extensively or at all, thereby enabling shorter breeding cycles. Training population sizes and genomic relationships between test and training sets are critical for prediction accuracy (Combs & Bernardo, 2013; Lorenz et al., 2011; Lorenz & Smith, 2015). The implementation of GS also facilitates “sparse testing” of entries across a larger number of environments (Atanda et al., 2021; Heffner et al., 2010; Verges & Sanford, 2020). In the case of sparse testing, breeding materials are intentionally evaluated in different sets of environments and analyzed together in a GS model that enables the prediction of breeding values despite the unbalanced testing design.

In most GS use cases, breeders must routinely combine data from different cohorts for analysis to train and update GS models. Data from yield trials is used for GS model training in many breeding programs because it contains many unique genotypes with high‐quality data from multiple environments (Combs & Bernardo, 2013; Lorenz & Smith, 2015). Each year, new data from yield trials is generated, facilitating the updating of GS models required to maintain accuracy (Bassi et al., 2016; Gaynor et al., 2017). The advent of GS has increased the need for effective analysis across cohorts to train accurate prediction models.

A potential problem arises when using yield trial data to train GS models if cohorts are tested in separate yield trials. Genetic differences between cohorts and non‐genetic differences between trials may be confounding (Clarke & Stefanova, 2011; Robbins et al., 2012). Furthermore, genotype‐by‐trial interaction may be present and poorly separated from other effects. These problems could potentially reduce GS model accuracy. While studies have not examined how restricting randomization to within cohort impacts GS accuracy, a study by Piepho and Williams (2006) examined the efficiency of complete randomization (CR) across all breeding materials with that of restricted randomization (RR), in which randomization was performed within groups of closely related breeding materials. The authors ultimately concluded that if selection occurs within groups, RR should be used; if selection occurs across groups, CR should be used.

The use of common entries across trials, referred to as checks, can help mitigate confounding between genetic and non‐genetic factors in testing designs and improve accuracy; however, there are costs associated with check plots. We refer to two types of check entries: design checks and performance checks. Design checks are the entries included in the statistical design solely to facilitate the estimation of genetic and non‐genetic effects as well as variances. When analyzing multiple trials together, multiple design checks must be shared between trials to estimate the trial effects appropriately. In addition, design checks are needed to estimate block effects and residual error variances in trials designed to evaluate unreplicated test entries. Design checks may occupy up to 20% of available plots in breeding trials (Chandra, 1994; Martin et al., 2006; Moehring et al., 2014), or as suggested by Fisher (1926), the square root of the number of test plots. This is a considerable expense that breeding programs must offset by testing fewer lines. In addition to design checks, performance checks represent a market variety and product profile that is to be replaced, acting as a benchmark to which test lines are compared (Cobb et al., 2019). Occasionally, performance checks will be used as design checks to save resources; however, complete reliance on performance checks for the estimation of error variances, block effects, and trial effects can result in poor estimation of genetic effects for unreplicated lines if performance checks are not representative of the test entries (Moehring et al., 2014).

Rather than conduct separate trials for lines in different stages of testing and rely on design checks to ensure block and trial effects are estimated accurately, lines in different stages of testing could be randomized together. This approach could limit the confounding of genetic and non‐genetic factors while maximizing the number of plots allocated to selection candidates. The breeding program's selection and advancement procedures could remain unchanged. If all cohorts are to be replicated, combining different cohorts together in the same trial can be achieved with familiar designs such as RCBD or incomplete block designs. Typically, in a breeding program, at least one cohort will not be replicated, in which case partially replicated designs, or p‐reps, first proposed by Cullis et al. (2006), are useful. In a *p‐*rep, test entries that are at later stages of testing can serve as the replicated entries, enabling the estimation of block effects and the error variance. An integral aspect of p‐reps is that the replicated entries are randomized together with unreplicated entries in a single design to avoid design checks, thereby maximizing the number of test plots for a given budget. Importantly, the percentage of plots that are replicated test plots should be at least 20% to estimate the error variance and block effects effectively. Clarke and Stefanova (2011) found that the increased number of replicated test plots replacing check plots within a p‐rep design increased error degrees of freedom (df) and average precision.

Prior research has not explicitly examined how restricting randomization to within cohorts affects selection accuracy. Statistical design principles and a previous study by Piepho and Williams (2006) led us to hypothesize that CR may help improve the accuracy of selection based on breeding value estimates resulting from a combined analysis across trials. Hence, the overarching goal of this research is to test the hypothesis that switching from RR within breeding cohorts to CR across breeding cohorts improves conventional and GS accuracy when selection is performed across cohorts.

MATERIALS AND METHODS

2

To test our hypothesis, we simulated breeding program scenarios replicating CR and RR conditions. The following overview outlines our approach before detailing specific methods in subsequent subsections. We first establish key terminology relevant to the central question, then detail the simulation of breeding values and phenotypes, and finally describe the statistical models used to estimate selection accuracy.

Throughout this work, we define an environment as a unique location‐year combination (e.g., Urbana South Farm—2022). Within each environment, there exist spatially separate areas suitable for research plots (Figure 1). Areas are central to our comparison of two randomization strategies. Within this study, the term “area” describes sections within a single environment that may differ in soil characteristics, microclimate, or management timing—subdivisions that commonly serve as either trials or blocks depending on the experimental design. In RR, cohorts are allocated to different areas within an environment, with each area functioning as a distinct trial containing separate breeding cohorts (e.g., preliminary and advanced trials). In CR, all cohorts are randomized together as a single trial (e.g., p‐rep design), with areas serving as blocks within that trial. Whether an area exists as a trial in RR or a block in CR, the number of test plots per area and the number of areas per environment were held constant between pairwise comparisons of RR and CR to ensure experimental integrity (with one exception at high replication, as shown in Figure 2).

Depiction of the testing designs and hierarchy of genetic correlations simulated. A whole environment is represented with a grey rectangle, while areas within environments are represented by the two colored rectangles within each grey rectangle. Five environments were simulated with varying levels of genetic correlation between environments, rj1j2. Under restricted randomization (RR), the areas within environment represent two separate trials with preliminary cohorts, S1 and S2, randomized separately from advanced cohorts, S3 and S4. Under complete randomization (CR), areas within environment represent blocks of a single trial with preliminary and advanced cohorts randomized together. Five levels of genetic correlation between areas were evaluated (rk1k2 = 1.0, 0.8, 0.6, 0.4, or 0.2). Lower rk1k2 values indicate stronger genotype‐by‐area interaction, which may occur due to spatial separation or differential management of the areas.

Corresponding design configurations for each replication level. Design configurations are shown as heatmaps for a single simulation run. Cohorts are color‐coded with S1, S2, S3, and S4 corresponding to dark blue, light blue, dark orange, and light orange, respectively. Replication level, replication per cohort, total test plots per location, and total check plots per environment are displayed next to each design. Comparable pairs of design configurations are shown side by side, with restricted randomization (RR) shown on the left and complete randomization (CR) shown on the right. With RR, preliminary cohorts S1 and S2 are randomized together to simulate a preliminary yield trial (PYT) while advanced cohorts S3 and S4 are randomized together to simulate an advanced yield trial (AYT). Each pair shares constant test plot numbers, n, and maintain eight checks per block across all designs. Black boxes indicate a single area and pairwise genetic correlations (rk1k2) were simulated between these areas. Note that in the high replication scenario, RR and CR had two and three different areas, respectively. In contrast, both RR and CR had equal numbers of areas in the intermediate and low replication scenarios.

Phenotypes reflective of a highly quantitative trait similar to yield were generated to populate these experimental designs using actual marker data from University of Illinois winter wheat breeding lines. Marker effects were first simulated with additive genetic covariances modeled between areas within environments and between separate environments, creating genotype‐by‐environment interaction (G × E) at two spatial scales. From these correlated marker effects, additive genetic values were computed for each line within each area‐environment combination, and a single true breeding value was then calculated for each genotype as the mean across all area‐environment‐specific genetic values. Residual error was subsequently added to create phenotypes at designated narrow‐sense heritabilities (0.2–0.8). For each simulation scenario, we fitted conventional best linear unbiased prediction (BLUP), genomic BLUP (GBLUP), and genomic‐enabled sparse testing models to the simulated phenotypes. Selection accuracy was assessed as the Pearson correlation between predicted and true breeding values and compared between randomization schemes across gradients of heritability, genetic correlation between testing areas (0.2–1.0), and replication level.

Genetic materials and genotyping

2.1

The wheat breeding lines utilized in this study consisted of F_3_‐derived lines from four different breeding cohorts (Table 1) developed by the University of Illinois wheat breeding program. All four cohorts were evaluated for yield in 2022 in Illinois as part of the breeding program's standard evaluation process. Cohorts S1, S2, S3, and S4 were in their first, second, third, and fourth years of testing, respectively. Lines belonging to the same cohort were developed from the same set of parental lines and are expected to be slightly more genetically related than lines from different cohorts. Cohorts S1 and S2 were designated as preliminary, and S3 and S4 were designated as advanced (Table 1). Entire cohorts S1, S2, and S3 were included in this study. Cohort size varied due to variability in the number of breeding candidates developed each year. For cohort S4, only the lines that were advanced to the second year of testing were genotyped; thus, just 20% of the original S4 cohort was included in this study. All lines were genotyped using genotyping‐by‐sequencing performed by the Eastern Regional Small Grains Genotyping Lab as described by Gaire et al. (2021) and Gaire et al. (2022). In total, 12,505 SNPs with call rates of at least 20% were obtained and used when simulating breeding values. For analysis, markers with minor allele frequencies lower than 0.05 were excluded, resulting in 9262 SNP markers.

Assessment of population structure

2.2

Using the marker data on all lines, we computed a genomic relationship matrix (GRM) calculated according to the additive relationship approach of Endelman (2011), as implemented by the A.mat function from the rrBLUP package. A principal component analysis (PCA) was performed on the GRM using the package ASRgenomics (Gezan et al., 2022). We then examined the proportion of the variance explained by the first two principal components and created a scatter plot of the first two principal components to visualize population structure.

Simulating marker effects and breeding values

2.3

For each area within an environment, we assumed that markers had different but correlated effects to create genotype‐by‐area interaction. To accomplish this, marker effects were sampled from a multivariate normal distribution with genetic covariances between areas, simulated according to the covariance matrix R. Here, we illustrate R for the case of two environments and two areas within each environment:

[eqn]

Matrix R is the covariance between each area within an environment, with ones along the diagonal corresponding to the genetic variances simulated. The genetic covariance between environment one, area one and environment one, area two is σk12. The genetic covariance between environment one, area one, and environment two, area one is 0.5. The genetic covariance between environment one, area one and environment two, area two is 0.5. The genetic covariance between environment one, area two and environment two, area one is σk21. The genetic covariance between environment one, area two and environment two, area two is 0.5. The genetic covariance between environment two, area one and environment two, area two is σk12. The notation σk12 denotes the genetic covariance between areas k 1 and k 2 within an environment, defined as the genetic variance of σg2·rk1k2, where σg2 is the genetic variance and rk1k2 is the genetic correlation between those areas. The factor rk1k2 was manipulated in our simulation to control the magnitude of genotype‐by‐area interaction.

In our simulation, there were five environments with two areas in each, thus R was a 10 × 10 matrix. Correlated marker effects across *p = *10 areas within environments were simulated using the R function mvrnorm from package “MASS” (Venables & Ripley, 2002) using the matrix R as the covariance matrix. Each marker had p different effects corresponding to each area within an environment, and marker effects were simulated for each SNP.

Additive genetic values for each of the 4106 lines within each area within environment were generated from simulated marker effects by solving for the product of genotype matrix M of order m × n and the marker effect matrix Q of order n × p. For each line, a single true breeding value was then computed as the average across all 12 additive genetic values. Because all SNP markers were assigned an effect, the trait we simulated was highly quantitative, like yield.

Simulating phenotypes

2.4

To simulate phenotypic values from the environment and area‐specific additive genetic values for our breeding lines, we defined phenotypic value as follows:

[eqn]

where yijk is the phenotypic value of genotype i within environment j and area k, gi(jk) is the additive genetic value of line i in environment j and area k, ej is the main effect of environment j, and εijk is residual error. In each simulation, the variance of the vector of additive genetic values within an environment was taken as environment‐specific genetic variance, σg(j)2. Using the R function rnorm, values of ej were sampled from a normal distribution with a variance equal to twice σg(j)2. Values of εijk, were sampled from a normal distribution with a σ=σε2, μ=0. Within each environment σε2 was determined based on the desired level of narrow‐sense heritability within the environment on a per‐plot basis hj2, for the simulation run. This variance was calculated as follows:

[eqn]

where σε2 is the within‐environment residual error variance, hj2 is a designated narrow‐sense heritability within ej, σg(j)2 is the genetic variance within ej. Ultimately, at ej in a given simulation, the designated hj2 corresponds to:

[eqn]

where variables carry the same definitions as in Equation (3).

Error variance (σε2) was calculated at the environment level to ensure both RR and CR faced equivalent residual error variance relative to genetic variance. To be conservative, area main effects, either block‐ or trial‐specific, were not simulated. Yet, area main effects can arise due to chance. Contrary to environment main effects (*e_j_ *), area main effects would be cleanly estimable in CR designs but confounded with genetic effects in RR designs alongside rk1k2 unless common checks are present between areas. To conservatively evaluate the merit of CR relative to RR, this simulation did not consider area main effects supposing that these effects could be estimated with the checks. Genotype‐specific performance varied between areas within a given environment solely through genotype‐by‐area interactions, implemented via correlated marker effects with correlation structure determined by rk1k2. Beyond this genetic variation across areas, phenotypic variation among replicates of the same genotype within each area arose from plot‐level residual error (εijk), which captured microenvironmental heterogeneity and measurement error. This approach enabled the evaluation of replication level, heritability (hj2), and rk1k2 effects on selection accuracy (*r_bv_ *) and avoided introducing assumptions that might favor CR over RR.

Scenarios simulated

2.5

Simulations were performed across a range of conditions that may be experienced in plant breeding trials. The controllable factors that we manipulated were the aspects of the testing design, which included the randomization scheme (CR or RR) and replication level. The uncontrollable factors that we manipulated were the hj2 and the genetic correlation between areas within environments (rk1k2), creating genotype‐by‐area interaction (Table 2). We describe these factors as uncontrollable because they are largely beyond the control of researchers.

For the controllable factors, we created six different design configurations, allowing us to evaluate three different levels of replication within each of the two randomization schemes. Within each of the randomization schemes, there were three different levels of replication, resulting in a total of six different within‐environment design configurations, where each replication level was comparable to its counterpart in either CR or RR (Figure 2). In design configurations one, two, and three that employed RR, there was a preliminary trial consisting of S1 and S2 cohorts and an advanced trial consisting of S3 and S4 cohorts (Table 1). Randomization was restricted within trial, which was also the designated area within environment in RR. Design configuration one contained high replication, where the S3 and S4 cohorts were replicated three times and the S1 and S2 were replicated twice, reflective of programs that may employ elite trials in addition to advanced and preliminary (Belamkar et al., 2018). Design configuration two was nearly the same as set one, except the replication was intermediate and balanced across cohorts, with all lines replicated twice. In design configuration three, the preliminary trial was not replicated, and the advanced trial was replicated twice.

In the design configurations four, five, and six that employed CR, there was only one trial within every set consisting of all four breeding cohorts, and resolvable blocks were designated as the area within environment. The level of replication in design configuration four was high, with S3 and S4 lines replicated three times and S1 and S2 lines replicated twice, comparable to RR set one. Design configuration five had intermediate replication with all lines replicated two times, comparable to RR set two. Design configuration six had low replication, with no replication of S1 and S2 lines and two replicates of S3 and S4 lines, comparable to RR set three.

Test plot numbers were held constant between corresponding RR and CR design configurations (Figure 2), with eight check plots allocated per block across all design configurations (four unique checks, replicated twice per block). The total entry numbers differed due to variations in check‐plot composition. The RR designs simulated more total checks per trial, a pattern that reflects the RR's reliance on design checks to support the blocking structure and the RR of cohorts within them. Conversely, CR's completely randomized structure inherently requires fewer blocks, and the simulated check numbers mirrored how this design obviates the need for design checks, with a reduced check line composition reflecting the use of performance checks alone.

For the uncontrollable factors, we simulated hj2 levels of 0.8, 0.6, 0.4, and 0.2. The levels of rk1k2 simulations were 1.0, 0.8, 0.6, 0.4, and 0.2. When rk1k2=1, there is no genotype‐by‐area interaction, and we expect no advantage of CR relative to RR. When rk1k2=0.2 the genotype‐by‐area interaction is at its highest level, representing the most challenging scenario. All possible combinations of hj2, rk1k2, and design configuration were considered, for a total of 120 different scenarios. For every scenario, we simulated five environments, assuming a genetic correlation between environments of 0.5. Conventional selection underwent 500 simulation runs for every unique scenario, while GS underwent 50 simulations due to computational load. In addition to these balanced multi‐environment testing scenarios, we mimicked two sparse‐testing multi‐environment trial designs, one with RR and one with CR. To do so, we used the low‐replication design configurations three and six but only allowed the S1 and S2 cohorts to be evaluated in one of the five environments. For every unique scenario considered in the sparse testing experiment, we performed 50 simulation runs.

Breeding value estimation

2.6

To estimate breeding values from each simulated dataset, we fit mixed models with environment‐specific genetic variances and environment‐specific block variances according to A. B. Smith et al. (2007):

[eqn]

where yijk is the phenotypic value for the ith genotype in the kth block of the jth environment. The mean is μ, the fixed effect of environment j is ej, the random effect of block k nested within environment j is bk(j) with bk(j)∼N(0,B), where B contains environment‐specific block variances on the diagonal. The random effect of genotype i within environment j is gi(j). We assumed gi(j)∼N(0,Gg). The variance structure Gg was given by:

[eqn]

where Gg is an i × i genetic variance matrix with environment‐specific genetic variances on the diagonals. The covariance matrix Gm is a m×m genetic covariance matrix for lines. We had two assumptions for Gm. First, lines were assumed unrelated and Gm was an identity matrix. We refer to this method as “conventional selection.” Next, lines were assumed to be related and Gm was a GRM calculated previously. We refer to this method as GS. Models were implemented in ASREML‐R Version 4.2 (Butler et al., 2018).

The conventional selection model was evaluated for all 120 scenarios, as the absence of marker data permitted this comprehensive assessment with reasonable computational load. For GS, we examined the low replication scenarios, including sparse testing, because these are of greatest interest to plant breeders performing GS and because interactions with the randomization scheme and replication level were not detected in our initial analyses using a conventional selection model. In total, GS accuracies were evaluated for 40 unique simulation scenarios across the varying levels of hj2 and rk1k2. In each simulation and for each analysis method, selection accuracy was assessed using a Pearson's correlation between the breeding value estimates and the true breeding values simulated, referred to as *r_bv_ *. For conventional selection and GS, *r_bv_

was evaluated across all cohorts within a simulation. For genomic‐enabled sparse testing, *r_bv_
was evaluated for only the test set, containing only the untested S1 and S2 cohorts. r¯ bv was calculated by averaging across all *r_bv_
calculated from all individual runs of a simulation.

Analyses of simulation results

2.7

Analysis of variance (ANOVA) was performed to examine the main effects of randomization scheme, hj2, replication level, and rk1k2 on selection accuracy. The ANOVA model included all main effects plus two‐way interactions between randomization scheme and each of the other factors, as described by the following equation:

[eqn]

where yijklm is the observed selection accuracy, μ is the overall mean; zi, hj rk, ρl are the fixed main effects of the ith randomization scheme of either CR or RR, jth heritability level, the kth rk1k2 level and lth replication level; (h×z)ik, (r×z)jk, (ρ×z)kl, are the fixed two way interaction effects between randomization scheme z and each of heritability, rk1k2, and replication level, respectively, νm∼N(0,σν2) is the random effect of the mth simulation run accounting for stochastic variability across independent runs, and *ϵ_ijklm_

is the residual error term assumed to be independent and identically distributed. For GS and sparse testing analyses, where only low replication scenarios were evaluated, the ANOVA was updated to omit the replication factor ρl from both the main and interaction effect terms.

To quantify how selection accuracies differ for RR and CR across a range of factors, we employed difference‐in‐differences (DiD) estimation, allowing us to isolate randomization scheme‐specific responses as experimental conditions deteriorate from optimal. This approach requires three key assumptions that our simulation satisfies: (1) consistency through well‐defined treatment groups, (2) positivity through equal allocation of genotypes and factor levels across designs, and (3) parallel trends demonstrated by equivalent performance (±0.001 r¯ * bv *) under optimal conditions (Rothbard et al., 2024). Establishing these parallel trends requires designating control and intervention groups that perform equivalently at baseline; subsequent divergence can then be causally attributed to differential responses rather than pre‐existing differences. Following this DiD convention, we assigned the established practice (RR) as the control group and the alternative approach (CR) as the intervention group. The optimal factor combination that serves as our baseline of equivalent performance was hj2 = 0.8, high replication, and rk1k2= 1, representing the best‐case scenario for a breeding program in which selection accuracies are highest. In this case, parallel trends are achieved given that selection accuracies are equivalent for both RR and CR at baseline. Post‐intervention conditions were those with lower levels of hj2, replication, and rk1k2. The DiD model was specified as follows:

[eqn]

where the expectation operator, E(Y|Z,T), is the expected selection accuracy Y, for any individual simulation run given the randomization scheme where Z 0 = control group, RR, and Z 1 = intervention group, CR, and factor conditions where T = 0 for optimal, that is, pre‐treatment baseline, and T = 1 for suboptimal, that is, post‐treatment. β0,t is the baseline intercept for optimal conditions of our control group RR, β1I(T=1) is the effect of suboptimal conditions on the prediction accuracies in the control group RR, β2Z is the expected difference between prediction accuracies of control RR and intervention CR under optimal conditions, and finally, our coefficient of interest δ̂ DiD, the DiD estimator, is represented by β 3, and quantifies the differential response between randomization schemes as a given factor (hj2, rk1k2, or replication level) progressively deteriorates from optimal to suboptimal. Positive DiD values indicate greater performance of the intervention group, CR, relative to the control group, RR, as experimental conditions become less conducive to higher prediction accuracies.

RESULTS AND DISCUSSION

3

Population structure among the breeding lines was slight

3.1

Little to no population structure was observed among our breeding lines. The PCA revealed that PC1 and PC2 accounted for only 5.8% and 5% of the total variation, respectively (Figure 3). These relatively low percentages are not surprising because all lines belong to four consecutive cohorts of a single breeding program. Nevertheless, some separation of preliminary and advanced breeding lines is evident, which is expected given that lines of different cohorts were developed from different, albeit related, sets of parental lines. The genetic differences between preliminary and advanced lines are expected to lead to confounding between genetic and trial effects when these groups are evaluated in separate trials.

Principal component analysis based on the genomic relationship matrix. Principal component one (PC1) and principal component two (PC2) computed from the additive genomic relationship matrix of the 4102 wheat lines used in this study are plotted on the x and y‐axis, respectively. The first two principal components explain 5.8% and 5% of the variation. Lines are color‐coded indicating to which cohort they belong. Cohort S1, S3, S3, and S4 are indicated with dark blue circles, light blue diamonds, red squares, and yellow triangles, respectively. Lines clustering with the preliminary cohorts (S1 and S2) are circled in blue, and lines clustering with the advanced cohorts (S3 and S4) are circled in orange.

Conventional selection accuracy improved with CR

3.2

Conventional selection accuracies were higher for CR than RR (Figure 4), but their relative performance depended on the level of rk1k2, and to some extent hj2 and replication level (Figure 5). ANOVA confirmed that all main and interaction effects of factors in our simulation, including the randomization scheme, had highly significant main effects on conventional selection accuracy (p ≤ 0.001, Table 3). Among the main effects, hj2 exerted the largest influence on selection accuracy, followed by randomization scheme and replication level, while rk1k2 showed the smallest main effect. However, the interaction effects revealed a contrasting pattern. Among the interactions with randomization scheme, rk1k2 exhibited the strongest interaction effect on selection accuracy, followed by replication level, while the hj2 × randomization scheme interaction was comparatively weak (Table 3). This pattern suggests that rk1k2 is a key determinant of whether randomization scheme influences selection outcomes, despite having a smaller main effect on selection accuracy than hj2, replication level, and randomization scheme alone.

Conventional selection accuracy across experimental factors. Box‐and‐whisker plots show distributions of best linear unbiased prediction (BLUP) accuracies across 500 simulations in restricted randomization (RR, blue) or complete randomization (CR, orange) as rbv decreases along the y‐axis. Panels are arranged by narrow‐sense heritability within environment hj2, which ranged from 0.8 to 0.2, and replication level, which was either low, intermediate, or high. Genetic correlation between areas, rk1k2, is shown decreasing across the x‐axis, representing increasing genotype‐by‐area interaction.

Difference‐in‐differences (DiD) analysis of conventional selection accuracies. Accuracy of best linear unbiased prediction (BLUP), rby, is shown along the y‐axis for restricted randomization (RR, blue; control) and complete randomization (CR, orange; intervention) as experimental conditions progressively deteriorate from an optimal baseline, moving from left to right on the x‐axes. Narrow‐sense heritability within environment, hj2, panel A; Deterioration in replication level, panel B, and genetic correlation between areas rk1k2, panel C is examined individually while keeping the other parameters at their optimal levels. Light lines denote individual simulation runs (n = 500 per scenario), while bold lines show the overall mean trend of the given design. The vertical dotted line (“baseline”) marks the optimal factor combination (hj2 = 0.8, high replication, rk1k2 = 1), at which RR and CR perform equivalently (rbv ≈ 0.97), supporting the parallel‐trends condition for causal interpretation. Shaded regions between design lines illustrate the magnitude of difference in rbv , with changes in the width of these regions relative to baseline visualizing the DiD effect.

Under optimal conditions (hj2 = 0.8, high replication, and rk1k2= 1), both RR and CR achieved identical selection accuracies (r¯ * _bv _ *= 0.97, Figure 4). While such ideal conditions are unachievable, this idealized baseline enables us to make causal inferences about how randomization schemes respond differently to deteriorating experimental conditions (Table 4). The observed equivalent baseline performance satisfies the parallel trends assumption necessary for DiD, allowing subsequent divergence between CR and RR to be directly attributed to their differential responses to specific factor manipulations rather than pre‐existing performance gaps.

Above all parameters, hj2 was shown to exert the strongest main effect on selection accuracy; however, its influence remained largely consistent across randomization schemes (small hj2 × randomization scheme interaction effect, Table 3). This independence is reflected by the nearly identical response slopes between RR and CR, differing by only 0.01% on average for each level of hj2 tested (Figure 4). The DiD analysis further confirms this independence (Figure 5A), yielding non‐significant interaction coefficients (δ̂ DiD, p ≥ 0.10) between hj2 and randomization scheme, indicating no differential response between CR and RR as hj2 decreased. However, it is important to keep in mind that in practice, it is likely that low rk1k2 would be observed together with low hj2. In this DiD analysis, we fixed rk1k2 = 1.0 to examine the effects of hj2 independent of rk1k2.

For replication levels, both CR and RR were equally sensitive to reduction from high to intermediate levels. However, when replication decreased further to low levels, selection accuracies suffered more with RR than CR, declining −2.98% with RR but only −1.94% with CR (Figure 4). Low replication contains unreplicated entries, distinct from high and intermediate replication levels, where entries are replicated at least twice (Figure 2; Table 2). These results agree with research demonstrating that benefits from replication follow a logarithmic relationship, where the first replication provides substantially greater gains in statistical power than subsequent ones (Yan, 2021). ANOVA showed an extremely significant interaction between replication level and randomization scheme (p ≤ 0.001, Table 3). The DiD analysis further confirmed that the differential response between selection accuracies with RR and CR that arose as replication levels decreased was extremely significant (p ≤ 0.001, Figure 5B). However, the DiD coefficient was negligible, only corresponding to an estimated 0.5% increase in accuracy for CR as the replication level decreased (δ̂ DiD = 0.005, Figure 5B). This pattern aligns with the logarithmic relationship between replication and selection accuracy, where both randomization schemes responded similarly to initial replication reduction (high to intermediate) but diverged meaningfully only when replication dropped to unreplicated levels (low). These results are expected because low replication reduces the precision of variance component estimates, particularly when unreplicated entries are included, making accurate partitioning of genetic and error variances more difficult.

As rk1k2 decreased from 1.0 to 0.2, CR became increasingly advantageous, transforming from equal performance at rk1k2=1 to substantial CR advantages of +8.8%, +18.0%, and +18.6%, corresponding to rk1k2 = 0.6, 0.4, and 0.2, respectively (Figure 4). This finding is strongly supported by the DiD analysis results (Figure 5C), which confirmed that the differential response in selection accuracy between randomization schemes was due to the progressive deterioration of rk1k2 (p ≤ 0.001). The DiD coefficient was substantial (δ̂ DiD = 0.082), corresponding to an estimated 8.2 percentage point advantage for CR over RR with each unit decrease in rk1k2. These results strongly suggest that when different cohorts are to be analyzed together for selection in the absence of genomic relationship information, they should be randomized together.

GS accuracy was unaffected by randomization scheme under complete testing

3.3

When all cohorts were evaluated across all five environments and genomic relationship information was incorporated through GBLUP, the confounding between genetic and non‐genetic effects due to RR was effectively mitigated. The GRM provided connectivity between spatially separated cohorts that reduced design‐dependent effects on GS accuracy. Both CR and RR achieved nearly identical GS accuracies across all levels of hj2 and rk1k2 examined (Figure 6). Mean GS accuracies differed by <0.5% between designs across all scenarios, with both approaches maintaining high accuracy even under challenging conditions of low heritability and strong genotype‐by‐area interaction.

Genomic selection accuracy under low replication across experimental factors. Box‐and‐whisker plots show the distributions of genomic best linear unbiased prediction (GBLUP) accuracies across 50 simulations in restricted randomization (RR, blue) or complete randomization (CR, orange) with rbv along the y‐axis. Panels are arranged by narrow‐sense heritability within environment, hj2, ranging from 0.8 to 0.2. Genetic correlation between areas, rk1k2, is shown decreasing across the x‐axis, representing increasing genotype‐by‐area interaction.

ANOVA confirmed that while randomization scheme significantly influenced GS accuracy (p ≤ 0.001, Table 3), its impact was substantially smaller compared to that of conventional selection. Main effects for experimental factors were significant (p ≤ 0.001), and like with conventional selection, the main effect for hj2 exerted the largest influence on GS accuracy. However, the ranking of the remaining factors reversed: rk1k2 demonstrated greater influence on GS accuracy than the randomization scheme, whereas the randomization scheme exceeded rk1k2 for conventional selection. This pattern reflects not a heightened impact of rk1k2 in GS, but rather a negligible effect of randomization scheme in this context. A key finding was the absence of significant interactions between rk1k2 and randomization scheme, as well as between hj2 and randomization scheme (*p *> 0.10). Furthermore, effect sizes for both main and interaction effects involving hj2, rk1k2, and randomization scheme were substantially smaller with GS than with conventional selection (Table 3). These results indicate that the inclusion of genomic information effectively mitigated the harm of RR on selection accuracy found in conventional selection.

The DiD analysis results further confirmed the absence of meaningful differences between randomization schemes for GS with complete testing (Figure 7; Table 4). As rk1k2 progressively deteriorated, the differential response in GS accuracy between CR and RR was found to be non‐significant (*p *> 0.10), indicating that CR provided no measurable advantage over RR even as genotype‐by‐area interaction increased. This stark contrast with conventional selection (Figure 5C) demonstrates that the GRM can effectively reduce the confounding between genetic and environmental effects that occurs under RR when cohorts are restricted to separate trials. These findings suggest that breeding programs routinely incorporating GRMs in their analyses with balanced multi‐environment testing can use either CR or RR without compromising selection accuracy. The ability of the GRM to capture the genetic covariance structure among individuals across cohorts eliminates the need for CR as a mechanism to prevent genetic and non‐genetic confounding, at least when all genotypes are evaluated in all environments. However, it is important to point out that how well the GRM can reduce confounding in RR testing schemes depends on the extent to which the genotypes evaluated are genetically related.

Difference‐in‐differences analysis of genomic selection under low replication. Accuracy of genomic best linear unbiased prediction (GBLUP), or rby , is shown along the y‐axis for restricted randomization (RR, orange) and complete randomization (CR, blue) designs. Light lines represent individual simulation runs (n = 50 per scenario), while bold lines show the overall mean trends of either design. Black error bars at the left side of the plot display heritability‐specific means ± standard deviation, with corresponding within environment narrow sense heritability, hj2, values labeled on the left. The vertical dotted line marks the optimal baseline conditions where both designs achieve near‐equivalent performance, supporting parallel trends assumptions for causal inference. Shaded regions between design trendlines illustrate the magnitude of difference in rbv as the genetic correlation between areas, rk1k2, progressively deteriorates along the x‐axis past the intervention point, with minimal change in the width of these regions from baseline, illustrating the non‐significant difference‐in‐differences (DiD) effect.

CR improved GS accuracy under sparse testing

3.4

In stark contrast to complete testing where genomic information nearly eliminated design dependency, sparse testing revealed differences in design performance, with GS accuracies lower under RR than CR. In CR, GS accuracies exceeded those of RR by 1.5% across all simulated scenarios (Figure 8). At times, GS accuracies with RR were slightly higher than CR when rk1k2=1, but the comparative advantage of CR increased as rk1k2 decreased. Under the most challenging scenario of low hj2=0.2 and rk1k2=0.2, GS accuracies from CR were 5.5% higher than those of RR.

Genomic selection accuracy of sparse testing designs under low replication across experimental factors. Box‐and‐whisker plots show distribution of genomic best linear unbiased prediction (GBLUP) accuracies, rbv , across 50 simulations in restricted randomization (RR, blue) or complete randomization (CR, orange) with rbv along the y‐axis, where, for a given run: S3 and S4 cohorts are evaluated in all environments, S1 and S2 cohorts are evaluated in only one of five environments, and rbv is calculated based on the predictive ability of GBLUP for S1 and S2 cohorts in the four untested environments. Panels are arranged by narrow‐sense heritability within environment, hj2, which ranged from 0.8 to 0.2. Genetic correlation between areas, rk1k2, decreases across the x‐axis, representing increasing genotype‐by‐area interaction.

In sparse testing scenarios, hj2 remained the dominant main effect on selection accuracy (p ≤ 0.001, Table 3), consistent with results from conventional selection and GS with complete testing. The interaction between hj2 and randomization scheme, however, remained non‐significant (p ≥ 0.10, Table 3), maintaining the pattern observed in GS with complete testing. However, a key finding emerged: the rk1k2 × randomization scheme interaction returned to statistical significance under sparse testing (p ≤ 0.001, Table 3). Both the magnitude of this interaction effect and the main effect of randomization scheme positioned sparse testing as an intermediate case, exhibiting greater robustness to randomization scheme than conventional selection, yet more sensitivity to randomization scheme than GS with complete testing. These results indicate that rk1k2 resumes its role as a key determinant of whether randomization scheme influences GS accuracy when phenotypic connectivity across environments is reduced through sparse testing.

The DiD analysis results further supported this finding, showing that as rk1k2 progressively deteriorated, the differential response in selection accuracy between CR and RR became marginally significant (0.05 < p ≤ 0.10, Figure 9). The modest DiD coefficient (δ̂ _DiD _= 0.018) corresponded to CR maintaining an estimated 1.8 percentage points higher GS accuracy than RR for each unit decrease in rk1k2. These DiD results contrast with those observed for GS with a balanced multi‐environment testing design, where designs performed almost identically with non‐significant DiD coefficients when rk1k2 was the examined factor.

Difference‐in‐differences analysis of genomic selection accuracy for sparse testing designs under low replication. Accuracy of genomic best linear unbiased prediction (GBLUP), rby , is shown along the y‐axis for restricted randomization (RR, orange) and complete randomization (CR, blue); where, for a given run: S3 and S4 cohorts are evaluated in all environments; S1 and S2 cohorts are evaluated in only one of five environments, and rbv is calculated based on the selection accuracy of GBLUP for S1 and S2 cohorts in the four untested environments. Light lines represent individual simulation runs (n = 50 per scenario) for CR and RR designs, while bold lines show overall mean trends. Black error bars at the left side of the plot display heritability‐specific means ± standard deviation, with corresponding within‐environment heritability ,hj2, values labeled on the left. The vertical dotted line marks the optimal baseline conditions where both designs achieve near‐equivalent performance, supporting parallel trends assumptions for causal inference. Shaded regions between design trendlines illustrate the magnitude of difference in rbv as genetic correlation between areas, rk1k2, progressively deteriorates along the x‐axis past the intervention point, with the widening of this region from baseline visualizing the significant difference‐in‐differences (DiD) effect.

GS accuracy only sometimes benefited from CR

3.5

Altogether, these findings suggest that the addition of a GRM can nearly eliminate the genetic and non‐genetic confounding due to RR in complete multi‐environment testing designs, but when lines are not tested in all environments and RR is utilized within environments, the confounding effect becomes too great for the GRM to fully eliminate. The advantage of CR when GS is performed with sparse testing fell between that of conventional selection and GS with complete testing across all environments. This pattern reflects reduced connectivity between environments under sparse testing, which increases confounding between genetic and non‐genetic factors, ultimately impacting GS accuracies. Recall that sparse testing in this simulation implied that lines belonging to the S1 and S2 cohorts were only tested in one of the five environments, aligning these results with that of Atanda et al. (2022), who demonstrated that experimental design directly affects how effectively genomic prediction models leverage relationship matrices to predict performance in untested locations under sparse testing. Completely randomized designs mitigated these challenges by improving variance component estimation despite G×E interaction and unbalanced testing across locations, demonstrating that CR offers tangible benefits in the genomic‐enabled sparse testing designs that are becoming increasingly common in breeding programs.

CONCLUSION

4

Our simulation study demonstrates that randomization schemes within experimental designs remain consequential even in the GS era, with impact varying based on data completeness and the extent to which genotype performance varies across areas and environments. Two major conclusions from this study emerged. First, the advantage gained from CR depended on connectivity between areas within and across environments. When data connectivity is high, either randomization scheme performs well. However, when data are less connected, either through sparse testing or absence of genomic relationship information, CR outperforms RR, and this performance disparity grows as experimental conditions become less forgiving, showing up to 15.7% higher conventional selection accuracy in simulations with low replication (Figure 4). Second, rk1k2 is the primary driver of performance differences between RR and CR (Figure 5). In situations where higher genotype‐by‐area interaction is present, such as when different areas are sown on different dates, experience different management, or have different soil types, CR designs could be especially advantageous.

These findings translate to practical recommendations for breeding programs that wish to obtain accurate breeding value estimates from analyses across multiple breeding cohorts, including from GS models: (1) for programs that routinely use GRMs in analyses and work with mostly balanced multi‐environment testing data, either randomization scheme can be used to obtain accurate predictions (Figures 6 and 7); (2) for programs that do not genotype routinely (Figures 4 and 5) or those that are aiming to optimize their resource allocation using sparse testing (Figures 8 and 9), CR is recommended. Although this research demonstrates that restrictions on randomization within experimental designs in a breeding program can impact a breeder's accuracy of selection across multiple cohorts, there are additional factors to consider when deciding to switch from RR to CR. Some breeders prefer RR to facilitate visual comparisons between individuals within the same cohort or to more easily take specific data or grain samples on a specific cohort of interest. Although CR does not prevent visual comparisons or selective data collection, it can substantially increase the practical difficulty of doing so. In addition, some breeders may only require that within‐cohort selection accuracy be maximized. In which case, the conventional practice of evaluating and analyzing cohorts in separate trials should be followed. However, if gain from selection over time is desired, breeders should be concerned about the accuracy of parent selection, as this will affect the rate of genetic gain realized. When parent selection is of interest, analyzing multi‐environment, multi‐year data on multiple cohorts together in a GS model becomes more important, and therefore CR should be considered.

Some breeders have expressed concerns about CR that this simulation did not examine. First, CR could increase experimental error because preliminary cohorts may have taller height or lower lodging tolerance, thereby affecting neighboring plots. While this may be possible, in our experience with elite wheat germplasm the phenotypic differences between different cohorts are not that extreme. Certainly, the potential for this problem will be germplasm dependent. Another common concern about CR is that block sizes will be large within trials if all cohorts are randomized together, and therefore blocking will be less effective at controlling experimental error. This issue is easily solved by simply adding more incomplete blocks to the trial and, if necessary, additional design checks. Design checks can provide connectivity across incomplete blocks regardless of the number of blocks included, allowing for flexible expansion of experimental capacity.

While this study points out the benefits of CR in breeding programs, it is limited in scope. More comprehensive research examining CR and RR in breeding programs considering different germplasm, crops, and design details is warranted to understand to what degree our results can be generalized more broadly. Furthermore, our simulations were conservative and even provided potential competitive advantages to RR over CR to ensure experimental integrity and unbiased assessment for considering CR: (1) area main effects were not simulated, which would be well‐determined in CR but confounded with genetic effects under RR's trial‐restricted cohort allocation; (2) error variance was calculated per designated heritability level at the location level rather than area level to avoid area‐specific heterogeneity arising from differential genetic composition across RR areas; and (3) RR required the use of design checks to allow for genetic connectivity between cohort‐restricted areas; and utilized a larger number of blocks that inherently increased check number, resulting in a larger check composition that could potentially improve estimation of non‐genetic effects relative to CR. Further, we did not consider the potential for better resource allocation with CR relative to RR due to the reduced need for design check plots. This study also does not examine the relative merits of CR and RR when cohorts are more genetically distinct than those of this study. (We expect an even greater advantage from implementing CR when genetic differences between cohorts are great.) Nevertheless, this research exposes the need to challenge conventional thinking about how breeding trials are designed now that breeding has shifted from an art form to predictive science.

AUTHOR CONTRIBUTIONS

Arlyn Ackerman: Data curation; formal analysis; investigation; methodology; software; validation; visualization; writing—original draft; writing—review and editing. Jessica Rutkoski: Conceptualization; data curation; formal analysis; investigation; methodology; resources; supervision; validation; writing—original draft; writing—review and editing.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

Bibliography34

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Atanda, S. A. , Govindan, V. , Singh, R. , Robbins, K. R. , Crossa, J. , & Bentley, A. R. (2022). Sparse testing using genomic prediction improves selection for breeding targets in elite spring wheat. Theoretical and Applied Genetics, 135(6), 1939–1950. 10.1007/s 00122-022-04085-0 35348821 PMC 9205816 · doi ↗ · pubmed ↗
2Atanda, S. A. , Olsen, M. , Crossa, J. , Burgueño, J. , Rincent, R. , Dzidzienyo, D. , Beyene, Y. , Gowda, M. , Dreher, K. , Boddupalli, P. M. , Tongoona, P. , Danquah, E. Y. , Olaoye, G. , & Robbins, K. R. (2021). Scalable sparse testing genomic selection strategy for early yield testing stage. Frontiers in Plant Science, 12, 658978. 10.3389/fpls.2021.658978 34239521 PMC 8259603 · doi ↗ · pubmed ↗
3Bassi, F. M. , Bentley, A. R. , Charmet, G. , Ortiz, R. , & Crossa, J. (2016). Breeding schemes for the implementation of genomic selection in wheat (Triticum spp.). Plant Science, 242, 23–36. 10.1016/j.plantsci.2015.08.021 26566822 · doi ↗ · pubmed ↗
4Belamkar, V. , Guttieri, M. J. , Hussain, W. , Jarquín, D. , El‐basyoni, I. , Poland, J. , Lorenz, A. J. , & Baenziger, P. S. (2018). Genomic selection in preliminary yield trials in a winter wheat breeding program. G 3 Genes|Genomes|Genetics, 8(8), 2735–2747. 10.1534/g 3.118.200415 29945967 PMC 6071594 · doi ↗ · pubmed ↗
5Bernardo, R. (2010). Breeding for quantitative traits in plants (3rd ed.). Stemma Press.
6Butler, D. G. , Cullis, B. R. , Gilmour, A. R. , Gogel, B. J. , & Thompson, R. (2018). AS Reml estimates variance components under a general linear mixed model by residual maximum likelihood (REML). VSN International Ltd.
7Chandra, S. (1994). Efficiency of check‐plot designs in unreplicated field trials. Theoretical and Applied Genetics, 88(5), 618–620. 10.1007/bf 01240927 24186119 · doi ↗ · pubmed ↗
8Clarke, G. P. Y. , & Stefanova, K. (2011). Optimal design for early‐generation plant‐breeding trials with unreplicated or partially replicated test lines. Australian & New Zealand Journal of Statistics, 53, 461–480. 10.1111/j.1467-842x.2011.00642.x · doi ↗