Accurate determination of breed origin of alleles in a simulated smallholder crossbred dairy cattle population

Berihu Welderufael; Isidore Houaga; R. Chris Gaynor; Gregor Gorjanc; John M. Hickey

PMC · DOI:10.1186/s12711-025-00985-z·July 11, 2025

Accurate determination of breed origin of alleles in a simulated smallholder crossbred dairy cattle population

Berihu Welderufael, Isidore Houaga, R. Chris Gaynor, Gregor Gorjanc, John M. Hickey

PDF

Open Access

TL;DR

This paper introduces a new algorithm that accurately determines the breed origin of alleles in crossbred dairy cattle, which can help improve breeding programs in low- and middle-income countries.

Contribution

A novel algorithm for breed origin assignment of alleles in crossbred cattle without requiring pedigree information is developed and tested.

Findings

01

The algorithm correctly assigned breed origin to 95.76% of alleles on average across all scenarios.

02

The consensus-based assignment achieved a mean accuracy of 1.00, indicating very high reliability.

03

Threshold levels significantly affect assignment yield and accuracy, with less stringent thresholds increasing yield but decreasing accuracy.

Abstract

Accurate assignment of breed origin of alleles (BOA) at a heterozygote locus may help to introduce a resilient or adaptive haplotype in crossbreeding. In this study, we developed and tested a method to assign breed of origin for individual alleles in crossbred dairy cattle. After generations of mating within and between local breeds as well as the importation of exotic bulls, five rounds of selected crossbred cows were simulated to mimic a dairy breeding program in the low- and middle-income countries (LMICs). In each round of selection, the alleles of those crossbred animals were phased and assigned to their breed of origin (being either local or exotic). Across all core lengths and modes of phasing (with offset—move 50% of the core length forward or no-offset), the average percentage of alleles correctly assigned a breed origin was 95.76%, with only 1.39% incorrectly assigned and…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Bos taurus(bovine · species)

Figures7

Click any figure to enlarge with its caption.

Schematic representation of the simulated breeding program. A founder population on the top of the figure is split into exotic and local breeds

Illustration of a core and offset. Phasing was performed in two modes: either using the whole length of a core or by moving it forward 50% of the core length (offset) to define the beginning of a given core

Phasing and assignment algorithm overview. a Phasing was performed using AlphaPhase, b Hypothetical haplotype libraries based on a core length of ten SNPs. To assign origin to the haplotype of an individual (bottom genotype sequence), the algorithm searches for the best match in each position in the exotic (top left genotype sequence) and local (top right genotype sequence) haplotypes. In this case, the individual’s haplotype should be assigned as a local haplotype because the local haplotype library contains the haplotype with the fewest number of errors, i.e., mismatches (red)

Plot of principal component analysis of SNP genotypes (PC1 vs. PC2 and PC1 vs. PC3). a Genetic data structure of the founders and the first crossbred cows, b of all animals across generations

Effect of core length on assignment yield. a Phasing yield was very high for all core lengths but more concise for longer core lengths (core length of 220–280 SNPs), b Assignment yield was optimal for a core length of 200 SNPsTable 4Percentages of alleles correctly assigned a breed origin (%Correct), incorrectly assigned (%Incorrect), missing or unassigned (%Unassigned), and accuracy of assignment (Accuracy) for each core-length (Core)Core% Correct% Incorrect% UnassignedAccuracy10094.701.353.950.9912094.891.123.990.9914096.141.112.750.9916096.401.162.440.9918097.391.251.350.9920098.351.360.290

Percentage of allele assignment yield and accuracy of assignment. a Yield using the consensus-based allele assignment algorithm as a function of threshold level, b accuracy using the consensus-based allele assignment algorithm as a function of threshold level

Funding1

—http://dx.doi.org/10.13039/100000865Bill and Melinda Gates Foundation

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenetic and phenotypic traits in livestock · Genetic Mapping and Diversity in Plants and Animals · Wheat and Barley Genetics and Pathology

Full text

Background

Dairy cattle production in low- and middle-income countries (LMICs) is characterised by low-input and low-output production systems. To increase the milk productivity of dairy cattle, crossbreeding between the high-producing breeds of developed countries and the low-producing, but resilient breeds of LMICs has been practised for decades. Crossbreeding, either via the importation of semen from elite bulls or the use of imported bulls, has substantially increased milk production and farmers’ income [1]. However, this genetic gain has not always been observed, and overreliance on import without judicious use of best alleles is not expected to deliver the best possible genetic gains.

In many LMICs, including those in Eastern Africa, efforts are being undertaken to establish sustainable breeding programs for long-term genetic gains with a focus on smallholder farmers [2]. In partnership with government and non-government organizations, projects like the African Dairy Genetic Gains (ADGG, https://africadgg.wordpress.com) have been able to import and provide improved dairy genetics to smallholder farmers in the Eastern Africa. However, because of the differences in environmental factors and breeding infrastructure, the importation and provision of improved genetics have not yet been sustainable and successful [2]. Instead, such crossbreeding practices have led to haphazardly admixed cattle populations with no or poor pedigree records [2].

For a sustainable breed improvement through genetic intervention and for the appropriate design of breeding programs, accurate breed identification, on both the level of the individual and of the individual genetic variant, is important. In livestock populations with little or no pedigree records, the use of genomic data could be transformational in determining breed composition and establishment of breeding programs [2, 3]. Estimates of breed composition and the breed origin of alleles (BOA) from genomic data is superior to estimates from pedigree data due to invariably missing or inaccurate records and deviations from expected compositions due to Mendelian sampling [4, 5]. Especially in populations with complex ancestries like the dairy cattle in the Eastern Africa, genomic data and knowledge of breed composition is essential to evaluate the performance and adaptability of the crossbreds [5], and to predict the effectiveness of any foreign germplasm in the production systems.

Selection, genetic discovery and management decisions can be aided by determining the BOA, particularly for genetic variants that only occur in one of the constituent populations of crossbred animals [6]. Unlike determining the average breed composition of an individual, determining the breed origin of an individual’s haplotypes and associated alleles can allow breed-specific genetic evaluations to be conducted, which can increase the accuracy of genetic selection, particularly when the linkage disequilibrium pattern is different in the two breeds [7]. Thus, recent studies in admixed cattle populations have shown that the BOA method has increased the accuracy of genomic prediction [8, 9].

Using only genomic data with no pedigree data, Vandenplas et al. [6] developed an approach that traces haplotypes of crossbred animals and assigns each allele of the haplotypes to their breed of origin. To develop the algorithm that assigns alleles of crossbreds a breed origin, they simulated a three-way pig-crossbreeding program with five generations of random selection. They evaluated the accuracy of the algorithm and reported that more than 90% of alleles of crossbred animals were correctly assigned a breed origin. Thus, for up to 10% of all alleles of crossbred animals, they could not assign a breed origin. However, accurate determination of the BOA of crossbred populations is very important to estimate breed-specific effects of alleles when performing genomic evaluations [10]. If we could accurately assign breed origin for alleles at heterozygote loci of crossbred animals, we may be able to detect which haplotypes should be promoted to genetically improve dairy cows in the LMICs.

In the current study, we developed an algorithm to assign a breed of origin for alleles in crossbred dairy cattle and tested it on a simulated smallholder dairy cattle population dataset. To resolve the BOA, the algorithm aligns the haplotypes of crossbred dairy cows to the haplotypes of likely constituent breeds, i.e., imported (exotic) and/or local breeds and assigns the breed of origin based on the best match. We then evaluated the algorithm's accuracy using a simulated crossbreeding program designed to mimic the ADGG smallholder genotype data. The average percentage of alleles correctly assigned a breed origin was 95.76%, resulting in a high core-based mean accuracy of 0.99 and a very high consensus-based mean accuracy of 1.00. The developed algorithm does not require prior pedigree knowledge and is, hence, straightforward to apply in LMIC breeding programs.

Methods

The design of the breeding program and development of the allele assignment algorithm involved two steps.

We designed a breeding program and simulated genotype data on which we tested the algorithm's performance. The simulated genotype data had an ancient cattle founder that is assumed to have split into African (local) and European (exotic) cattle populations. After generations of mating within and between local breeds and the importation of exotic bulls, crossbred dairy cows were created to mimic the dairy cows kept by smallholders in the LMICs.
We developed an allele assignment algorithm that traces haplotypes and assigns a breed origin for each allele of the crossbred cows. The haplotypes are phased and defined for different core lengths to improve the accuracy and efficiency of the allele assignment algorithm.

The following subsections describe the details for simulating and phasing genotypes and developing the allele assignment algorithm.

Simulation of genotype data

Genotype and haplotype data for an ancient cattle breed were simulated using the AlphaSimR package [11], designed for stochastic simulations of breeding programs. A total of 2500 individual animals with a genome structure of 1000 Single Nucleotide Polymorphisms (SNPs) in one autosomal chromosome were simulated. The ancient cattle breed split into two, each representing an exotic breed and an indigenous breed. The indigenous breed further split into four more closely related local founder populations. Variation in the demographic history of these founder populations were accounted for in the simulated biallelic haplotypes of the breeds using the Markovian Coalescent Simulator (MaCS) software [12] embedded in the AlphaSimR package [11]– [See Additional file 1, Script S1] for details. As described in the AlphaSimR, the genotypes and haplotypes of the descendants, i.e., the crossbred animals, were then derived from these haplotypes using simulated mating between the exotic and local breeds. After within and between breed random mating of indigenous animals for 10 generations, the 1000 best females were selected on genetic merit of a single hypothetical trait with a small amount of dominance (mean dominance degree of 0.1 and variance of dominance degree of 0.1) and heritability of h^2^ = 0.3. The 1000 selected local cows were then mated to 25 imported exotic bulls to produce the first crossbred animals (crossbred1). The local cows were allowed to calve twice producing a total of 2000 offsprings with the assumption of 1000 female and 1000 male calves. The breeding program continued by using all the 1000 female calves (crossbred1) as replacement heifers and mating these to 25 newly imported exotic bulls to produce the next crossbred cows (crossbred2), while both exotic and local populations were kept as purebred and source of purebred animals. This importation of exotic bulls and mating to the crossbred cows was repeated for up to five rounds of selections, hereafter referred as generations (Fig. 1). Simulated genotype and haplotype data were generated in 10 replicates.Fig. 1. Schematic representation of the simulated breeding program. A founder population on the top of the figure is split into exotic and local breeds

Genetic structure of the simulated SNP genotype data

To assess the genetic similarity between the founders and developed crossbred animals, we performed principal component analysis (PCA) of SNP genotypes on the simulated data. The PCA was performed using the prcomp command of the R statistical software [13]. Level of genetic differentiation between the founder populations (exotic and local) was also quantified using the global Wright’s F_ST_ statistic [14].

Phasing of simulated genotype data

True simulated genotype and haplotype data enabled us to calculate the phasing yield and allele assignment accuracy. From the genotype data, haplotypes were reconstructed and compared with the simulated haplotypes. The reconstruction of possible haplotypes from the genotype data via phasing was performed using the AlphaPhase software [15]. Different core and tail lengths govern the length of desired haplotype segments used to phase the alleles in the genotype data. As illustrated in Fig. 2, a core is a string of consecutive SNP loci used to phase a given genome region [15].Fig. 2. Illustration of a core and offset. Phasing was performed in two modes: either using the whole length of a core or by moving it forward 50% of the core length (offset) to define the beginning of a given core

Phasing of the simulated genotype data was performed using a wide range of core and tail lengths. Preliminarily analyses suggested that core lengths of 100–280 SNPs would yield optimum allele assignments. Therefore, for the final analyses, we defined 10 different core lengths centred around 190 SNPs (Table 1) and phasing was performed for each core length both in the offset and no-offset modes of the AlphaPhase [15]. We moved 50% of the core length forward to define Offset. In total, there were 200 scenarios: 10 (core lengths) × 10 (thresholds) × 2 (offset or no offset modes), each replicated 10 times.Table 1. Core lengths (in terms of numbers of SNPs) used to phase the genotype dataCore12345678910Core length (SNPs)100120140160180200220240260280

Development of allele assignment algorithm

To develop the allele assignment algorithm, we defined 10 different core lengths (Table 1). The alleles of crossbred animals were assigned a breed origin for each core length, and we call this core-based allele assignment. In the core-based assignment each allele could be assigned a breed origin as many times as there are different core lengths defined. If breed origin assignments of an allele were not the same across the different cores the most frequent breed assignment was considered as a consensus breed origin of an allele.

Core-based allele assignment

Haplotype libraries were simulated based on the phased purebred individuals in each population. As illustrated in Fig. 3, the assignment algorithm takes phased genotypes (Fig. 3a) for individuals in the crossbred population as inputs, along with haplotype libraries for the indigenous and exotic populations (Fig. 3b). To perform allele assignment, we determined whether the exotic or local haplotype library contained the best matching haplotype, i.e. the haplotype with the fewest number of mismatches with the target haplotype. The target haplotype is then assigned as originating from that haplotype library. If both haplotype libraries contain an equally good match, then the haplotype is set to missing. For example, in Fig. 3b, the target haplotype with a core length of 10 SNPs should be assigned as a local haplotype as it displays the best match, i.e. the fewest errors, with the last haplotype in the local haplotype library.Fig. 3. Phasing and assignment algorithm overview. a Phasing was performed using AlphaPhase, b Hypothetical haplotype libraries based on a core length of ten SNPs. To assign origin to the haplotype of an individual (bottom genotype sequence), the algorithm searches for the best match in each position in the exotic (top left genotype sequence) and local (top right genotype sequence) haplotypes. In this case, the individual’s haplotype should be assigned as a local haplotype because the local haplotype library contains the haplotype with the fewest number of errors, i.e., mismatches (red)

Consensus allele assignment

Allele assignment was compared to the true origin (simulated) in each phased genotype and each scenario. Phasing of simulated genotype data was performed in two modes: either using the whole length of a core, or by moving it forward by 50% of the core length (offset) to define the beginning of a given core (see next section). Assignment was performed across multiple core lengths and two modes of phasing (no offset and offset). Assignment results of each core and mode of phasing were compared and merged across cores to calculate consensus-based assignment. The most frequently observed assignment across all the replicates, core lengths, and phasing modes was then taken as the consensus-based assignment.

To optimise and fine-tune the algorithm’s sensitivity, we applied 10 different thresholds for best SNP count of match of haplotypes (Table 2). When the threshold was 0.9, this meant that the breed assignment for the allele needed to be consistent across 90% or more of the scenarios, otherwise the assignment was set to missing. To elaborate a threshold of 50%, an allele would have been assigned a breed origin of “local” if the allele had been assigned to breed “local” in more than 50% of the assignments across all the different core lengths and phasing modes. In every generation and in each replicate, every allele of the crossbred animals was assigned a breed origin in at least 200 scenarios and results were merged to calculate consensus assignment.Table 2. The different thresholds used for the best count of match of haplotypesThreshold12345678910%Matched0.500.550.600.650.700.750.800.850.900.95

Performance of the allele assignment algorithm

To evaluate the performance of the allele assignment algorithm, assignment yield and assignment accuracy were assessed in the following ways:

%Correct: the percentage of correctly assigned alleles was computed by comparing the algorithm-derived breed origin with the true BOA traced with the “pullIbdHaplo()” function of AlphaSimR [11].
%Incorrect: the percentage of incorrectly assigned alleles was computed by comparing the algorithm-derived breed origin with the true BOA traced with the “pullIbdHaplo()” function of the AlphaSimR [11].
%Unassigned: the percentage of alleles that were not assigned (unknown breed origin).
Assignment yield: the average number or percentage of alleles in crossbred animals successfully assigned to a breed origin
Accuracy: the accuracy of assigned alleles, calculated as the ratio of correctly assigned alleles to the total number of assigned alleles and incorrectly assigned alleles. We used this ratio as an allele assignment accuracy metric for each core and tail lengths.

Results

Genetic structure of the simulated SNP genotype data

From the simulated SNP genotype data, the mean estimated fixation index between the founder (exotic and local) populations was 0.09 indicating moderate differentiation between the breeds at the beginning of the breeding program. Principal component analysis (PCA) of the simulated SNP genotype data separated the crossbreds from the founder breeds (local and exotic breeds). As shown in the PCA plot (Fig. 4a), the first generation of crossbred animals (crossbred1) were positioned in between the founder populations (exotic and local). The PCA plot further revealed the genetic sub-structure from the crossbreeding program. As we continued the crossbreeding and increased the proportion of exotic genotypes, the crossbreds and the exotic breed were observed to converge into a single cluster (Fig. 4b).Fig. 4. Plot of principal component analysis of SNP genotypes (PC1 vs. PC2 and PC1 vs. PC3). a Genetic data structure of the founders and the first crossbred cows, b of all animals across generations

Allele assignment yield and accuracy

Allele assignment for each core

The average number of alleles in the crossbred animals assigned a breed origin is given in Table 3. The highest average number of unassigned alleles (29 out of 1000 SNPs) was observed in the first generation of the crossbred animals (crossbred1). The number of unassigned alleles decreased as the crossbreeding continued and the distance between the local founders and the crossbreds increased. For example, in crossbred5, where the germplasm is upgraded to almost the exotic breed, 23 out of 1000 SNPs were unassigned (Table 3).Table 3. Assignment yield and average number of alleles in crossbred cows assigned to local, exotic or not assigned at allCrossbredLocalExoticUnassignedAssignment YieldCrossbred1486486290.95Crossbred2246730240.95Crossbred3123853240.96Crossbred461916240.96Crossbred529947230.97Mean189786250.96

The genetic distance and core lengths had a clear effect on the phasing and assignment yield. For longer core lengths (core length of 220–280 SNPs), we observed a more concise and higher phasing yield (Fig. 5a). A core length of 200 SNPs was observed to be optimal for allele assignment yield (Fig. 5b). The overall average allele assignment accuracy was 0.99 (Table 4). On average, more than 95% of the assigned alleles in the crossbred animals were correctly assigned, with only less than 2% of incorrectly assigned alleles (Table 4). Both, the incorrectly assigned and unassigned proportion of alleles, either because of missing or ambiguity, were less than 5% (Table 4).Fig. 5. Effect of core length on assignment yield. a Phasing yield was very high for all core lengths but more concise for longer core lengths (core length of 220–280 SNPs), b Assignment yield was optimal for a core length of 200 SNPsTable 4Percentages of alleles correctly assigned a breed origin (%Correct), incorrectly assigned (%Incorrect), missing or unassigned (%Unassigned), and accuracy of assignment (Accuracy) for each core-length (Core)Core% Correct% Incorrect% UnassignedAccuracy10094.701.353.950.9912094.891.123.990.9914096.141.112.750.9916096.401.162.440.9918097.391.251.350.9920098.351.360.290.9922097.271.461.270.9924094.221.544.240.9826092.321.695.980.9828095.931.842.230.98Mean95.761.392.850.99

Consensus allele assignment across all cores

On consensus, the average percentage of incorrectly assigned alleles was nearly zero (Table 5). The overall mean consensus-based assignment accuracy (accuracy = 1, Table 5) was higher than the overall mean core-based assignment accuracy (accuracy = 0.99, Table 4).Table 5. Consensus-based percentages of alleles correctly assigned (%Correct), incorrectly assigned (%Incorrect), missing or unassigned (%Unassigned) a breed origin, and accuracy of assignment (Accuracy) across all core-lengths and generation for each thresholdThreshold% Correct% Incorrect% UnassignedAccuracy0.5098.401.600.000.980.5598.160.781.060.990.6097.840.661.500.990.6596.260.423.321.000.7095.790.363.861.000.7593.560.256.191.000.8092.940.206.861.000.8589.100.1210.781.000.9088.410.1011.481.000.9581.670.0818.261.00Mean93.210.466.331.00

Effect of admixture level and thresholds on assignment yield and accuracy

The threshold level had the opposite effect on assignment yield and accuracy (Fig. 6). Increasing the threshold decreased the assignment yield and increased the accuracy, whereas a less stringent threshold generated higher assignment yields. Increasing the threshold stringency further reduced the assignment yield (Fig. 6a). On the contrary and as expected, lowering the threshold reduced the accuracy (Fig. 6b).Fig. 6. Percentage of allele assignment yield and accuracy of assignment. a Yield using the consensus-based allele assignment algorithm as a function of threshold level, b accuracy using the consensus-based allele assignment algorithm as a function of threshold level

The effect of admixture level on assignment yield and accuracy was not as clear as that of threshold level. However, the assignment yield appeared to increase from the first to the later generations of crossbreds (Fig. 7a). On the other hand, the higher threshold stringency decreased the assignment yield (Fig. 7b).Fig. 7. Percentage of allele assignment yield and accuracy of assignment. a Yield using the consensus-based allele assignment algorithm as a function of crossbreeding (admixture) level, b accuracy using the consensus-based allele assignment algorithm as a function of crossbreeding (admixture) level

Discussion

In low- and middle-income countries, such as those in Eastern Africa, a large proportion of dairy production is carried out by smallholders who keep fewer than 10 cattle [16]. These cattle are mostly crosses between indigenous African breeds and exotic dairy breeds, with little phenotypic or pedigree data available [16]. Despite the need and efforts to increase the productivity of those dairy cattle, it has not been possible to implement conventional breeding programs in these populations. In populations with no or poor pedigree and phenotype records, genomic selection and other novel methods, such as an efficient algorithm to assign the BOA in those crossbred animals, are of interest. To evaluate the performance and adaptability of the crossbreds in the LMICs, methods to accurately identify the BOA on both the individual level and the individual genetic variant are important. Such methods could also provide ways to predict the effectiveness of foreign germplasm in a low-input production system [5]. For the smallholder farmers in Eastern Africa, providing methods to assign a BOA would enable better choice of exotic bulls to introduce and which local bulls to use to sustainably harness the genetics of local adaptation traits of the indigenous breeds and the high milk yield potential of exotic dairy breeds.

Different genomic tools and algorithms [6, 9, 10] have been developed to assign a breed origin to alleles in crossbred populations without needing pedigree records. Previous studies developed algorithms for detecting BOA in crossbred pigs (three-way crossing) [6] and in crossbred cows (rotational crossing) [9], both require no prior knowledge of pedigree. Similarly, our approach does not require pedigree to accurately assign a breed origin to alleles in crossbred animals. Similar with the current study where haplotypes are defined for different core lengths, Erikson et al. [9] developed their method based on comparison of haplotypes in overlapping windows. We have developed an algorithm to assign alleles a breed origin in a dairy cattle breeding program that would represent haphazardly admixed local cows and imported exotic bulls as commonly practised in LMICs. As shown in Fig. 1, we used the exotic bulls as a source of purebred genotype data to cross with the admixed local cows for five subsequent generations. The simulated genotypes of exotic purebred and local admixed breeds were phased and the haplotypes and associated alleles of the newly created crossbred cows were assigned a breed origin. The mean F_ST_ value of 0.09, suggested that the genetic differentiation between the founder populations was moderate at the beginning of the breeding program. This moderate level of genetic differentiation can be attributed to the random mating (within local and within exotic breeds) that occurred before the start of the crossbreeding program. The within exotic and within local population mating can be considered as artificial selection which can be responsible for the fixed genetic differences among them [17].

In agreement with earlier studies in crossbred pig and dairy cattle populations [6, 9], our results demonstrated that alleles of admixed crossbred cattle populations could be accurately assigned a breed origin without the need for pedigree records.

The assignment of alleles to a breed origin was performed according to haplotypes defined by different core lengths. The use of multiple core lengths allows for a more robust and accurate assignment of breed origin to alleles as explained in [15]. By testing different core lengths, we can optimize the phasing process and ensure that the algorithm performs well across various genetic distances and admixture levels. This approach helps to mitigate the risk of incorrect assignments due to variations in haplotype lengths and genetic diversity. The optimal core length centred at about 190 SNPs was identified through this process, which provided the highest assignment yield and accuracy. In a simulation study, Vandenplas et al. [6] assessed the impact of core length and observed higher assignment yield for haplotypes of longer core lengths. While this appears to be supported in our results, a core- and tail-length of 200 SNPs was observed as the optimal length for maximum assignment yield. Similarly, the impact of genetic relationship on assignment yield is comparable to values reported in simulated and empirical studies. Using simulated data, Vandenplas et al. [6], showed that a greater distance between breeds favourably affected the percentage of allele assigned, which is consistent with the highest percentage of allele assignment yield observed in crossbred5 (97%, Table 3) that are relatively distant to the local pure breeds.

The accuracy of allele assignment, both in the core-based (0.99, Table 3) and consensus-based (1.00, Table 4), across all scenarios was very high. This allele assignment accuracy is better than the results obtained from simulated (0.98) and empirical (88.57–92.45%) data [6]. The performance of the current algorithm is better than reported allele assignment accuracies of 96% using STRUCTURE 2.2 and 85% using GENECLASS 2 reported by Negrini et al. [18]. The relative performance improvement could be attributed to the optimization process of developing the current allele assignment algorithm. For example, the BOA in crossbred animals was determined after an allele assignment was evaluated for every core and haplotype library in different scenarios to reach a consensus assignment. The choice of threshold for best SNP match in haplotypes can also affect the algorithm’s assignment yield and accuracy. Instead of using fixed allele frequency and best SNP matches to assign a breed origin to alleles, the observed expected trade-offs between assignment yield and accuracy (Fig. 6) have been optimized. When the best SNP match counts in haplotypes are too low, there will be a high assignment yield but low accuracy and vice versa. In the current simulated genotype data, the best SNP match count threshold of 50–60% appeared optimal.

Despite some suggestions to use breed origin of haplotype instead of allele to reduce the effects of incorrect allelic assignments [6], the current algorithm was able to assign a breed origin to alleles as accurate as the assignment of a breed origin to haplotypes. The developed algorithm can be used to determine a BOA as needed in genomic prediction models where breed-specific effects are estimated [19, 20]. The developed algorithm can also be used in modelling breeding programs of admixed populations. Accurate breed identification, on both the level of the individual and of the individual genetic variant is critical to achieving sustainable breed improvement. In the current simulation study, we developed an algorithm, which assigns alleles in crossbred dairy cows to the haplotypes of likely constituent breeds, i.e. either to exotic or local breeds. With high accuracy of assigning the breed of origin to alleles, we may be able to introduce a resilient or adaptive haplotype into the crossbred cows. In livestock, we infer haplotypes from multigenerational pedigrees from which tracing BOA can be challenging. With the developed algorithm, alleles in crossbred animals could be accurately assigned a breed of origin without the need for a multigenerational pedigree. However, we acknowledge that our simulation is simpler compared to more complex crossbreeding schemes. For instance, in Girolando cattle, QTLs associated with 305-day milk yield were first identified, followed by an investigation of SNP-specific variances for the Holstein and Gir breeds of origin [21]. Despite the simplicity of our crossbreeding scheme, we believe our algorithm is capable of handling more complex scenarios, such as cases where sires are also crossbred [21]. In such situations, the algorithm would classify the sires as 'crossbred' and include them as candidate animals for breed-of-origin analyses. The breed of origin assignment would be for the sires themselves, instead of the crossbred cows. This process would then retrospectively trace the ancestry across the specified number of generations to assign breed of origin.

It is important to acknowledge that the African dairy cattle populations are characterized by extensive crossbreeding involving many breeds of Taurine and Indicine origin. This broad genetic diversity may challenge the accurate estimation of SNP effects despite the accurate assignment of BOA. While the BOA method relies on the recent local ancestry for each SNP marker allele, it ignores historical ancestry, which is important for estimating SNP marker effects across many breeds with different genomic histories. Furthermore, the BOA method does not take full advantage of linkage information (correlation between nearby SNP markers) and does not fully reflect the underlying genomic history of a study population [22]. Future studies developing algorithms and methods that consider the BOA and the genomic history of individuals and that would work for any level of crossbreeding and admixture in a population will be needed.

Conclusions

The developed algorithm assigns a breed origin to alleles with an accuracy of 99% in admixed animals from a crossbreeding program designed to mimic breeding programs in the LMICs. The algorithm is straightforward in its application and does not require prior knowledge of pedigree and relationships between crossbred and purebred animals, making it relevant and applicable in breeding programs practised in LMICs. However, it should be noted that the algorithm was developed and tested on simulated data. Further studies are required to test and apply the algorithm on real data.

Supplementary Information

Additional file 1.Additional file 2.Additional file 3.Additional file 4.