Genome wide association studies in yam reveal the challenge of high heterozygosity
Paterne A. Agre, Kwabena Darkwa, Iseki Kohtaro, Ryo Matsumoto, Asrat Asfaw

TL;DR
This study explores how genetic diversity in yam affects tuber yield and identifies key genetic markers linked to high yields.
Contribution
The study introduces a PIC-based approach to genomic selection in highly heterozygous yam genomes.
Findings
High-PIC markers had greater influence on trait associations in white Guinea yam.
Twelve stable SNPs were significantly associated with fresh tuber yield.
Heterozygosity-rich regions were linked to high-yielding genotypes.
Abstract
Yam (Dioscorea spp.) is a herbaceous vine crop valued for its starchy tubers, which are rich in essential nutrients. Its genome is highly heterozygous, contributing to considerable genetic diversity and adaptability. Understanding the polymorphism information content (PIC) of genetic markers is critical for enhancing key agronomic traits such as yield. In this study, we conducted a genome-wide association analysis that accounts for heterozygosity to investigate fresh tuber yield variation in white Guinea yam (Dioscorea rotundata Poir). A total of 173 genotypes including 86 elite breeding clones, 77 genebank accessions, and 10 farmer varieteies were genotyped through whole-genome resequencing, yielding approximately 1.6 million single nucleotide polymorphism (SNP) markers. Association analysis was performed using a multi-locus mixed linear model (MLM), incorporating kinship matrices…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8- —https://doi.org/10.13039/100000865Bill and Melinda Gates Foundation
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSeed and Plant Biochemistry · Genetic and Environmental Crop Studies · Agricultural pest management studies
Introduction
Yam (Dioscorea spp.) is a staple crop, significantly contributing to the dietary needs of millions across Africa, the Pacific and the Caribbean^1–3^. This crop is an especially valued food source that is highly socio-culturally relevant in Africa^4^. The current global production of yam stands at approximately 88 million metric tons, of which Africa accounts for about 98%^5^. Its achieved yield is on average around 8.5 tons per hectare, which is well below its potential yield of 50 tons per hectare^6^. Addressing this yield gap with state-of-the-art genetic improvement that employs effective, efficient and precise methods is imperative to enhance food security, particularly in the face of climate change and population growth challenges. Classical breeding programs for yam are impeded by crop’s complex genome, long breeding cycles, and high heterozygosity^7^. Genome-wide association study (GWAS) is among the powerful tools that revolutionized our understanding of the genetic basis of complex traits in various organisms and help to advance breeding efforts by facilitating the development of improved varieties with greater efficiency^8^. GWAS studies leverage on the natural variation in populations to identify genetic loci associated with phenotypic traits, offering insights into the genetic architecture underlying these traits^9^. However, highly heterozygous plant species such as yam, wherein individual plants harbor different alleles at a given locus, significantly influence the power of GWAS, both in terms of the accurate detection of the relevant genetic markers associated with key yam traits and interpreting the underlying genetic mechanism^7,10,11^. In populations with extensive heterozygosity, disentangling the effects of individual genetic variants on traits is challenging, particularly if epistatic interactions (alleles potentially interacting in non-additive ways) are prevalent^12^. This is particularly important for complex traits such as yield, which is highly influenced by multiple genetic factors^13^. Yield in yam is a complex trait influenced by numerous small-effect variants^14^. This inherent complexity necessitates refined methodology within GWAS to detect the trait-linked markers accurately. A series of studies have shed light on the potential of GWAS in the identification of genetic variants linked to key traits in yam^15–18^. However, none of the previous studies critically looked at the influence of heterozygosity on GWAS accuracy.
Given the challenge of correctly modeling the relationship between phenotype and genotype with GWAS, especially when heterozygosity is prevalent in the genome, it is crucial to design studies that maximize their potential to uncover meaningful genetic associations^20^. One approach is to stratify the population based on levels of heterozygosity, allowing for more targeted analyses that consider the genetic background of individuals. This stratification can help identify subsets where genetic effects are more easily detectable, reducing the noise introduced by complex genetic interactions^21^. Additionally, genotype imputation, population selection, Single Nucleotide Polymorphism (SNP) selection, and multi-population are the common strategies used to address the impact of heterozygosity during the GWAS analysis^22,23^.
In this study, we used SNP selection focusing on SNPs with varying levels of heterozygosity and employed a Mixed linear model (MLM) to perform trait association analysis. MLMs account for population structure and relatedness by incorporating kinship matrices derived from genetic markers. Heterozygosity influences these models, and using heterozygosity-based kinship matrices can improve their accuracy^23^. The main objective of this study is to assess the impact of the heterozygosity on the GWAS accuracy and the genome prediction for a complex trait like tuber yield in yam.
Results
Variation in fresh tuber yield per plant
The analysis of variance revealed highly significant differences (p < 0.01) among the genotypes, year and genotype by year interaction effect for tuber yield per plant (Table 1). Averaged across the year, tuber yield per plant varied from 0.117 to 2.617 with a mean of 1.05 kg plant^-1^ (Supplementary Fig. 1). Broad sense heritability for tuber weight per plant was high (0.66).Table 1. Mean square and heritability estimate for tuber yield per plant in the association panel evaluated across 2 years.TraitMean squaresError means squareMeanRangeH^2^Genotype (G)Year (S)G × STuber yield per plant (kg)0.563.000.26**0.191.052.230.66H^2^: Broad sense heritability, G x S: genotype by year interaction effect, **: significant at 1% probability level.
Summary statistics for marker data
Of the 136,000 SNP markers retained after filtering, the expected heterozygosity varied from 0.1 to 0.5, while the observed heterozygosity varied from 0.1 to 0.457. The Minor allele frequency ranged from 0.05 to 0.5, and the polymorphism information content varied from approximately 0.1 to 0.42, with 0.28 as an average. Markers are unequally distributed across the 20 yam chromosomes (Supplementary Table 1). The number of SNP markers per chromosome varies from 1825 to 13,992, with the highest density on chromosome 9 (Supplementary Fig. 2).
Genetic diversity and population structure
The genetic distances ranged from 0.06 to 0.36, and the highest genetic distance was reported between elite breeding clones and farmers varieties. The phylogenetic analysis of 174 Dioscorea rotundata accessions revealed clear clustering patterns corresponding to elite breeding clones, genebank accessions, and farmers’ varieties. Elite clones (red) were primarily grouped into two main clusters, indicating limited genetic divergence within this group. In contrast, genebank accessions (blue) showed a wider distribution across the tree, reflecting a higher level of genetic diversity. Farmers’ varieties (green) were distributed across several branches, occupying intermediate positions between elite clones and genebank accessions. A few accessions from different categories clustered closely together, suggesting potential genetic overlap among groups. Notably, the root of the tree was primarily formed by genebank accessions (Fig. 1).Fig. 1. Phylogeny tree-based IBS matrix displaying the relationship among elite breeding clones, genebank accessions, and farmers’ varieties.
Population structure analysis indicated potential grouping at k = 2 and k = 3 using the Bayesian Information Criteria (BIC) (Fig. 2). With a probability ancestry threshold of 70%, 157 genotypes were successfully assigned to three subpopulations at k = 3. The remaining genotypes, which had an ancestry probability below 70%, were considered admixed, reflecting genetic contributions from multiple subpopulations. At k = 3, subpopulation 1 (red) consisted of farmers varieties with no admixed genotypes. Subpopulation 2 (blue) included genebank accession, among which some genotypes had an ancestry probability of less than 70% and were considered as admixt. Subpopulation 3 (green) comprised of elite breeding clones and displayed the highest level of admixture (Fig. 2).Fig. 2. Population structure at K = 2 (a) and 3 (b). The color is associated with each cluster, and each bar represents the genotype.
Effect of markers polymorphism on the population structure
The population structure-based polymorphism level revealed a high impact on the population stratification (Fig. 3). Makers with low PIC displayed inadequate or inability to cluster properly compared to makers with high PIC (Figs. 3a). Using markers with a high PIC level, the 173 genotypes were grouped in three well-defined clusters (Fig. 3b).Fig. 3. Heatmap clustering based PIC. a: Population stratification using markers with low PIC < 0.2 (no clear population structure) and b: Population stratification using markers with high polymorphism information > 0. 2 content (clear population structure observed).
Trait-association for tuber yield per plant
We investigated the association between genetic markers and a specific trait using six different methods (mrMLM, FASTmrEMM, ISIS EMBLASSO, pLARmEB, pKWmEB, and FASTmrMLM). We employed both a naive model and a kinship-based model (KQ model) to identify relevant Single Nucleotide Polymorphisms (SNPs).
- Naive Model
In the naive model analysis, we identified 15 SNP markers distributed across 11 chromosomes (Fig. 4). These markers had high minor allele frequency (MAF) and high LOD scores but explained a low proportion of the total phenotypic variation (Table 2).Fig. 4. Manhattan and QQ plots of the tuber yield per plant using Naïve and K + Q models (The red and blue dots are for makers with significant LOD above the threshold.Table 2. Summary statistics of SNP markers link with the tuber yield per plant using different models and genetic methods.ModelsMethodsMarkersChrPosLOD-log10(P)r2 (%)MAFAllelesNaïvemrMLMchrom_07_25584966725,584,9667.59148.47261.340.1433Cchrom_09_654387496,543,87410.634311.58542.500.345Cchrom_15_183623331518,362,3337.48948.36781.230.3509Tchrom_19_9509263199,509,2636.17517.0142.450.1228Tchrom_19_240006991924,000,6998.39179.29361.800.3889Tchrom_20_156945642015,694,5646.41847.26523.50.3041TFASTmrMLMchrom_01_439022214,390,22211.680212.65091.280.1503Tchrom_01_28700972128,700,9726.79677.65520.230.2775Cchrom_06_831740368,317,4038.89399.8078Na0.4653Achrom_06_926028369,260,2837.90638.7959Na0.2572Cchrom_09_21942129921,942,1296.87977.7407Na0.2254Achrom_10_189409991018,940,99910.214511.1572Na0.3526Cchrom_19_9509263199,509,26316.73617.7823Na0.1243Tchrom_20_135108282013,510,8286.6767.5308Na0.4769Cchrom_20_156945642015,694,5646.19347.0329Na0.3064TFASTmrEMMAchrom_08_12524719812,524,71910.08911.029112.02170.1618Tchrom_10_233639991023,363,9996.36317.20814.27810.1387Tchrom_16_232367781623,236,7787.42768.30435.34320.1358GpLARmEBchrom_07_25584966725,584,9666.67977.5346Na0.1416Cchrom_09_654387496,543,87412.819913.8101Na0.3468Cchrom_14_4221541144,221,5416.14336.9811Na0.1156Gchrom_19_9509263199,509,2636.95297.816Na0.1243Tchrom_06_831740368,317,4036.12676.964Na0.4649Achrom_14_4221541144,221,5418.71419.6238Na0.117Gchrom_19_9509263199,509,2636.32947.17330.1228TBLASSOchrom_14_4221541144,221,5416.64447.4982Na0.1156Gchrom_19_9509263199,509,2636.3687.2132Na0.1243TK1 + QmrMLMchrom_07_26786068726,786,0688.69559.6047Na0.1462Achrom_09_19610000919,610,0008.68169.5905Na0.1316Tchrom_10_233639991023,363,9996.70517.5608Na0.1374Tchrom_11_179798981117,979,8988.39939.3014Na0.4503AFASTmrMLMchrom_06_926028369,260,2839.947710.8849Na0.2572Cchrom_11_117077481111,707,7487.81968.7069Na0.4335Tchrom_14_4221541144,221,5416.69597.5513Na0.1156Gchrom_19_9509263199,509,2638.84319.7558Na0.1243Tchrom_20_240566042024,056,6047.55768.4379Na0.3179TFASTmrEMMAchrom_06_926028369,260,2838.58269.48917.41360.2572CpLARmEBchrom_09_654387496,543,8746.35387.1985Na0.3468Cchrom_19_9509263199,509,2636.92947.7918Na0.1243Tchrom_01_13919773113,919,7738.339.2304Na0.4737Tchrom_09_12686050912,686,0506.49717.3464Na0.2982Achrom_20_240566042024,056,60411.095512.0555Na0.3158Tchrom_20_3216465203,216,4656.21327.0533Na0.2398CISIS EM-BLASSOchrom_06_926028369,260,2839.775810.7094Na0.2572Cchrom_14_4221541144,221,5416.12976.9671Na0.1156Gchrom_20_240566042024,056,6046.97647.8402Na0.3179TK2 + QmrMLMchrom_02_19256555219,256,5557.27748.1499Na0.1374Cchrom_08_479592084,795,9207.087.9469Na0.1053Tchrom_09_29476318929,476,3187.56478.44525.79640.2895Tchrom_12_4768042124,768,0426.45957.30760.1345AFASTmrMLMchrom_11_117077481111,707,7487.70988.59420.4335Tchrom_14_4221541144,221,5417.58828.46930.1156Gchrom_15_183623331518,362,3336.36647.21150.3526Tchrom_19_9509263199,509,26313.495614.49660.1243TFASTmrEMMAchrom_20_3216465203,216,46510.526911.475810.36580.237CpLARmEBchrom_09_654387496,543,8746.35387.1985Na0.3468Cchrom_19_9509263199,509,2636.92947.7918Na0.1243Tchrom_01_13919773113,919,7736.85417.7143Na0.4737Tchrom_09_12686050912,686,0508.23279.1306Na0.2982Achrom_20_240566042024,056,60411.255612.2186Na0.3158Tchrom_20_3216465203,216,4656.05546.8903Na0.2398CISIS EM-BLASSOchrom_06_926028369,260,2839.775810.7094Na0.2572Cchrom_14_4221541144,221,5416.12976.9671Na0.1156Gchrom_20_240566042024,056,6046.97647.8402Na0.3179TK3 + QmrMLMchrom_03_15679299315,679,2998.65329.5614Na0.1725CmrMLMchrom_03_353242233,532,4226.32697.1707Na0.2281AmrMLMchrom_08_500333885,003,3386.85257.71272.13010.2895AmrMLMchrom_09_654387496,543,8746.07076.9061Na0.345CmrMLMchrom_10_233639991023,363,9997.8518.7391Na0.1374TFASTmrMLMchrom_06_926028369,260,2836.34187.1861Na0.2572CFASTmrMLMchrom_09_654387496,543,8747.10577.9733Na0.3468CFASTmrMLMchrom_11_117077481111,707,7487.91798.8078Na0.4335TFASTmrMLMchrom_14_4221541144,221,5418.90179.8158Na0.1156GFASTmrMLMchrom_19_9509263199,509,2639.153610.0735Na0.1243TFASTmrMLMchrom_20_239962012023,996,2016.35997.2048Na0.2948CpLARmEBchrom_09_654387496,543,8746.35387.19850.3468CpLARmEBchrom_19_9509263199,509,2636.92947.79180.1243TpKWmEBchrom_17_131174471713,117,4477.07697.94370.35380.2456ApKWmEBchrom_20_240566042024,056,6048.72669.63660.3158TpKWmEBchrom_20_3216465203,216,4657.85668.74490.2398CpKWmEBchrom_20_3389269203,389,2697.87278.76140.3421CISIS EM-BLASSOchrom_06_926028369,260,2839.775810.70940.2572CISIS EM-BLASSOchrom_14_4221541144,221,5416.12976.96710.1156GK4 + QISIS EM-BLASSOchrom_20_240566042024,056,6046.97647.84020.3179TmrMLMchrom_08_500333885,003,3386.50567.35512.46590.2895AmrMLMchrom_09_654387496,543,8748.73439.64451.750.345CmrMLMchrom_18_111409651811,140,9657.5368.41571.71340.2661GFASTmrMLMchrom_06_926028369,260,2836.34187.18612.120.2572CFASTmrMLMchrom_09_654387496,543,8747.10577.973311.20.3468CFASTmrMLMchrom_11_117077481111,707,7487.91798.80782.450.4335TFASTmrMLMchrom_14_4221541144,221,5418.90179.81581.340.1156GFASTmrMLMchrom_19_9509263199,509,2639.153610.07351.260.1243TFASTmrMLMchrom_20_239962012023,996,2016.35997.20481.090.2948CpLARmEBchrom_09_654387496,543,8746.35387.19856.270.3468CpLARmEBchrom_19_9509263199,509,2636.92947.791812.230.1243TpKWmEBchrom_17_131174471713,117,4477.07697.94378.230.2456ApKWmEBchrom_20_240566042024,056,6048.72669.63663.560.3158TpKWmEBchrom_20_3216465203,216,4657.85668.74497.170.2398CpKWmEBchrom_20_3389269203,389,2697.87278.76148,900.3421CISIS EM-BLASSOchrom_06_926028369,260,2839.775810.70942.120.2572CISIS EM-BLASSOchrom_14_4221541144,221,5416.12976.96711.560.1156GISIS EM-BLASSOchrom_20_240566042024,056,6046.97647.84023.60.3179T
- KQ Model with Different Kinship Matrices.
We then applied the KQ model with kinship matrices, incorporating varying levels of markers with different PICs alongside the population structure as covariates.
- Using a kinship matrix for markers with low PIC below 0.1, we identified 13 unique SNP markers spread across 9 yam chromosomes. All six methods consistently detected these markers. However, a significant discrepancy was observed between the expected and observed log-likelihood values (Fig. 4).
- When the kinship matrix included markers with PIC between 0.1 and 0.2, 14 unique SNPs were identified across 11 chromosomes. These yield-linked markers displayed high LOD scores but also exhibited a substantial inflation between observed and expected LOD values.
- Finally, incorporating the kinship matrix originating from high PIC as a covariate in the KQ model analysis revealed 12 unique markers associated with tuber yield (Table 2). These markers had high MAF and explained a significant portion of the phenotypic variance. Interestingly, the markers identified using a high heterozygosity kinship matrix showed a strong correlation between the expected and observed LOD values (Fig. 4). It is worth noting that several markers were identified across the six genetic models (Fig. 5). Fig. 5. Network Analysis displaying common markers across the six genetic models.
Markers effect and yield prediction
For this only model K4 + Q model was used for the yield prediction. Our analysis identified three markers (M5, M8, and M9) from the K4 + Q model as non-informative for predicting yield across all yam genotypes (Fig. 6). The remaining nine markers displayed high segregation classifying the 173 yam genotypes into high-yielding and low-yield (Fig. 7). Analysis of the 9 SNP markers revealed two distinct clusters among the evaluated yam genotypes. These clusters corresponded to low-yielding and high-yielding genotypes, as indicated by the color key. Interestingly, the high-yielding cluster was predominantly composed of elite breeding clones.Fig. 6. Heatmap displaying the effect of the 12 markers associated with the tuber yield of white yam identified through the K4 + Q model.Fig. 7. Heatmap displaying the effect of the nine discriminant SNP markers associated with the tuber yield of white yam identified through the K4 + Q model.
Gene annotation and putative gene function
Using the K4 + Q model and annotation analysis, we identified ten genes in Table 3 potentially linked to tuber yield. These genes include those related to protein kinase activity (IPR000719, IPR011009, IPR002290, IPR017441) and Leucine-rich repeat domains (IPR006553). Interestingly, genes associated with the Brevis and Transcription factor BREVIS RADIX, N-terminal domain (IPR027988) were also found on chromosome 20 (Table 3).Table 3. List of putative genes associated with tuber yield with their respective function.MarkersChrGene PosGene IDPutative gene/enzymeFunctionschrom_06_926028369,250,888IPR000719Protein kinase domainEnhancing the tolerance of plant to drought and water stress thereby improving the yieldchrom_08_500333885,000,118IPR011009Protein kinase-like domainchrom_09_500333895,001,008IPR006553Leucine-rich repeat, cysteine-containing subtypeIt confers resistance virus in potato thereby enhancing tuber development and yieldchrom_11_117077481111,705,276IPR002290Serine/threonine- / dual specificity protein kinase, catalytic domain- Participate in the process of starch and sugar biosynthesis in potatoes and stimulated glucose pyrophosphorylase- Stimulate some enzymes in the starch biosynthesis pathways in potato and wheatchrom_14_4221541144,221,005IPR017441Protein kinase, ATP binding siteEnhancing the tolerance of potato plant to drought and water stress thereby improving the yieldchrom_19_9509263199,508,211IPR0135591Brevis radix (BRX) domainIt regulates cell proliferation and elongation in the rootIt involves in root growth in Arabidopsischrom_20_240566042024,052,045IPR027988Transcriptional factor BREVIS RADIX, N-terminal dormainCharacterized as being a transcription factor in plants regulating the extent of cell proliferation and elongation in the growth zone of the rootInvolved in cytokinin-mediated inhibition of lateral root initiation in ArabidopsisPromotes shoot growth in Arabidopsis
Discussion
In this study, we employed six different genetic models in the multi-random mixed linear (mrMLM) GWAS model to identify SNPs that were significantly associated with tuber yield per plant in White Guinea yam, considering various levels of heterozygosity.
Phenotypic yield variation and population structure
The observed variation in yield among the yam genotypes highlights the potential for genetic improvement through breeding programs. The population structure analysis through the admixture and clustering revealed the presence of three well-distinct clusters, suggesting the presence of genetic variation that can be exploited for breeding purposes. Traits association analysis relies on a population panel with a high diversity. Similar trending results were obtained in the previous study conducted on trait association by^15,19,24^.
Marker polymorphism and trait association
We performed trait association analyses by evaluating different levels of marker polymorphism using a range of genetic models from the naïve (uncontrolled) model to the more robust K (kinship) + Q (population structure) model. Our results revealed that the influence of population structure on marker-trait associations was strongly affected by the level of polymorphism, as reflected by the Polymorphism Information Content (PIC). Markers with higher PIC values tend to capture greater allelic diversity, thereby increasing the potential to detect rare alleles within genetically heterogeneous populations. Such alleles, although infrequent, can have significant effects on complex traits and are particularly valuable in breeding programs aiming to exploit novel genetic variation. As demonstrated in other yam studies by Shental et al.^25^ and Cormier et al.^26^, rare alleles identified through high-PIC markers are often associated with important agronomic traits, including yield components and stress adaptation.
In yam, several GWAS efforts have successfully identified quantitative trait loci (QTL) linked to tuber yield and related traits. For instance, Cormier et al.^26^) identified yield-associated loci in D. alata using GBS-based SNPs, while Bredeson et al.^27^ highlighted structural variation contributing to tuber size and shape. Similarly, Shenta et al.^25^ emphasized the role of population structure and heterozygosity in trait discovery, suggesting that careful control for these factors is essential for accurate association mapping. In our study, multiple SNP markers were significantly associated with tuber yield per plant under both naïve and controlled models. Among the tested models, the K + Q model showed the best fit, as illustrated by the Q-Q plots, which indicated improved control of false positives and a closer alignment between observed and expected LOD scores.
The benefits of maintaining heterozygosity in yam populations are particularly significant in the context of GWAS. As highlighted by Jia et al.^28^ and supported by our findings, high heterozygosity enhances the power of association studies by providing a broader range of alleles for analysis. This is crucial for dissecting the polygenic nature of yield and for uncovering QTLs with both large and small effects. The SNP markers identified in this study as significantly associated with tuber yield are promising candidates for marker-assisted selection (MAS) and genomic prediction efforts. They offer valuable targets for breeding programs focused on improving productivity in D. rotundata, especially under diverse environmental conditions.
Gene annotation and candidate genes
Functional annotation of the identified SNP markers revealed genes potentially involved in various processes influencing yield, such as starch biosynthesis, nutrient uptake, and stress tolerance^29,30^. Functional annotation of these genes can illuminate the biological pathways underlying yield variation in yam. By effectively addressing the challenge of high heterozygosity, yam breeding can leverage GWAS to identify genes that control desirable traits such as yield; this will ultimately lead to the development of superior yam varieties, contributing to enhanced food security and improved livelihoods for yam farmers. Further validation through functional studies is necessary to confirm the role of these candidate genes in yam yield.
Conclusion and perspective
Using six genetic models (Multi-locus random-SNP-effect Mixed Linear Model, Fast multi-locus random-SNP-effect EMMA (FASTmrEMMA), Iterative Sure Independence Screening EM-Bayesian LASSO (ISIS EM-BLASSO), Polygenic-background-control-based Least Angle Regression plus Empirical Bayes (pLARmEB), Polygenic-background-control-based Kruskal–Wallis test plus Empirical Bayes (pKWmEB), Fast mrMLM (FASTmrMLM) with varying levels of PIC, we assessed the impact of PIC on population structure and its implications for trait association in yam. Several markers significantly associated with tuber yield per plant were identified in White Guinea yam (Dioscorea rotundata). Annotation analysis of these markers revealed multiple putative genes with diverse functions related to plant growth and biomass accumulation. The functional validation of these candidate genes through approaches such as gene expression profiling or gene editing will be essential to confirm their roles and strengthen the association between the identified SNPs and yield. Overall, this study advances our understanding of the genetic architecture underlying yield in yam and provides a solid foundation for developing trait-linked markers to support marker-assisted selection (MAS). By integrating population structure analysis, genetic diversity assessment, and GWAS, we highlight the critical role of PIC in improving complex traits like yield, paving the way for more efficient and targeted yam breeding strategies.
Materials and methods
Genetic material and plant phenotyping
A comprehensive study was conducted using 173 genotypes of D. rotundata, comprising 86 elite breeding clones, 77 genebank accessions, and ten farmers’ varieties. The detailed information regarding these genotypes has been previously documented by Darkwa et al.^31^. Over a span of 2 years, field experiments were carried for two consecutive years 2017–2018 and 2018–2019 at the International Institute of Tropical Agriculture (IITA/Ibadan Nigeria), situated at an altitude of 221 m and coordinates 07° 29.639″ N, 003° 54.092″ E. The experiment employed augmented row–column designs for the respective years. In the first year, 43 accessions were replicated four times, with the remaining accessions non-replicated in single-row plots consisting of two plants, spaced at one-meter intervals within rows, in plot areas of 2 m^2^. In the subsequent year, the trial utilized a single plant per plot arranged in a randomized block design, replicated three times. The experimental field was meticulously maintained weed-free through manual weed control throughout the growth phases of the plants. The phenotypic assessment of the accessions was focused on fresh root yield per plant at harvest, employing the yam ontology framework described by Asfaw^32^.
A two-stage analysis was conducted to generate the Best Linear Unbiased Estimates (BLUEs) for the GWAS using mixed linear model in Lme4 R package^33^ accounting for the different experimental designs used for each of the single trials.
The model used for the randomized block design: \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Y}_{ik}=\mu + {G}_{i}+ {{R}_{j}+Row}_{k}+{Col}_{l}+ {\varepsilon }_{ikl}$$\end{document}
Where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Y}_{ik}$$\end{document} = Phenotypic value, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu$$\end{document} = grand mean, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${G}_{i}$$\end{document} = effect of the ith genotype, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${R}_{k}$$\end{document} = effect of the kth replication, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varepsilon }_{ik}$$\end{document} = residual.
For the augmented design, the model was fitted by replacing the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${R}_{k}$$\end{document} with the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${B}_{j}$$\end{document} as the block effect in the following model:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Y}_{ik}=\mu + {G}_{i}+ {B}_{j} {+ Row}_{k}+{Col}_{l}+ {\varepsilon }_{ikl}$$\end{document}The genotype was considered as fixed effect in the models and the best linear unbiased estimates (BLUES) were generated. The genotype was first fitted as a random effect in each of the models to estimate the heritability of tuber yield per plant according to the formula described by Cullis et al.^34^.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${H}_{Cullis}^{2}=1-\frac{{V}_{\Delta }^{BLUP}}{2{\sigma }_{g}^{2}}$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${V}_{\Delta }^{BLUP}$$\end{document} is the average standard error of the clonal genotypic BLUPs (best linear unbiased predictions) and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\sigma }_{g}^{2}$$\end{document} , is the clonal genotypic variance.
A linear mixed model was fitted in the second stage analysis as:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${y}_{ijk}=\mu + {G}_{i}+ {Y}_{j} + {(GY)}_{ij}+\upepsilon$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${y}_{ijk}$$\end{document} is the adjusted means (BLUEs) of the ith genotype in year j obtained from the linear mixed model in stage 1, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu$$\end{document} = grand mean, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${G}_{i}$$\end{document} = main effect of the ith genotype, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Y}_{j}$$\end{document} = main effect of the jth year, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${(GY)}_{ij}$$\end{document} = is the genotype i in year j interaction effect, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\upepsilon$$\end{document} = the error associated with the estimation the adjusted means with variance matrix assumed known from stage 1. The genotype was considered as fixed effect and a weighted-combined mixed model fitted using BLUES and weights obtained from the first stage analysis to obtain the adjusted means of each genotype^35^.
DNA extraction and SNP calling
DNA extraction and SNP calling processes were conducted at the Iwate Biotechnology Research Center (IBRC-Japan). Lyophilized leaves were sent for DNA extraction, library construction, and whole-genome resequencing. Total genomic DNA was extracted from the leaf samples using a NucleoSpin Plant II Kit following the manufacturer’s protocol (MachereyNagel GmbH & Co), with minor adjustments.
The generated paired-end sequencing reads in fastq format were aligned to the D. rotundata reference genome version 2 using Hisat2. SAM files were converted to BAM format and sorted by name using SAMtools. In cases where multiple sequencing samples originated from the same biological clone, the sorted BAM files for each clone were merged using SAMtools. Duplicate reads were identified and read groups were added using the Picard package (v2.17.5).
GATK (v3.8–0) was employed for indel realignment, variant calling (HaplotypeCaller in gVCF mode), and joint genotyping (GenotypeGVCFs). The resulting VCF file was filtered based on minor allele frequency (MAF > 0.1) and absence of missing data at the genotype and SNP marker levels. Only bi-allelic SNP markers with a genotype quality > 20 and a read depth > 5 were retained post-filtering using vcftools and plink. Subsequently, the SNPs underwent linkage disequilibrium (LD) pruning (LD pruning was conducted with PLINK using –indep-pairwise 50 5 0.5 to reduce redundancy) with parameters including a window size of 50 bp in SNPs, a step size of 5 for shifting the window, an R-square value of 0.5, resulting in the retention of 136,429 SNP markers for all subsequent analyses.
Genetic diversity and population structure analysis
Various population genetic analysis methods were conducted to explore the structure and level of genetic diversity in the study material. The SNP distribution and the density were estimated using the ‘Cmplot’ function implemented in the CMplot R package^36^. Summary statistics such as the minor allele frequency (MAF), the observed heterozygosity (H), and the polymorphism information content (PIC) were estimated using the function "–freq" and "–hardy" using PLINK V1.90^37^. To understand the relationship among and between genebank accession, farmers varieties and elite breeding clones, we conducted diversity analysis using phangorn library package,.
The Structure software version 2.3.3 was used to cluster samples into populations. Structure simulations were conducted using an admixture model with a burn-in period of 20,000 iterations and a Markov chain Monte Carlo (MCMC) set at 20,000 iterations. The simulations were repeated three times for K-values ranging from 1 to 10. The optimal subpopulation model was determined by applying the informal pointers (e.g., geographical origin) proposed by Pritchard et al.^38^ and Falush et al.^39^ considering ΔK, a second-order rate change with respect to K, as defined by Evanno et al.^40^ and implemented in STRUCTURE HARVESTER. The most likely value of K was then determined, and the Structure population was plotted using the barplot function in R. The phylogenetic tree was constructed using the ape version 5.0 package in R.
Genome-wide association study
Genome-wide association studies (GWAS) were conducted using six multi-locus models implemented in the R package mrMLM v4.0.2^41^. These models included:
- Multi-locus random-SNP-effect Mixed Linear Model^42^
- Fast multi-locus random-SNP-effect EMMA (FASTmrEMMA)^43^
- Iterative Sure Independence Screening EM-Bayesian LASSO (ISIS EM-BLASSO)^44^
- Polygenic-background-control-based Least Angle Regression plus Empirical Bayes (pLARmEB)^41^
- Polygenic-background-control-based Kruskal–Wallis test plus Empirical Bayes (pKWmEB)^43^
- Fast mrMLM (FASTmrMLM)^44^
To account for false positive discovery, we incorporated a population structure matrix (Q) and different kinship matrix generated based on different markers heterozygosity levels (Fig. 8). The Kinship matrices were generated using the VanRaden method in GAPIT. The mrMLM package (v4.0.2) was used to estimate the percentage of phenotypic variation explained by each associated marker (R^2^) and their corresponding effects (https://cran.r-project.org/web/packages/mrMLM/index.html).Fig. 8SNP Filtering and graphical display of GWAS approach.
Gene annotation and putative gene prediction
Putative genes closely linked to the trait markers were investigated using the GFF3 file generated from the yam reference Genome v2^10^. The genes were searched in the region of 500Kbp, and their respective function was documented using the NCBI and INTERPRO websites for gene ontology documentation. To validate the maker’s ability to predict the tuber yield as an in-vivo method, we considered all markers as factors, and a heatmap was defined, while the marker effect was detected based on color gradient.
Supplementary Information
Supplementary Information.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Food and Agriculture Organization of the United Nations “Yam production – FAO” . Food and agriculture organization of the United Nations, “production: Crops and livestock products” [original data]. Retrieved August 11, 2024 from https://ourworldindata.org/grapher/yams-production (2023).
- 2Li Lin-Yin C Mplot: Circle manhattan plot_. R package version 4.5.1, https://CRAN.R-project.org/package=C Mplot (2024).
- 3Zhang, Y., Wang, J., Li, P., Zhang, Y. mr MLM: Multi-Locus random-SNP-Effect mixed linear model tools for GWAS_. R package version 5.0.1, <https://CRAN.R-project.org/package=mr MLM (2022).
