Cutting through the clones: genomic strategies for core collection development in moso bamboo
Rui Gu, Songpo Wei, Shaohui Fan, Martha Rendón-Anaya, Guanglu Liu

TL;DR
This study identifies the best strategy to create a diverse and representative core collection of moso bamboo using genomic data from 432 accessions.
Contribution
The study introduces a novel stratified sampling strategy (S-HE20) that optimizes genetic diversity in core collections of moso bamboo.
Findings
Moso bamboo has three distinct genetic clusters and a hybrid group based on genomic analysis.
The S-HE20 strategy maximized genetic diversity and captured broad phenotypic variation in the core collection.
The resulting core collection includes 84 accessions from 15 geographic regions.
Abstract
Conserving genetic diversity is crucial for effective germplasm use and crop improvement. Developing core collections with minimal redundancy and maximum diversity requires a clear understanding of population structure. However, the nationwide population structure of moso bamboo (Phyllostachys edulis) remains poorly characterized, creating a major gap for developing representative, non-redundant core collections. Using whole-genome resequencing data from 432 moso bamboo accessions covering a broad geographic range across the distribution of the species in China, we investigated the population genetic structure and diversity patterns. Principal component analysis and phylogenetic tree analyses identified three distinct genetic clusters together with a hybrid group. To identify the optimal strategy for core collection development, we evaluated two stratification schemes, seven sampling…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3- —Swedish University of Agricultural Sciences
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBamboo properties and applications · Genetic diversity and population structure · Genetic and phenotypic traits in livestock
Background
Moso bamboo (Phyllostachys edulis (Carrière) J. Houz.) is a fast-growing and highly valuable non-timber forest species, widely utilized in construction, furniture, papermaking, and handicrafts due to its excellent mechanical properties [1]. It accounts for approximately 70% of China’s total bamboo cultivation area, covering around 5.28 million hectares [2]. Moso bamboo plays a crucial role in the national bamboo industry, which was valued at over USD 21 billion in 2023 [3]. Its wide distribution and predominantly clonal reproduction have led to extensive cultivation but also increased vulnerability to climate change and habitat fragmentation. Despite its economic significance, current germplasm resources are large but poorly characterized, and several key gaps limit their effective use: unclear population structure, lack of well-defined core collections, and a narrow genetic base in breeding programs. These limitations constrain the sustainable growth of the bamboo industry.
The concept of a core collection—a representative subset of accessions that retains maximum genetic diversity with minimal redundancy—was first proposed by Frankel [4]. Such collections provide efficient entry points for genetic studies and breeding, and have been successfully applied in QTL mapping, genetic mapping, and gene discovery in crops including soybean [5, 6], maize [7], Chinese cabbage [8], and jujube [9]. Developing a core collection generally comprises four main steps: (1) data acquisition (molecular and/or phenotypic), (2) accession grouping based on genetic or ecological criteria, (3) strategic sampling, and (4) evaluation of diversity metrics such as allele richness or phenotypic variance [10, 11]. Despite their utility in crop breeding, core collections remain underutilized in forest trees—especially clonal species like bamboo. Existing efforts face three key limitations: (1) most rely solely on either phenotypic traits or molecular markers, which only partially capture the underlying genetic variation [12, 13]; (2) non-stratified random sampling is still commonly used, potentially overlooking population substructure [14, 15]; (3) traditional algorithms such as the M-strategy, PowerCore, and GenoCore often optimize a single objective (e.g., maximizing allelic richness), lacking flexibility to balance multiple goals such as genetic distance, trait variation, and rare allele retention [16–18]. Recently, Core Hunter v3 has been widely adopted as an effective tool for multi-objective core collection optimization, offering flexible metrics and diverse sampling strategies [19]. It incorporates both distance-based metrics—such as entry-to-nearest-entry (E-NE) and accession-to-nearest-entry (A-NE)—and diversity indices such as Shannon’s diversity index (SH) and expected heterozygosity (HE), enabling simultaneous optimization of representativeness and diversity. The use of different genetic distance measures, including Modified Rogers’ Distance (MR) and Cavalli–Sforza and Edwards’ distance (CE), further refines sensitivity to allele frequency patterns and rare variants.
In this study, we conducted the first comprehensive effort to construct a core collection for moso bamboo. Using whole-genome resequencing data from 432 accessions representing a broad geographic range of the species’ distribution in China, we analyzed population structure and genetic diversity. We then evaluated multiple core collection strategies using Core Hunter v3, testing: (1) stratified versus non-stratified sampling, (2) a range of selection algorithms, and (3) different sampling proportions. Our work provides a scientific foundation for cost-effective conservation and enhanced utilization of moso bamboo genetic resources, while offering methodological guidance for core collection development in other clonally propagated forest species.
Results
SNP discovery
A total of 19.71 Tb of sequencing data were generated from 432 moso bamboo individuals, at a mean sequencing depth of 20.93 × (Additional file 1: Table S1). Sequencing reads were mapped to the chromosome-scale reference genome of moso bamboo, yielding an average mapping rate of 95.41% and a mean genome coverage of 92.06% (Additional file 1: Table S1). After quality control and variant filtering, 58.43 million high-quality SNPs were identified (Additional file 2: Table S2), corresponding to an average genome-wide SNP density of one SNP per 30 bp. Among these variants, 68.62% were located in intergenic regions, 5.79% in introns, and 1.33% in coding sequences (Additional file 3: Table S3). The transition/transversion ratio was 3.41, and the nonsynonymous/synonymous ratio for all biallelic coding SNPs was 1.43 (Additional file 3: Table S3).
Population genetic structure
Principal component analysis (PCA) identified four distinct genetic clusters (Groups A–D). The first principal component (PC1, explaining 72.08% of total variance) clearly separated Group D from the other groups, while the second component (PC2, explaining 8.24%) distinguished Group A from Groups B and C (Fig. 1a). The unusually high proportion of variance explained by PC1 reflects the high clonal redundancy and pronounced long-term lineage divergence characteristic of moso bamboo. A similar pattern of genetic divergence was observed in the maximum likelihood (ML) phylogenetic tree, supporting the PCA results (Fig. 1b). ADMIXTURE analysis further confirmed four genetic clusters (K = 4) as the optimal grouping, based on the lowest cross-validation error (Fig. 1d). At K = 2, individuals were broadly separated into western (Group A) and eastern (Group D) lineages. At K = 3, a third cluster appeared, largely derived from the western lineage. At the optimal K = 4, a distinct ancestral component emerged almost exclusively within Group D, where it constituted the dominant proportion of ancestry. This Group D–specific component, represented by the deep-blue segment in Fig. 1e, corresponds to accessions primarily distributed in the southeastern coastal region and reveals substructure not captured by PCA or the ML phylogeny. Notably, individuals from Group C (hybrid group, 7 individuals) exhibited strong ancestral admixture from multiple clusters, suggesting genetic complexity and a possible hybrid origin. Geographically, Group D was primarily distributed in southeastern coastal regions, Group A in the southwestern highlands (transition zone between the Yunnan–Guizhou Plateau and western Sichuan Mountains), and Groups B and C in the north and northwest, including northern Guizhou, central to northern Sichuan, and southern Shanxi (Fig. 1c). Despite this, the overall genetic structure lacked a clear phylogeographic pattern, possibly due to China’s complex topography limiting historical gene flow, and recent human-mediated clonal propagation and translocation, which likely disrupted the original population structure.Fig. 1. Population genomic analyses of moso bamboo. a Principal component analysis (PCA) performed in PLINK v1.90 using 329,747 LD-pruned SNPs (MAF ≥ 0.05). Colors represent inferred genetic clusters and shapes indicate geographic regions. b Maximum-likelihood phylogenetic tree reconstructed using IQ-TREE v2 with the MFP model and 1,000 bootstrap replicates; the outer ring indicates the four major genetic groups. c Geographical distribution of all 432 accessions across China, colored by inferred genetic cluster. d Cross-validation error values for K = 2–16 from ADMIXTURE v1.3.0, showing K = 4 as the optimal clustering. e Population structure inferred by ADMIXTURE v1.3.0 at K = 2–5 using the 329,747 LD-pruned SNPs; each vertical bar represents one accession
Genetic diversity
To assess genome-wide patterns of genetic diversity and differentiation, individuals from the hybrid group were excluded from the analysis. Among the remaining three genetic groups, Group D exhibited the highest levels of genetic diversity, as indicated by the highest observed heterozygosity (Ho = 0.4967), expected heterozygosity (He = 0.3216), polymorphism information content (PIC = 0.2563) and Shannon’s diversity index (I = 0.4789) (Table 1). In contrast, Group B showed the lowest diversity, with the lowest Ho (0.3281), He (0.2942), PIC (0.2339) and I (0.4013) (Table 1). Across all groups, Ho exceeded He, reflecting an excess of heterozygosity. This is consistent with the predominantly clonal reproduction of moso bamboo, which preserves heterozygous genotypes, with historical admixture and genome duplication further reinforcing this pattern.Table 1. Genetic diversity parameters of the three groupsGroupHoHePICIJ′GroupA0.40230.29750.23630.42710.7288GroupB0.32810.29420.23390.40130.7471GroupD0.49670.32160.25630.47890.7346Ho Observed heterozygosity, He Expected heterozygosity, PIC Polymorphism information content, I Shannon’s diversity index, J′ Genetic evenness
Core collection optimization, development and evaluation
To assess the effectiveness of core collection selection methods, we compared seven optimization strategies under both unstratified and stratified sampling approaches across five sampling proportions (10%, 15%, 20%, 25%, and 30%). Genetic diversity indices, including PIC, He, Ho, I, and J′, were used as evaluation metrics. Hybrid accessions were excluded from the stratified sampling analyses because they show admixed ancestry and do not form a distinct population cluster. In addition, their low number (n = 7) makes proportional sample allocation across groups infeasible at multiple sampling intensities, which would compromise the assumptions of stratified sampling. In contrast, these accessions were retained in the unstratified framework, where sampling is performed across the full dataset without relying on predefined group structure.
Overall, stratified sampling consistently outperformed unstratified sampling in capturing genetic diversity. For instance, at a 10% sampling intensity using the HE method, stratified sampling yielded a higher Ho (0.5160 vs. 0.4632) compared to unstratified sampling (Table 2). Notably, under the EN-MR method, He dropped from 0.3479 (stratified) to 0.2038 (unstratified), highlighting stratified sampling's advantage in preserving within-group variation and avoiding redundancy from dominant genotypes (Table 2). Under the stratified sampling strategy, we further assessed the performance of different sampling methods across varying sampling proportions. Among all stratified sampling combinations, the HE method at 10% and 20% sampling proportions consistently showed the highest genetic diversity and evenness. Specifically, at 10%, the HE method produced the highest Ho (0.5160) and J′ (0.7658), while at 20%, it achieved the highest values of He (0.3665), PIC (0.2904), and I (0.5302), reflecting effective representation of genetic variability. Based on these results, the HE method under stratified sampling, particularly at 10% and 20% sampling proportions, was identified as the most effective strategy. Therefore, the S-HE10 (Stratified sampling with HE method at 10% proportion) and S-HE20 (Stratified sampling with HE method at 20% proportion) subsets were selected as candidate core collections.Table 2. Genetic diversity parameters under different sampling strategiesSampling ratio %Sampling methodsStratified samplingUnstratified samplingHoHePICIJ′HoHePICIJ′10%AN_CE0.48520.35090.27950.51570.74440.40080.28260.15420.51270.7412AN_MR0.48530.35090.27950.51580.74450.40180.28300.15330.51300.7416EN_CE0.47570.34770.27710.51030.73670.38080.20390.26570.46910.6920EN_MR0.47620.34790.27720.51050.73700.38070.20380.26530.46910.6918ENMR0.5_SH0.50.49960.35720.28340.52280.75530.42240.32690.22990.53290.7692HE0.51600.36160.28620.52940.76580.46320.36120.30430.55050.7949SH0.51510.36130.28610.52920.76500.46320.36120.27250.55050.794915%AN_CE0.48230.36180.28760.52310.75460.40050.25700.10570.49120.7104AN_MR0.48180.36180.28760.52280.75430.40060.26190.11630.49580.7165EN_CE0.47580.35980.28600.51950.74950.40320.28550.16530.50660.7314EN_MR0.47290.35860.28520.51770.74690.40700.29650.18620.51160.7385ENMR0.5_SH0.50.49190.36590.29000.52760.76140.44770.35630.27200.54080.7803HE0.50420.36900.29210.53300.76910.46490.36540.28070.54800.7907SH0.50320.36850.29180.53240.76830.46490.36540.28070.54800.790720%AN_CE0.48670.36140.28720.52280.75420.40180.25860.11350.48880.7060AN_MR0.48660.36130.28720.52270.75410.40220.25960.11580.48880.7061EN_CE0.47930.35870.28530.51870.74830.42710.32490.23560.52020.7505EN_MR0.47850.35840.28510.51800.74730.41970.31830.22470.51940.7494ENMR0.5_SH0.50.49440.36480.28940.52670.75980.45530.36210.28040.54060.7800HE0.50360.36650.29040.53020.76500.46420.36530.28160.54680.7890SH0.50030.36480.28940.52840.76240.46420.36530.28160.54680.789025%AN_CE0.48440.36210.28740.52300.75450.42070.30820.20920.51490.7429AN_MR0.48450.36210.28740.52300.75460.42070.30820.20920.51490.7429EN_CE0.48050.36050.28620.52070.75120.43610.34550.26440.52850.7624EN_MR0.48100.36070.28640.52090.75150.43630.34410.26260.52860.7626ENMR0.5_SH0.50.49400.36560.28960.52760.76120.46250.36610.28620.54020.7794HE0.50020.36650.29020.52980.76440.46450.36610.28430.54470.7859SH0.50010.36650.29030.52990.76440.46450.36610.28430.54470.785930%AN_CE0.48510.36090.28720.52280.75430.43610.33850.25450.52750.7611AN_MR0.48480.36080.28710.52270.75410.43610.33850.25460.52760.7611EN_CE0.48240.35950.28610.52080.75140.44640.35560.27680.53080.7659EN_MR0.48140.35930.28590.52040.75080.45040.35590.27810.52930.7636ENMR0.5_SH0.50.49160.36320.28860.52590.75870.46210.36460.28500.53900.7776HE0.49810.36430.28930.52830.76220.46390.36620.28530.54250.7827SH0.49800.36420.28930.52830.76220.46390.36620.28530.54250.7827Ho Observed heterozygosity, He Expected heterozygosity, PIC Polymorphism information content, I Shannon’s diversity index, J′ Genetic evenness
To evaluate the phenotypic representativeness of the selected core collections (S-HE10 and S-HE20), we assessed four statistical metrics across 15 phenotypic traits. Both collections exhibited an MD of 0, indicating no significant difference in mean trait values compared to the entire collection and satisfying the criterion of MD < 20% (Table 3). S-HE20 achieved a CR of 82.32%, surpassing the 80% threshold, whereas S-HE10 reached 72.82%, falling slightly below (Table 3). In addition, S-HE20 demonstrated the best performance in variance-based metrics (VR = 101.72%, VD = 80.33%). Collectively, these findings validate S-HE20 as the optimal core collection, with comprehensive phenotypic representativeness and full compliance with selection standards. The final core collection comprises 84 accessions distributed across 15 geographic regions, including 13 from Group A, 2 from Group B, and 69 from Group D (Fig. 1, Additional file 4: Table S4).Table 3. Evaluation parameters of core collections subsetSampling methodsSampling Ratio%MD%VD%CR%VR%HE10%013.3372.8295.9820%080.3382.32101.72MD% Mean difference percentage, CR% coincidence rate of range, VD% variance difference percentage, VR% coefficient of variation change
Validation of core collections
The core collection exhibited a higher Ho, while other genetic indices (He, PIC, I, J′) and all phenotypic indicators (maximum, minimum, mean, CV) showed no significant differences compared to the original collection, indicating effective preservation of genetic and phenotypic diversity (Table 4, Additional file 5: Table S5). In the genomic PCA, the core collection preserves the underlying genetic structure of the full dataset. The phenotypic PCA, meanwhile, shows reduced point density, reflecting the removal of phenotypically redundant accessions: the accession number decreased from 432 to 84 (an 80.6% reduction) while maintaining the overall distribution of the original collection, as indicated by a high coincidence rate (CR = 82.32%) and a mean difference of zero (MD = 0%) (Fig. 2). For phenotypic data, the eigenvalues, contribution rates, cumulative contribution rates, and major contributing traits were highly consistent between the core and original collections. The cumulative contribution rates of the first three principal components were 70.90% and 67.61%, respectively, indicating that the construction of the core collection not only preserved the main genetic information but also effectively eliminated genetic redundancy, thereby improving the cumulative contribution rate (Table 5). Notably, the core collection also retained accessions located at the edges of the phenotypic PCA space, demonstrating that it preserved not only common types but also unique or extreme phenotypes present in the full collection (Fig. 2). Core collection accessions were distributed across different clades in the phylogenetic tree (Fig. 2). These findings collectively suggest that the constructed core collection provides a comprehensive and representative sampling of the genetic diversity present in the entire population and geographic range, as depicted in Fig. 3.Table 4. Genetic diversity parameters of original and core collectionsCollectionHoHePICIJ'Original0.4754^b^0.3552^a^0.2829^a^0.5160^a^0.7444^a^Core0.5036^a^0.3664^a^0.2904^a^0.5302^a^0.7650^a^Ho Observed heterozygosity, He Expected heterozygosity, PIC Polymorphism information content, I Shannon’s diversity index, J' Genetic evenness. The same letter denotes that genetic diversity parameters do not differ significantly between the original and core collectionsFig. 2Evaluation of the representativeness of the core collection. a PCA plots showing the distribution of core and original collections, with the left plot based on genotypic data (329,747 LD-pruned SNPs) and the right plot based on 15 phenotypic data. Different colored dots represent core and original accessions, respectively. b Phylogenetic tree of the core collectionTable 5Principal component analysis of phenotypic traits for original and core collectionsComponentOriginal collectionCore collectionEigen valueContribution percentage %Cumulative contribution percentage %The top three contributing traitsEigen valueContribution percentage %Cumulative contribution percentage %The top three contributing traitsPC12.3938.1038.10DBH,CCD,BD2.5041.5541.55DBH,CCD,BDPC21.8021.5359.63LA,LW,LL1.7320.0261.56LA,LL,LWPC31.097.9767.61CW,HFB,IL1.189.3370.90HFB,IL,CWDBH Diameter at breast height, CCD Culm cavity diameter, BD Basal diameter, LA Leaf area, LW Leaf width, LL Leaf length, CW Crown width, HFB Height to the first branch, IL Internode lengthFig. 3Geographical distribution of the original collection (blue, n = 432) and the selected core collection (red, n = 84) across the range of moso bamboo in China. Blue dots represent all sampled accessions, whereas red dots indicate those included in the final core collection
Discussion
Although previous studies have investigated the genetic diversity of moso bamboo, most were limited to major production regions. In this study, we performed an extensive analysis using whole-genome resequencing data from 432 accessions collected across the broad distribution of the species in China. We found that the moso bamboo germplasm comprises three distinct genetic clusters along with a hybrid group (Fig. 1). Among them, group D is the most widely distributed across the territory and exhibits the highest degree of genetic redundancy. We observed high levels of heterozygosity, a pattern that is consistent with previous findings in clonally propagated perennial species [20, 21]. The pervasive heterozygosity observed in moso bamboo is most likely driven by its long-term clonal propagation, which suppresses meiotic recombination and thus preserves ancestral heterozygous genotypes across generations. In addition, somatic mutations accumulated over time within and among clonal lineages contribute further heterozygous variants [22].
Recent progress in high-throughput sequencing technologies has greatly accelerated the management and utilization of plant germplasm resources by enabling precise and efficient detection of genetic variation at the genome-wide scale. However, with the continued expansion of germplasm repositories, challenges have emerged regarding accurate identification, efficient conservation, and the efficient utilization of germplasm. To address these issues, Frankel [4] proposed the concept of a “core collection”—a representative subset of accessions that retains the greatest possible genetic diversity with the least redundancy—thereby enhancing the efficiency of germplasm use. Core collections have since become a vital tool in genetic improvement and diversity conservation, with their “representativeness” heavily dependent on well-designed sampling strategies and algorithms. Previous studies have evaluated various core collection construction methods. Thachuk et al. [23] compared algorithms such as the D-method, MSTRAT, PowerCore, and Core Hunter, concluding that Core Hunter outperformed others in preserving both overall genetic diversity and rare alleles, even with smaller core collection sizes [24].
In this study, we systematically evaluated seven optimization strategies implemented in Core Hunter v3 (EN-MR, EN-CE, AN-CE, AN-MR, SH, HE, and multi-objective combination) under both stratified and unstratified sampling frameworks. Across all genetic diversity metrics—including PIC, He, Ho, I, and J′—stratified sampling consistently outperformed unstratified sampling. This supports Brown’s [25] hypothesis that ecological or geographical stratification can reduce redundancy and preserve subgroup-specific variation. The two-level design of “inter-group representativeness + intra-group diversity maximization” enabled dual optimization in both structural integrity and genetic coverage of the core collection.
Among all the strategies, expected heterozygosity (HE) demonstrated the most stable performance under the stratified framework, consistent with findings in wheat by Soleimani et al. [24]. At the 20% sampling intensity, the S-HE20 collection achieved the highest genetic diversity values (He = 0.366, PIC = 0.290, I = 0.530), indicating its superiority in maintaining within-locus variability and balanced allele frequencies. By contrast, S-HE10 achieved the highest observed heterozygosity (Ho = 0.516) and evenness (J′ = 0.766) at a lower sampling intensity. These preferences differ from those observed in potato by Manrique-Carpintero et al. [26], who favored AE-MR based on MR distance, suggesting that clonally propagated species like moso bamboo tend to retain intra-locus heterozygosity rather than maximizing inter-sample divergence. In moso bamboo, the predominantly clonal mode of reproduction preserves within-locus allelic variability and mitigates allele fixation, consistent with the excess heterozygosity observed across groups (Ho > He). This mechanism explains the strong performance of the S-HE20 collection in maintaining representative genetic diversity. In contrast, EN-based distance strategies prioritize maximizing divergence among sampled individuals, which can compromise population representativeness. This reflects a key trade-off in core collection development: divergence-oriented methods capture broader genetic distances, whereas HE-based approaches better retain population-level allelic balance in predominantly clonal species.
Further validation using phenotypic data confirmed the effectiveness of the S-HE20 core collection. It preserved the mean values of the full population (MD = 0) and captured 82.32% of the total phenotypic variation (CR), outperforming S-HE10 (CR = 72.82%). It also exhibited strong performance in capturing trait extremes, as shown by high variance-related indices (VD = 80.33%, VR = 101.72%). In terms of genetic diversity, S-HE20 maintained comparable diversity levels (He, PIC) to the original collection, while significantly increasing observed heterozygosity (Ho = 0.5036 vs. 0.4754), indicating improved within-locus variability. PCA and phylogenetic tree analyses further confirmed the S-HE20 core collection retained the overall population structure the of the original collection and captured marginal and unique accessions. These findings confirm its strong representativeness and broad genetic coverage, thereby providing a valuable resource for future genetic studies and breeding programs.
Conclusions
In this study, we constructed the first core collection of moso bamboo based on whole-genome resequencing data from 432 accessions covering a broad spectrum of its distribution range. By systematically evaluating different strategies, the optimal approach was identified as stratified sampling combined with the expected heterozygosity algorithm at a 20% sampling proportion, resulting in a core collection of 84 accessions. This approach effectively reduced genetic redundancy while maximizing both genetic diversity and phenotypic representativeness. The resulting core collection provides a valuable foundation for the accurate conservation, germplasm innovation, and molecular-assisted breeding of moso bamboo.
Methods
Sample collection
Due to the strong growth and clonal propagation ability of moso bamboo, which spreads via rhizomes, care was taken to avoid sampling genetically identical individuals. To maximize genetic diversity and avoid redundancy, a grid-based random sampling strategy was adopted. Sampling sites were established at the intersections of a 150 km latitude–longitude grid across the whole distribution range of moso bamboo in China. At each site, 3–6 bamboo individuals were sampled, ensuring a minimum distance of 1 km between individuals. Given the extensive human influence on moso bamboo distribution and the difficulty of distinguishing between natural and cultivated stands, only populations that, according to local forestry bureaus and bamboo farmers, had been established for over 20 years were included. In total, 432 moso bamboo individuals were sampled, covering 16 distinct geographic regions and representing the full range of moso bamboo habitats in China (Fig. 3). Detailed information on the collected accessions is provided in Additional file 1: Table S1. A schematic flow diagram summarizing the full workflow—from field sampling to quality control, population structure inference, candidate core construction, and final core selection—is provided in Additional file 6: Fig. S1.
DNA extraction and sequencing
Fresh young leaves were collected from second-year moso bamboo plants, desiccated with silica gel, and stored at − 80 °C prior to DNA extraction. Genomic DNA was isolated following the cetyltrimethylammonium bromide (CTAB) protocol [27]. DNA quality and purity were evaluated via 0.8% agarose gel electrophoresis, and concentrations were determined using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA). Paired-end sequencing was carried out on the DNBSEQ-T7 platform (MGI Tech) at an average depth of 20 × per sample.
Variant calling and SNP Filtering
Adapter sequences and low-quality bases were trimmed from the raw sequencing reads using fqtools_plus [28]. Reads were discarded if more than 5% of the bases were “N” or if over 50% of the bases had a quality score < 20. The filtered high-quality reads were aligned to the moso bamboo reference genome [29] using BWA-MEM v0.7.12 [30] and the resulting alignments were sorted with Samtools v1.10 [31]. PCR duplicates were identified and removed using the MarkDuplicates module in Picard v2.2 (http://broadinstitute.github.io/picard). Variant calling for all 432 individuals was performed using the HaplotypeCaller tool in GATK v4.1.2 [32] with the “-ERC GVCF” option. The generated GVCF files were merged, and SNPs were jointly genotyped using GenomicsDBImport, GenotypeGVCFs, and SelectVariants modules in GATK v4.1.2. Raw SNPs were filtered using the following GATK hard-filtering criteria: QD < 2.0, MQ < 40.0, FS > 60.0, SOR > 3.0, MQRankSum < −12.5, and ReadPosRankSum < −8.0. To further improve data quality, the GATK-filtered SNPs were refined using bcftools v1.17 (https://www.htslib.org/doc/1.0/bcftools.html), removing loci with sequencing depth outside the acceptable range (mean ± SD, i.e., AVG(FMT/DP) < 10 or > 30) and retaining only biallelic SNPs (-m2 -M2). PLINK v1.90 [33] was then used to exclude SNPs with missing data rates > 10% and SnpEff v5.0 [34] was employed for SNP annotation. After filtering, 58,430,318 bi-allelic SNPs remained for downstream analyses. All command-line parameters and exact software settings used in this workflow are provided in the supplementary methods to ensure full reproducibility (Additional file 7: supplementary methods).
Population structure analysis
We first employed PLINK v1.90 [33] with the parameters “indep-pairwise 100 10 0.2” to obtain a linkage disequilibrium (LD)-pruned SNP set with a minor allele frequency (MAF) > 5%, resulting in 329,747 independent SNPs for population structure analyses. A maximum-likelihood (ML) phylogenetic tree was generated using IQ-TREE [35] with the options -m MFP -B 1000 -nt 60 and visualized in iTOL (https://itol.embl.de/). Principal component analysis (PCA) was performed in PLINK v1.90, and population structure was inferred using ADMIXTURE v1.3.0 [36] with the parameters –cv -B 100 -j 2 for K values ranging from 1 to 16. The resulting structure plots were visualized using the “pophelper” package in R (https://github.com/royfrancis/pophelper).
Genetic diversity analysis
To assess genetic diversity among the moso bamboo accessions, we calculated multiple indices, including observed heterozygosity (Ho), expected heterozygosity (He), polymorphic information content (PIC), Shannon’s diversity index (I), and genetic evenness (J′). The indices Ho, He, and PIC were computed using the R package “snpReady” (https://github.com/italo-granato/snpReady) [37], whereas I and J′ were calculated following the methods of Hennink [38] and Pielou [39], respectively.
Establishment of the core collection
For the construction of the core collections, two sampling strategies were employed: unstratified sampling, in which all accessions were treated as a single group, and stratified sampling, in which accessions were grouped based on genetic structure and phylogenetic relationships, and samples were proportionally drawn from each group. A total of seven sampling algorithms were tested under five sampling intensities (10%, 15%, 20%, 25%, and 30%) using the R package Core Hunter v3 (https://github.com/corehunter/corehunter3) [19]. The algorithms are described as follows:
- EN-MR (Entry-to-Nearest-Entry with Modified Rogers’ Distance): Aims to maximize the mean genetic distance from each selected accession and its closest neighbor within the core collection, based on Modified Rogers’ Distance (MRD). This strategy reduces redundancy and enhances intra-core diversity. MRD is sensitive to both common and rare alleles, making it suitable for detecting fine-scale genetic differentiation [19, 40].
- EN-CE (Entry-to-Nearest-Entry with Cavalli–Sforza and Edwards’ distance): Similar to EN-MR, this method also maximizes the pairwise distance within the core collection, but uses Cavalli–Sforza and Edwards’ distance (CSE), which emphasizes allele frequency differentiation and is less influenced by rare alleles, thereby reflecting broader diversity trends.
- AN-CE (Accession-to-Nearest-Entry with Cavalli–Sforza and Edwards’ distance): Minimizes the mean distance from each accession in the entire dataset to its closest representative in the core collection, using the CSE metric. This method emphasizes representativeness by ensuring that the core broadly captures the full genetic landscape [19, 40].
- AN-MR (Accession-to-Nearest-Entry with Modified Rogers’ Distance): A variant of AN-CE that uses MRD as the distance metric. It prioritizes representativeness while placing more emphasis on capturing subtle allelic variations, including low-frequency alleles.
- SH (Shannon’s diversity index): Selects accessions to maximize overall allelic richness by weighting rare alleles more heavily. This index increases when alleles are evenly distributed and uniquely represented, making it effective in preserving low-frequency alleles and maximizing diversity breadth [23, 41].
- HE (Expected heterozygosity): Aims to maximize the expected heterozygosity across loci, reflecting within-locus genetic variability. HE gives equal weight to all loci and favors collections that retain heterozygosity and avoid allele fixation, thus promoting overall genetic variability [21, 42].
- Multi-objective optimization: Combines EN-MR and SH objectives with equal weighting (0.5), balancing pairwise genetic dissimilarity and allelic richness to obtain core collections with both structural and allelic diversity.
In total, 70 candidate core collections were generated (two grouping strategies × seven sampling methods × five sampling intensities). All subsets were constructed using the sampleCore() function with specified objective parameters and were subsequently evaluated for their genetic and phenotypic diversity.
Evaluation of the core collections
To evaluate the representativeness of the candidate core collections constructed under different strategies, a two-step evaluation framework was applied. In the first stage, genetic diversity (Ho, He, PIC, I) and genetic evenness (J′) of all 70 candidate cores was evaluated. Candidate core collections with relatively high diversity were retained. In the second stage, the selected core collections were further evaluated for phenotypic representativeness using 15 traits: diameter at breast height (DBH), basal diameter (BD), total culm height (TCH), total number of nodes (TNN), height to the first branch (HFB), number of nodes below the first branch (NNFB), internode length (IL), crown width (CW), average culm base wall thickness (ACWTB), average wall thickness at breast height (AWTH), culm cavity diameter (CCD), leaf thickness (LT), leaf area (LA), leaf length (LL), and leaf width (LW). These traits were chosen for their strong association with genetic variation, their relevance to growth, morphology, and functional characteristics of moso bamboo, and their importance to its economic value. Phenotypic evaluation involved four statistical indicators: Mean Difference Percentage (MD) = (St/n) × 100%, where St represents the number of traits with significant differences (α = 0.05) in t-tests between the core and original collections, and n denotes the total number of traits. Variance Difference Percentage (VD) = (SF/ n) × 100%, where SF is the number of traits with significant differences (α = 0.05) in F-tests between the core and original collections. Coincidence Rate of Range (CR) = \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{1}{n}\sum_{i=1}^{n}\frac{{R}_{c(i)}}{{R}_{I(i)}}$$\end{document} × 100%, where Rc(i) and RI(i) are the range of the i-th trait in the core and original collections, respectively. Coefficient of Variation Change (VR) = \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{1}{n}\sum_{i=1}^{n}\frac{{CV}_{c(i)}}{{CV}_{I(i)}}$$\end{document} × 100%, where CVc(i) and CVI(i) are the coefficient of variation for the i-th trait in the core and original collections, respectively [43]. An effective core collection should meet the criteria of MD < 20% and CR > 80%. In addition, a higher CR, VD, and VR, along with a lower MD, indicate a stronger representative quality of the core collection [43]. Based on this two-step evaluation framework integrating genetic and phenotypic data, the optimal core collections were identified for subsequent validation analysis.
Validation of core collections
To verify the quality of the optimal core collections, we performed t-tests to compare it with the original collection in terms of genetic diversity indices (Ho, He, PIC, I), genetic evenness (J′) and 15 phenotypic traits. For each phenotypic trait, we assessed differences in maximum value, minimum value, coefficient of variation (CV), and mean. Principal component analyses (PCA) on genotypic and phenotypic datasets were carried out to examine the structural composition of the core collection.
Supplementary Information
Supplementary Material 1. Supplementary Material 2. Supplementary Material 3. Supplementary Material 4. Supplementary Material 5. Supplementary Material 6. Supplementary Material 7.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Bonafide Research. (2024, February 29). China Bamboo Market Overview, 2029. Retrieved from https://www.bonafideresearch.com/product/6402154204/china-bamboo-market.
