A comprehensive analysis of two Chinese cucumber genomes and a mutant population as resources for precision breeding
Jiaxi Han, Jingwei Wei, Weiliang Kong, Weili Miao, Lidong Zhang, Yuhe Li, Jiawang Li, Xin Li, Tao Lin, Hongyu Huang

TL;DR
This study provides high-quality genomes for two Chinese cucumber types and a mutant library, aiding in understanding their evolution and improving breeding.
Contribution
The study presents high-quality genomes for two cucumber types and demonstrates effective mutagenesis for gene discovery.
Findings
Comparative analysis revealed structural variants between North and South China cucumber types.
A gene encoding chlorophyllide a oxygenase (CsCAO) was identified using EMS mutagenesis.
The study highlights the potential of forward genetics in cucumber breeding.
Abstract
Cucumis sativus L., commonly known as cucumber, is an important vegetable crop worldwide, with China as the largest producer, particularly of the North and South China types. While extensive genomic research has focused on the North China type, especially the Chinese Long 9930, studies on the South China type remain limited. In this study, we assembled high-quality genomes of two widely cultivated and representative parent varieties: S36 (North China type) and H19 (South China type), and conducted mutagenesis analyses. Comparative genome analysis revealed a large number of structural variants between two North China types and two South China types, with many of the affected genes showing strong homology to known functional loci, potentially contributing to phenotypic divergence. We also constructed an EMS mutant library through the mutagenesis of S36 and identified a gene that encodes…
Table 1
| Genome feature | S36 | | | H19 | |
|---|---|---|---|---|---|
| Size (Mb) | 262.9 | 224.8 | 240.1 | 255.8 | 247.1 |
| Year | 2024 | 2019 | 2020 | 2024 | 2020 |
| Number of chromosomes | 7 | 7 | 7 | 7 | 7 |
| Number of contigs | 2241 | 174 | 926 | 1946 | 851 |
| Contig N50 (Mb) | 24.1 | 8.9 | 2.1 | 22 | 5.3 |
| Number of genes | 27 852 | 24 714 | 25 167 | 27 536 | 25 382 |
| Number of gaps | 6 | 86 | 175 | 4 | 91 |
| Repeat content (%) | 47.30 | 32.50 | 37.20 | 46.25 | 37.70 |
| GC level (%) | 35.50 | 32.56 | 32.51 | 35.09 | 32.53 |
| BUSCOs (%) | 98.8 | 95.5 | 96.6 | 98.7 | 97.6 |
Funding
- Construction of Beijing Science and Technology Innovation and Service Capacity in Top Subjects
- 111 Project (funder DOI: 10.13039/501100013314)
- Tianjin Major Project for Seed Industry
- National Key Research and Development Program of China (funder DOI: 10.13039/501100012166)
- State Key Laboratory of Vegetable Biobreeding
Taxonomy
Topics: Advances in Cucurbitaceae Research · Genetic diversity and population structure · Chromosomal and Genetic Variations
Introduction
Cucumber (Cucumis sativus L.) is a key vegetable crop and a valuable model for genetics and genomics research in Cucurbitaceae [1]. Advances in the high-throughput sequencing technologies have accelerated research on cucumbers, particularly on the genetic variants, such as single-nucleotide polymorphisms (SNPs) and small insertions/deletions (InDels), most of which have been identified mainly through the alignment of short sequencing reads to a reference genome [2–4]. However, a single reference genome cannot fully represent the sequence diversity within a species, a key limitation that has hindered the identification of larger structural variants (SVs), which play crucial roles in genome evolution and determine key agronomic traits in crops [5–8]. Therefore, a pan-genome study is essential to comprehensively examine structural variations in the cucumber genome. Insights from a pan-genome study can advance research on cucumber domestication and genes related to vital agronomic traits [5, 9–13].
An ethyl methanesulfonate (EMS) mutant population serves as a valuable tool for identifying new genes, understanding gene functions, and exploring the molecular mechanisms underlying key agronomic traits [14–16]. Genome assembly, a crucial tool in plant breeding and crop improvement, can be used in combination with an EMS mutant library to discover genes and develop novel breeding materials through forward genetic approaches [17]. For example, one study constructed an EMS mutant population of Chinese cabbage from a doubled haploid inbred line, A03, and successfully identified two chloroplast-associated genes through forward genetics with the sequenced and assembled A03 genome [15]. Similarly, a high-quality, gap-free telomere-to-telomere genome of the watermelon inbred line G42 was assembled, and an EMS mutagenesis protocol was established that generated 48 monogenic phenotypic mutations. Through these mutations, the study identified two genes responsible for elongated fruit shape and male sterility (ClMS1), both of which are caused by a single base change from G to A [16]. Many genes controlling key agronomic traits in cucumber, such as fruit length, round leaves, and dwarf plants with short internodes, have been identified and functionally characterized from EMS mutant populations [18–22]. However, functional gene studies in cucumber that combine EMS mutagenesis with whole-genome resequencing to catalog the induced variants are lacking.
Cucumis sativus has been categorized into four groups: Indian, Eurasian, East Asian, and Xishuangbanna, and it was introduced to China about 2000 years ago [23, 24]. Currently, the North China and South China types are the most widely cultivated cucumber varieties in the country [25]. The North China type is characterized by long fruit, dense warts, and deep and uniformly colored pericarps [26], whereas the South China type typically has short fruits with few warts. However, high-quality genome assemblies of South China type cucumber varieties remain unavailable, and EMS mutagenesis studies on North China type cucumbers have yet to be conducted. To address these gaps, we selected two cucumber varieties with strong commercial performance and comprehensive agronomic traits: S36 (North China type) and H19 (South China type). These varieties are widely used as parental lines in commercial F1 hybrids and have become the core germplasm in cucumber breeding. Considering the importance of an accurate and complete reference genome assembly for genetic and genome-wide studies within species [27], we performed the sequencing of S36 and H19 and assembled two high-quality cucumber genomes, which can serve as a reference for analyzing gene mutations and structural variations. In addition, we conducted EMS mutagenesis to enhance certain traits of S36. We investigated differences in their genomic structure and function by comparing their genomes. By precisely identifying the mutation sites in the EMS mutants and accurately mapping them to the genome, we discovered new genes that regulate cucumber agronomic traits and uncovered the underlying regulatory mechanisms. These findings align with current breeding needs and support the development of an intelligent breeding system based on genomic data.
Overall, this study constructed high-quality genome assemblies of S36 and H19, representing the North China and South China type cucumber varieties, respectively, and conducted a comparative genomic analysis. Type-specific SVs were identified, highlighting genomic differentiation between the two groups. Additionally, EMS mutagenesis was performed on S36 to generate mutants for the identification of genes associated with important agronomic traits, thereby providing genetic resources for the molecular improvement of cucumber.
Results
Genome assemblies of North China type cucumber S36 and South China type cucumber H19
S36, a well-known North China type cucumber, is characterized by its long fruits, dense warts, and lustrous green pericarp. In contrast, H19, a representative South China type, produces short fruits with sparse large warts and uneven pericarp coloration, reflecting pronounced phenotypic divergence in fruit and pistil traits (Fig. 1A and B). De novo genome assemblies were generated using 51.54 Gb PacBio HiFi data for each variety, achieving approximately 196× genome coverage (Tables S1 and S2). The initial assembly of S36 produced 2241 contigs with a contig N50 of 24.1 Mb, while the H19 assembly yielded 1946 contigs with a contig N50 of 22.0 Mb. Subsequent scaffolding was performed using 50.51 Gb of chromosome conformation capture sequencing (Hi-C) data (nearly 192× genome coverage), and chromatin interaction maps generated via Juicebox confirmed complete pseudomolecule organization. This process resulted in 1634 scaffolds for S36 (scaffold N50: 37.24 Mb) and 1433 scaffolds for H19 (scaffold N50: 34.60 Mb). Finally, 262.93 and 255.79 Mb of the assembled sequences were successfully anchored to the seven chromosomes of the S36 and H19 genomes, respectively (Fig. S1). Genome annotation identified 124.36 Mb (47.3%) of repetitive sequences in the S36 genome and 118.31 Mb (46.25%) in the H19 genome (Tables S3 and S4). Additionally, the genomes exhibit 35.50% and 35.09% GC content, respectively. Both repetitive sequence proportion and GC content are higher than those in other reported cucumber genomes, such as 9930 (v3), XTMC, and Cu2 genomes (Table S5) [8, 27]. Following repeat masking, 27 852 and 27 536 protein-coding genes were predicted in the S36 and H19 genomes, respectively, with average gene lengths of 3270 and 3311 bp, and 5.2 exons per gene on average (Table S6).
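The coverage and contiguity figures above follow from simple arithmetic, which the short Python sketch below reproduces. It assumes that coverage was computed against the 262.93 Mb anchored S36 assembly (the text does not state the denominator) and that 1 Gb = 1000 Mb. The function names and the toy contig lengths are illustrative only and are not taken from the study.

```python
def fold_coverage(data_gb, genome_mb):
    """Sequencing depth = total sequenced bases / genome size."""
    return data_gb * 1000.0 / genome_mb


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("empty length list")


# Data volumes from the text; the denominator (262.93 Mb) is an assumption.
hifi_depth = fold_coverage(51.54, 262.93)
hic_depth = fold_coverage(50.51, 262.93)

# Hypothetical toy contig lengths in Mb, NOT the real S36 contigs.
toy_n50 = n50([40, 30, 20, 5, 3, 2])

print(f"HiFi depth ~{hifi_depth:.0f}x, Hi-C depth ~{hic_depth:.0f}x, toy N50 = {toy_n50} Mb")
```

Under these assumptions the computed depths round to the 196× and 192× values reported in the text.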
Phenotypic and genomic divergence between S36 and H19 genomes. (A) Fruit comparison between S36 (left) and H19 (right). Scale bar is 5 cm. (B) Pistil comparison between S36 (left) and H19 (right) on flowering day. Scale bar is 1 cm. (C) Syntenic collinearity analysis between S36 and H19 genomes. (D) Whole-genome architecture comparison between S36 and H19 genomes.
The quality and completeness of the S36 and H19 genome assemblies were assessed using multiple methods. First, 98.82% (1595 of 1614) and 98.76% (1594 of 1614) of conserved embryophyta BUSCO genes were identified in the S36 and H19 genomes, respectively (Tables S7 and S8). Second, only six gaps between contigs remained in the S36 assembly and four in the H19 assembly (Tables S9 and S10), fewer than in previously published cucumber genomes (Table 1) [8, 27]. Third, 85.51% of genes in S36 and 86.10% in H19 were functionally annotated using integrated data from multiple databases, including NCBI nonredundant (NR), Gene Ontology (GO), SwissProt, InterProScan, The Arabidopsis Information Resource, and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Tables S11 and S12). These results collectively confirm the high completeness and annotation quality of both genome assemblies.
To elucidate structural divergence, a comparative synteny analysis identified 213 conserved homologous blocks between the S36 and H19 genomes (Fig. 1C). Notably, 221 unaligned regions spanning 15.25 Mb (3.18%) in S36 and 193 unaligned regions covering 8.12 Mb (5.90%) in H19 were detected (Table S13). Further investigation characterized gene distribution and transposable element (TE) dynamics across all seven chromosomes using 1-Mb windows (Fig. 1D).
Structural variations between the North China type and South China type cucumber genomes
Structural variations play crucial roles during plant domestication, contributing to differences in the key characteristics [8]. To explore the genetic basis of these variations, the genomes of two North China type cucumber varieties (9930v3 and XTMC) and two South China type cucumber varieties (H19 and Cu2) were aligned to the S36 genome to identify genetic variants [8, 27]. After filtering out SVs smaller than 50 bp and those located in assembly gaps, 2391 SVs (approximately 22.45 Mb) were identified in the North China type genomes and 3765 SVs (approximately 20.28 Mb) in the South China type genomes. Additionally, 1700 SVs (~73.48 Mb) were specific to the North China type and 3074 (~61.62 Mb) were specific to the South China type (Fig. 2A). Most SVs were less than 5 kb in length (Fig. 2B), and the majority were located in non-TE regions (Fig. 2C). Moreover, these SVs significantly overlapped with 5219 genes in the S36 genome. Among these, 255 and 353 genes were specifically affected by breakpoints of North China type and South China type SVs, respectively. Notably, six of these genes showed high homology to the previously reported functional genes (Fig. 2D) [28–33]. In addition, polymerase chain reaction (PCR) validation of common type-specific SVs confirmed 91.9% of cases, further supporting the accuracy of SV detection (Fig. S5 and Table S14).
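The filtering and type-specific classification described above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: it assumes SV calls are already expressed as intervals on the S36 reference and treats two calls as the same SV only when their coordinates match exactly, which is a simplification. The `SV` record, the function names, and the toy calls are hypothetical.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int  # 0-based start on the S36 reference
    end: int    # exclusive end
    kind: str   # e.g. "DEL" or "INV"


MIN_SV_LEN = 50  # size threshold stated in the text


def passes_filters(sv, gaps):
    """Keep SVs of at least 50 bp that do not overlap any assembly gap."""
    if sv.end - sv.start < MIN_SV_LEN:
        return False
    return not any(
        chrom == sv.chrom and start < sv.end and sv.start < end
        for chrom, start, end in gaps
    )


def partition(north_calls, south_calls, gaps):
    """Return (north-specific, shared, south-specific) sets of filtered SVs."""
    north = {sv for sv in north_calls if passes_filters(sv, gaps)}
    south = {sv for sv in south_calls if passes_filters(sv, gaps)}
    return north - south, north & south, south - north


# Hypothetical toy data, not calls from the study.
gaps = [("chr1", 1000, 1100)]
north_calls = [
    SV("chr1", 100, 400, "DEL"),
    SV("chr1", 1050, 1300, "DEL"),  # overlaps the gap, removed
    SV("chr2", 10, 30, "DEL"),      # shorter than 50 bp, removed
    SV("chr3", 500, 900, "INV"),
]
south_calls = [SV("chr1", 100, 400, "DEL"), SV("chr4", 0, 5000, "DEL")]
north_only, shared, south_only = partition(north_calls, south_calls, gaps)

# The counts reported in the text are consistent with one shared set.
shared_from_north = 2391 - 1700
shared_from_south = 3765 - 3074
```

Subtracting the type-specific counts from the totals gives 691 SVs from either side, which is consistent with a single set of SVs shared by both types.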
Structural variations between North China type and South China type genomes. ‘north’ represents North China type genomes. ‘south’ represents South China type genomes. (A) The number of structural variations between North China type and South China type genomes. (B) The length distribution of SVs among four cucumber genomes. (C) The number of SVs located in TE regions. (D) The distribution of SVs in the seven chromosomes and genes affected by North China type and South China type SVs, respectively. (E) The sequence composition of North China type and South China type SVs.
RNA-seq data from 9930 and H19 were analyzed, revealing 7235 differentially expressed genes (DEGs). Of these, 64 and 101 DEGs overlapped with genes affected by SV breakpoints in the North China type and South China type genomes, respectively (Fig. S2). Functional enrichment analysis indicated that these genes function primarily in cellular processes, with mainly intracellular and cytoplasmic roles in the North China type genomes and catalytic roles in the South China type genomes (Figs S3 and S4). The sequence composition of SVs also differed markedly between the North China type and South China type genomes, particularly in terms of small RNA and DNA families (Fig. 2E).
Disease resistance in plants is often associated with SVs in the form of tandem arrays of resistance (R) genes [34, 35]. To identify R gene analogs in the previously reported North China type and South China type genomes, a homology analysis of these genes was conducted. A total of 881 R genes were identified in the S36 genome (Fig. S6), of which 42 (4.77%) overlapped with SVs. Among these, 14 and 22 genes were affected by SVs specific to the North China type and South China type genomes, respectively. Additionally, R genes were unevenly distributed across the chromosomes (Fig. 2B). Subsequently, differential SNPs and variant genes were analyzed between the genomes. The South China type genomes had more SNPs than the North China type genomes (928 493 versus 518 925), and the majority of these SNPs were classified as modifier-effect variants (Tables S15–S17).
Comparison between the genomes of S36 and two North China type cucumber varieties
The S36 genome was compared with the 9930 (v3) and XTMC genomes to identify SVs [8, 27], revealing overall collinearity among the three genomes. Comparative genomic analysis identified 148 syntenic blocks between S36 and 9930, and 109 blocks between S36 and XTMC (Fig. 3A). Relative to the 9930(v3) (210.94 Mb) and XTMC (204.50 Mb) assemblies, S36 included an additional 54.35 and 56.16 Mb of anchored sequences, respectively, predominantly localized to pericentromeric regions (Fig. S7). As a result, S36 chromosomes ranged from 26.45 to 46.14 Mb in length, exceeding those of 9930, which ranged from 22.47 to 40.88 Mb [27]. Next, the telomere-specific and centromere-specific repeats were screened in the genomes of S36 and the two North China type cucumber genomes. The results indicated the S36 genome possesses seven telomeres, more than the number of telomeres in any other genome (Table S18). The centromeres in all chromosomes of the S36 genome were significantly larger (Fig. 3B), indicating that the centromere assembly in the S36 genome is probably of higher quality. Long terminal repeats (LTRs), the most abundant subgroup of TEs in cucumber, were also compared across the three genomes. The results showed that the number of newly inserted LTRs in the S36 genome is more than twice that in the other genomes (Fig. 3C). A further comparison of the genome components revealed that the proportion of TEs and total TE sequence length are significantly higher in the S36 genome (Fig. 3D). These data suggest that the S36 genome has undergone a stronger LTR expansion compared with the other cucumber genomes, possibly due to tissue culture [36, 37].
Comparison of S36 with the North China type cucumber genomes. (A) The collinearity map of the S36 and two North China type (9930 and XTMC) cucumber genomes. (B) Centromere identification results for three North China type (S36, 9930, and XTMC) cucumber genomes. Dashed lines denote centromeric regions. (C) Distributions of insertion times dated using intact LTRs in three North China type (S36, 9930, and XTMC) cucumber genomes. (D) Percentage and size of genomic components in three North China type (S36, 9930, and XTMC) cucumber genomes.
Whole-genome resequencing and screening of EMS populations
A mutant library of S36 was constructed through EMS mutagenesis. EMS treatment of cucumber seeds yielded 1785 M2 generation lines, of which 516 were cultivated, leading to the identification of 45 stably inherited mutants. These mutants exhibited notable differences from the wild type in various phenotypes, such as plant height, leaf shape and color, fruit length, internode length, and the density and size of warts (Fig. 4A–F). To further investigate the genetic basis of these phenotypic variations, whole-genome resequencing analysis was performed on nine representative mutants selected from the 45 identified mutants. The nine mutants yielded a total of 6.77 Gb of data, with the amount of data generated by an individual mutant ranging from 583.47 Mb (ES23) to 1.18 Gb (ES226), averaging 752.60 Mb, which corresponded to a sequencing depth of 29.24× (Table S19). After screening and filtering the SNPs using a conventional approach, we excluded those in the TE regions from further analysis, retaining 80 536 SNPs distributed almost uniformly on each chromosome (Fig. 5E). Each mutant had an average of 57 320 SNPs (Fig. 5A). On average, 5410 genes per mutant were affected by these SNPs (Fig. 5B). The number of SNPs per mutated gene varied greatly (Fig. 5C). In addition, the reference genome annotation revealed that 57.11% of 141 064 SNPs were located in the gene space, covering 17 318 genes (62.18% of all genes). These mutations were concentrated more in upstream regions (37.99%) than in downstream regions (19.64%) (Fig. 5D). Analyzing the SNP features of each mutant revealed the predominance of C/G to T/A and T/A to C/G mutations (30.04% and 29.88% on average, respectively) (Fig. 5F). The SnpEff program was used to predict the function of each mutated gene [38].
The majority of mutations (79.69% on average) were classified as modifier-effect variants; however, only a few of these (approximately 1.03%) were expected to have a strong effect (Table S20) and were located mainly in the upstream and downstream regions (Table S21).
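The mutation-type ratios reported above group each SNP with its reverse-complement change, so that G>A and C>T both count as C/G to T/A. The Python sketch below illustrates one way to do that grouping. The function names and the toy SNP list are hypothetical, and the study's own tallying procedure is not described in the text.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Collapse a SNP to a strand-independent class, e.g. G>A becomes 'C/G>T/A'."""
    if ref in "AG":  # report every change from the pyrimidine strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}/{COMPLEMENT[ref]}>{alt}/{COMPLEMENT[alt]}"


def class_fractions(snps):
    """Fraction of SNPs in each substitution class; snps = [(ref, alt), ...]."""
    counts = Counter(substitution_class(ref, alt) for ref, alt in snps)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Hypothetical toy SNPs, not data from the mutant library.
toy_snps = [("G", "A"), ("C", "T"), ("A", "G"), ("T", "C"), ("C", "A")]
fractions = class_fractions(toy_snps)
```

Applied per mutant, a tally of this kind would give per-mutant ratios comparable to the averages reported in the text.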
Phenotypic investigation of EMS mutant library. (A) Comparison of plant height between mutants and S36. Scale bar is 10 cm. (B) Leaf comparison between mutants and S36. Scale bar is 10 cm. (C) Comparison of flower colors. Scale bar is 2 cm. (D) Fruit comparison between mutants and S36. Scale bar is 5 cm. (E) Internode comparison between mutants and S36. Scale bar is 10 cm. (F) Magnified view of warts of mutants and S36. Scale bar is 1 cm. (G) Comparison of petiole-stem angle between mutant and S36. Scale bar is 10 cm. All cucumber materials were observed after three months of growth.
Characterization of the EMS-induced mutations in S36. (A) Distribution of SNPs in each mutant. (B) Distribution of the number of genes with mutations in each mutant. (C) Distribution of SNPs in each gene. (D) Distribution of SNPs with predicted effects on gene functions. (E) Mutation distribution and density for the mutations identified in mutants on seven chromosomes. (F) Ratios of different mutations identified in mutants.
Fine mapping and functional verification of CsCAO
Leaf color mutants serve as a valuable reference for studying the genetic mechanisms underlying plant photosynthesis, chlorophyll biosynthesis, development, degradation, and tetrapyrrole synthesis, among others [39]. After 3 months of growth, phenotypic observations revealed that the mutant (ES299) exhibited yellow-green stems, petioles, ovaries, fruits, and leaves, which differed from the wild-type S36 (Fig. 6B). The total chlorophyll and carotenoid contents of leaves, petioles, stems, ovaries, and exocarps in ES299 were significantly lower than those in S36; however, the ratio of chlorophyll a to chlorophyll b was relatively high in the mutant (Fig. 6C–E). Additionally, ES299 showed weaker growth potential than S36 during the same growth period, as evidenced by a noticeable decrease in its plant height, stem diameter, internode length, and leaf number (Fig. S8). Using the green leaf inbred line G35 and the yellow-green leaf ES299 mutant as parents in the hybridization experiment, the proportions of progenies exhibiting green and yellow-green leaves aligned with the expected segregation ratios of 3:1 and 1:1 in the F2 and BC2 populations, respectively, suggesting that the mutant trait was governed by a single recessive gene (Table S22).
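Agreement with the expected 3:1 and 1:1 ratios is commonly assessed with a chi-square goodness-of-fit test, sketched below in Python. The text does not state which test the authors applied, and the counts used here are hypothetical placeholders (the observed counts are in Table S22). The function name and the constant are illustrative.

```python
def chi_square(observed, expected_ratio):
    """Chi-square goodness-of-fit statistic for observed counts against a ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05.
CRITICAL_0_05_DF1 = 3.841

# Hypothetical counts of (green, yellow-green) plants, NOT the study's data.
f2_counts = (152, 48)
backcross_counts = (53, 47)

chi_f2 = chi_square(f2_counts, (3, 1))
chi_bc = chi_square(backcross_counts, (1, 1))

fits_3_to_1 = chi_f2 < CRITICAL_0_05_DF1
fits_1_to_1 = chi_bc < CRITICAL_0_05_DF1
```

A statistic below the critical value means the observed counts do not differ significantly from the expected ratio, which is the pattern expected for a single recessive gene.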
Gene mapping of ES299 mutant and functional verification of CsCAO. (A1) ∆(SNP index) of all cucumber chromosomes. SNP index peak is found between 9.5 and 18.1 Mb on chromosome 6. (A2) The genotyping of recombinant plants from F2 population using the 16 markers allowed mapping the mutant gene in a 541-kb region of chromosome 6. (A3) The location and structure of Csa6G385090 in the region of 541 kb. The white boxes indicate the 5′UTR and 3′UTR positions of Csa6G385090, and the black boxes and broken lines represent the positions where exons and introns are located, respectively. (A4) Nucleotide and protein sequence alignment of Csa6G385090 in G35, S36, and ES299. (B) Phenotypic observation of S36 and ES299. From left to right: whole plant (scale bar, 20 cm), stem, petiole, ovary, fruit, and leaf (scale bar, 3 cm). (C–E) The total chlorophyll, carotenoid, and the ratio of chlorophyll a to chlorophyll b in different tissues of S36 and ES299 are determined. Values are means ± SE (n = 4). (F) Identification of TRSV::CsCAO silencing lines by qRT-PCR. Values are means ± SE (n = 3). (G) Phenotypic observation of TRSV::00 and TRSV::CsCAO. From left to right: whole plant, leaf, stem, and petiole (scale bar, 2 cm). (H–J) The total chlorophyll, carotenoid, and the ratio of chlorophyll a to chlorophyll b in different tissues of TRSV::00 and TRSV::CsCAO are determined. Values are means ± SE (n = 3). Student’s t test is used to test the significance of the data. *P < 0.05; **P < 0.01.
To further identify the candidate gene, four DNA pools (parental and progeny pools of G35 and ES299) were subjected to whole-genome resequencing. The high-quality reads were mapped to the S36 genome and combined with the parental resequencing data. Preliminary mapping results for the yellow-green leaf trait were obtained through bulked segregant analysis (BSA) using the Euclidean distance and SNP index association algorithms (Fig. 6A1). The candidate interval associated with the leaf color trait was located on chromosome 6, within the 9.5- to 18.1-Mb region. In this interval, 96 F2 recessive plants were genotyped using 16 KASP markers (Fig. 6A2). Finally, the candidate region was narrowed down to a 541-kb region flanked by the markers M13 and M16 (Fig. 6A3). This interval comprised five SNPs: four are located in intergenic regions, and one nonsynonymous SNP is located in the coding region of Csa6G385090, resulting in an amino acid change from leucine to phenylalanine at position 361 (Fig. 6A4 and Table S23). In the ES299 mutant, this L-to-F substitution occurs within the PobA domain of CsCAO, which is responsible for the catalytic activity of chlorophyllide a oxygenase. Three-dimensional structural analysis indicated that this mutation induced subtle changes in the local side-chain conformation (Fig. S9), potentially affecting substrate access to the catalytic pocket.
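The ∆(SNP index) used in the BSA step can be illustrated with a short Python sketch. It assumes the usual definition, in which the SNP index of a bulk is the fraction of reads carrying the non-reference allele and the delta is the mutant-bulk index minus the wild-type-bulk index. The window size, read depths, and function names are hypothetical, and the authors' exact parameters are not given in the text.

```python
def snp_index(alt_depth, total_depth):
    """Fraction of reads supporting the non-reference allele in one bulk."""
    return alt_depth / total_depth


def delta_snp_index(mutant_bulk, wild_bulk):
    """Each bulk is (alt_depth, total_depth) at one site."""
    return snp_index(*mutant_bulk) - snp_index(*wild_bulk)


def window_means(sites, window_bp, chrom_len):
    """Mean delta per non-overlapping window; sites = [(position, delta), ...]."""
    n_windows = -(-chrom_len // window_bp)  # ceiling division
    sums = [0.0] * n_windows
    counts = [0] * n_windows
    for position, delta in sites:
        index = position // window_bp
        sums[index] += delta
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical read depths at three sites on one toy chromosome.
sites = [
    (200_000, delta_snp_index((10, 40), (12, 40))),
    (1_500_000, delta_snp_index((38, 40), (13, 40))),
    (1_700_000, delta_snp_index((40, 40), (14, 40))),
]
means = window_means(sites, window_bp=1_000_000, chrom_len=3_000_000)
```

For a recessive mutation, the mutant bulk approaches an index of 1 near the causal site, so the candidate interval appears as a peak in the windowed delta values.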
Csa6G385090 encodes a chlorophyllide a oxygenase (CsCAO) that catalyzes the conversion of chlorophyll a into chlorophyll b. Three CsCAO-silenced lines (TRSV::CsCAO) with a 70–80% reduction in CsCAO mRNA levels were obtained from 50 plants infected with TRSV-CsCAO using the TRSV-VIGS system (Fig. 6F). The leaves, petioles, and stems of the TRSV::CsCAO lines exhibited a lighter coloration than those of the control lines (TRSV::00) (Fig. 6G). Furthermore, the TRSV::CsCAO lines exhibited significantly lower contents of chlorophyll in the leaves, petioles, and stems but a higher ratio of chlorophyll a to chlorophyll b (Fig. 6H–J). In summary, the phenotypic traits of the CsCAO-silenced plants were similar to those of the ES299 mutant, indicating that CsCAO silencing inhibited chlorophyll synthesis and resulted in lighter-colored organs in cucumber. These results reinforce the utility of EMS mutant library construction for identifying candidate genes associated with visible phenotypic traits using forward genetic approaches [15]. Furthermore, these results validate the reliability of genome assemblies combined with mutant libraries for rapid functional gene identification, providing a reliable pathway for the functional characterization of genes.
Discussion
Historically, cucumber is believed to have originated in the Himalayan Mountains, with its domestication dating back to nearly 3500 years ago [23]. According to morphological characteristics and geographical distribution, it is classified into four types as follows: Indian, Eurasian, East Asian, and Xishuangbanna [23, 24]. Similarly, Chinese cucumber varieties have been classified into four subgroups—South China type, North China type, Xishuangbanna type, and Europe type; of these, the first two types have originated from the previously reported East Asian group [24, 40]. Although molecular markers have been commonly used to investigate the genetic diversity of cucumber germplasm resources in China, only a few markers or varieties have been tested to date, and a comprehensive study of cultivated Chinese varieties is lacking [24, 40, 41]. Although North China and South China type cucumber varieties exhibit significant differences in fruit length, wart size and density, and other phenotypic traits, a comparative genomic analysis between North and South types based on genomic sequences has not been reported yet.
The South China and North China types are unique cucumber varieties in China, and the market demand and consumption of both varieties remain high [25]. S36 and H19 cucumber varieties have become core germplasm in commercial cucumber breeding. Therefore, the high-quality genome assemblies of S36 and H19 can serve as a valuable resource for the identification of candidate genes related to vital agronomic traits in cucumber. In this study, we assembled the S36 and H19 genomes, both of which are of higher quality than previously published cucumber genomes [8, 27]. However, a gap-free assembly of the cucumber genome could not be achieved here and requires additional research. The representative genomes of the two cucumber ecotypes were then used to identify the SVs specific to each type and the genes they affect, providing insights into the classification and evolutionary pattern of cucumber varieties. Among these genes, Csachr3g0053080 (AtRWA2) and Csachr3g0053080 (AtSTY17), associated with resistance to Botrytis cinerea, and Csachr5g0045730 (CsMLO1), associated with resistance to powdery mildew disease, were found to be affected by north- and south-specific SVs. These findings suggest that the three genes may be related to the disease susceptibility of South China type cucumber varieties [28, 32, 33]. Similarly, Csachr1g0015260 and Csachr6g0025220 are highly homologous to AtNTL8 and AtUPL6, respectively, which regulate trichome formation in Arabidopsis, and may be related to the differences in fruit wart phenotype between the South and North varieties [29, 30]. Previous studies have demonstrated that waxy fruits exhibit lower surface gloss [31]. Csachr3g0010420 (CsCER1), which is involved in wax metabolism, was found to influence fruit skin glossiness and may contribute to the difference in glossiness between the South and North types [31, 42]. 
The discovery of these genes lays a foundation for the classification of cucumber varieties and further research on the genetic factors affecting the phenotypic differences among them. Moreover, differences in the TE composition of SVs between the genomes of North and South types suggested that TEs might have been subjected to differential selection due to differences in the environmental condition between the north and south regions. Furthermore, significant differences were observed in the chromosomal distribution and number of R genes affected by SVs, suggesting that these genes might have undergone differential selection due to differences in individuals’ dietary habits between the northern and southern regions [24]. While the identified SVs serve as a valuable resource for cucumber breeding, further functional validation of these SVs is necessary to understand their precise role in phenotypic trait regulation. The expression of these SVs might have also been influenced by the genetic background of different cucumber varieties, warranting further investigation in this direction. Moreover, in addition to the North and South China type cucumber varieties, other cucumber varieties exist, which contain favorable genes. Introgression of these genes in crop breeding can enrich the genetic diversity of the South China and North China type cucumber [40].
This study considered S36 as the reference genome and performed whole-genome resequencing on nine EMS mutants. The analysis yielded systematic phenotypic data of the mutants, along with the associated genomic information, which can advance functional genomics research in cucumber. Additionally, the mutants generated in this study can be used directly in cucumber breeding programs to introduce favorable traits, such as high flowering rate and pathogen resistance. This study also identified and annotated various TEs, including newly inserted, intact LTRs, in the S36 genome. Moreover, the nine mutant lines were selected for sequencing analysis, which enabled the accurate and efficient detection of mutations. Through whole-genome sequencing of these nine mutants, we identified 80 536 SNPs, located in both coding and noncoding regions, which can serve as a valuable resource for future functional studies [43].
A high-quality genome, combined with a large corresponding mutant library, facilitates the identification and cloning of candidate genes [44, 45]. This strategy has been used successfully in watermelon [16] and Chinese cabbage [15] to clone many genes from EMS libraries. In plants, chlorophyll is an essential pigment that plays a crucial role in energy transfer and transformation in photoreactions by absorbing solar energy and binding to various Chl-binding proteins [46]. In this study, we utilized the yellow-green leaf mutant ES299 to investigate the genetic basis of chlorophyll biosynthesis. Leveraging the S36 genome and through whole-genome resequencing of 9 mutant plants, we identified CsCAO as a candidate gene for chlorophyll synthesis from the EMS mutant library. CsCAO mutants demonstrated disrupted chlorophyll biosynthesis and thus a lighter coloration of their organs. In summary, the new reference genome, together with the EMS mutant library, serves as a powerful tool for the genetic analysis and enhancement of cucumber traits, as well as a reliable pathway for determining gene functions.
Through the genome assembly and mutation analysis of South China and North China type cucumber varieties, the present study sheds light on the classification and evolution of Chinese cucumber varieties. The high-quality assembly and annotation of the S36 genome enhanced the accuracy and efficiency of mutation detection, underscoring the usefulness of the forward genetic approach as a valuable tool for functional genomics research in cucumber. Using this approach to identify genes associated with desirable agronomic traits can expedite advancements in cucumber breeding. Finally, the variants identified in this study can facilitate improvements in key agronomic traits of not only cucumber but also other closely related Cucurbitaceae crops.
Materials and methods
Plant materials and sequencing
The S36 cucumber variety was derived from the commercial cultivar ‘Ke Run 99’ through 10 generations of continuous self-pollination. Two commercial varieties, ‘Wei Lai 103’ and ‘Jin Mei Han Yu’, were crossed, and the hybrid offspring were self-pollinated for eight generations to obtain the stable material H19. Both were planted in the Tianjin Academy of Agricultural Sciences (TAAS). The leaves of S36 and H19 were used to construct the PacBio HiFi library for genome sequencing [47]. The Hi-C libraries were built according to the Proximo Hi-C plant protocol with the restriction enzyme MboI [48]. The sequencing libraries were sequenced using the PacBio Sequel II/IIe sequencing platform or the Revio platform at Berrygenomics Company. Optical equipment was used to convert the raw data into the initial output file, Polymerase reads. These Polymerase reads were then subjected to basic filtering using the instrument’s built-in software, SMRT Link, and were subsequently converted into Subreads.bam (Sequel II/IIe CLR/CCS mode), Reads.bam (Sequel IIe CCS mode), or hifi_read.bam (Revio sequencing).
De novo assembly of the S36 and H19 genomes
Hifiasm, an efficient open-source de novo assembler specifically designed for HiFi reads, was used to extract overlaps and build the assembly graph. This approach enabled the separation of distinct alleles or different copies of segmental duplications containing a single segregating site [49]. Preliminary contig-level assemblies of approximately 387 Mb for S36 and 372 Mb for H19 were obtained, and genome continuity was evaluated based on the contig N50 length. Hi-C reads were aligned to the enhanced contigs using Juicer (V1.5) for feature analysis and data extraction [50]. The output was processed with 3D-DNA to correct misjoins and to order, orient, and scaffold the sequences, resulting in an improved assembly [51]. Finally, Juicebox was used to visualize and interactively curate the assembly by manually adjusting chromosome boundaries and fixing minor errors [52]. The completeness of the assembled genomes was evaluated using BUSCO (V5.2.1) with the embryophyta_odb10 database [53, 54]. Synteny analysis was conducted using the MUMmer package (V3.23) to compare the assembled genomes with the 9930 (v3) genome [55]. First, NUCmer was used to align the genomes with the parameters -mum -mincluster 100. Subsequently, delta-filter was used to filter the alignment file generated by NUCmer with the parameters -l 1000 -1. Finally, a dotplot was generated with mummerplot [56], and the chromosomes of S36 and H19 were renamed according to the order of the 9930 chromosomes. Furthermore, SyRI (V1.5.4) [57] was used to perform a genome-wide comparison of SVs between the assembled genomes and the 9930 genome.
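Contig N50, the continuity measure mentioned above, is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. A minimal Python sketch of that calculation follows; the function name and the toy lengths are illustrative assumptions and are not taken from the S36 or H19 assemblies.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Toy contig lengths for illustration only (hypothetical values).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_lengths))
```

In practice the lengths would be read from the assembly FASTA or its index file rather than typed in by hand.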
Structural annotation and functional annotation of genes
The genome repeats were identified and annotated using RepeatModeler (V2.0.1) [58] and RepeatMasker (V4.1.0) [59] based on a custom repeat sequence library. Gene prediction was performed on the masked genome sequences using three complementary approaches: homologous prediction, RNA-seq-based prediction, and de novo prediction [15]. For RNA-seq-based prediction, transcriptomic data were generated from mixed samples of S36 and H19 (including leaves, female flowers, male flowers, fruit, and tendrils), 9 near-isogenic line materials (pericarp), and 10 different stages and sites of 9930 (including leaves, tendrils, roots, ovaries, female flowers, male flowers, and stems) [60]. These data were analyzed using Trinity and PASA to predict genes [61, 62]. For homologous prediction, protein sequences from 10 Cucurbitaceae plants were retrieved from the Cucurbit Genomics Database (http://cucurbitgenomics.org), and ProtHint was used to align and splice predicted genes to a reference protein database [63]. For de novo prediction, the processed transcriptomic and protein data were mapped to the S36 genome, and protein-coding genes were predicted using AUGUSTUS [64] and MAKER [65]. The predicted annotations were validated using in-house Python scripts to ensure the correct placement of start and stop codons, and genes containing internal stop codons were removed. Finally, genes with coding sequences (CDSs) shorter than 300 bases were filtered out to obtain the final gene structural annotation. The functions of the S36 and H19 protein-coding genes were predicted by searching the NCBI non-redundant (NR), SwissProt, and Arabidopsis databases with Diamond (V0.9.24.125) [66]. Additionally, protein domain and gene ontology term annotations were performed using InterProScan (V5.59) [67], along with the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://eggnog-mapper.embl.de/) Automatic Annotation Server [68].
DRAGO2, a tool of PRGdb (V4.0), was used to predict R genes in the S36 genome [69].
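The in-house Python scripts used for annotation validation are not shown in the paper. The sketch below is only one way the described checks (start codon, terminal stop codon, no internal stop codons, and a minimum CDS length of 300 bases) could be expressed. The function name, the restriction to ATG start codons, and the requirement that the length be a multiple of three are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CDS_LENGTH = 300  # bases; CDSs shorter than this are filtered out


def cds_passes_filters(cds, min_length=MIN_CDS_LENGTH):
    """Return True if a CDS string passes the start/stop/length checks."""
    seq = cds.upper()
    # Assumption: a valid CDS is a whole number of codons.
    if len(seq) < min_length or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Assumption: only the canonical ATG start codon is accepted.
    if codons[0] != "ATG":
        return False
    if codons[-1] not in STOP_CODONS:
        return False
    # Reject genes with a premature (internal) stop codon.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


# Hypothetical sequences for illustration.
good = "ATG" + "GCT" * 98 + "TAA"                 # exactly 300 bases
premature = "ATG" + "GCT" * 50 + "TAG" + "GCT" * 50 + "TAA"
print(cds_passes_filters(good), cds_passes_filters(premature))
```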
Genome-wide comparison of cucumber genomes
The genotypic data of our 210 materials were filtered for 4D SNPs [70] using a custom script. These 4-fold degenerate sites were then used for phylogenetic analysis with PHYLIP [71], and the phylogenetic tree was visualized using iTOL (https://itol.embl.de). Genome data for two North China type cucumbers (9930, XTMC) and one South China type cucumber (Cu2) were retrieved from Cucumber-DB (http://www.cucumberdb.com) and CuGenDB (http://cucurbitgenomics.org). Genome-wide comparisons between the S36 cucumber genome and other cucumber genomes were conducted using minimap2 [72] and the MUMmer package (V3.23) [55]. SVs between seven cucumber genomes were identified using SyRI (V1.5.4) [57]. SnpEff was used to annotate SVs larger than 50 bp [38]. A custom script was employed to calculate the sequence types and repeat sequence content associated with the SVs. Genes within the SV regions were considered potentially impacted, and the distribution of SVs, along with the functional genes and R genes affected by these SVs, was visualized using the RIdeogram package [73]. To construct LTR libraries, LTR_Finder (V1.07) [74] was used with default settings, and LTRharvest (V1.6.1) [75] was employed with parameters ‘-minlenltr 100 -maxlenltr 7000 -motif TGCA -similar 90 -seed 20’. The results from LTR_Finder and LTRharvest were merged using LTR_retriever (V2.9.0) [76] to generate the final LTR library. Subsequently, the LTR library was combined with the TE library, and LTR insertion times were calculated using a custom script. RepeatModeler (V2.0.1) [58] and RepeatMasker [59] were used to annotate and classify repeats based on the constructed library with default parameters. The centromeres of the genome were determined using Tandem Repeats Finder (TRF) [77], telomeres were identified using quarTeT [78], and the density of genes and centromeres was calculated using a custom script. These results were then visualized using RIdeogram [73].
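The custom script used to select 4D SNPs is not described in detail. As a rough illustration of the underlying idea, the Python sketch below finds 4-fold degenerate positions within a single in-frame CDS, using the fact that in the standard genetic code a third codon position is 4-fold degenerate when the first two bases belong to one of eight codon families. The function name and example sequence are hypothetical, and mapping CDS offsets back to genome coordinates (strand, exon structure) is deliberately left out.

```python
# First two bases of the eight codon families whose third position can be
# any nucleotide without changing the amino acid (standard genetic code):
# Leu (CT), Val (GT), Ser (TC), Pro (CC), Thr (AC), Ala (GC), Arg (CG), Gly (GG).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_site_offsets(cds):
    """Return 0-based offsets of 4-fold degenerate sites in an in-frame CDS."""
    seq = cds.upper()
    offsets = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for start in range(0, len(seq) - 2, 3):
        if seq[start:start + 2] in FOURFOLD_PREFIXES:
            offsets.append(start + 2)  # third position of this codon
    return offsets


# Hypothetical CDS: ATG GCT CTA TAA
print(fourfold_site_offsets("ATGGCTCTATAA"))
```

SNPs whose positions coincide with such sites would then be retained for tree building, on the usual assumption that they evolve close to neutrally.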
EMS treatment and phenotypic investigation
Over 10 000 full, plump seeds with a 98% germination rate were selected and soaked in double-distilled water for 4 h. The seeds were then treated with a 0.4% (W/V) EMS solution (EMS purchased from Sigma) and agitated on a shaker for 12 h. Subsequently, the seeds were immersed in a 5% Na_2_S_2_O_3_ (sodium thiosulfate) solution for 2 h (detoxification), washed with tap water for 2 h, and placed in a temperature-controlled box for germination. In early spring, the seedlings were grown in a greenhouse with regular management in TAAS. Then, they were transplanted into the greenhouse, and phenotypic data were recorded regularly. During flowering, female flowers with more than 15 nodes were selected for strict self-pollination. After fruit maturation (approximately 40 days postpollination), seeds were collected from individual plants, labeled, and used for summer sowing. Many mutants appeared in this generation, and a phenotypic survey was conducted, followed by strict self-pollination. Seeds were then collected from individual plants, and mutant seeds from individual plants were planted to observe and measure phenotypes.
Mutation detection in the mutants
The clean reads of the mutants were aligned to the S36 genome using BWA-MEM (V0.7.71) [79] with default parameters, and mapping results were obtained in SAM format. These results were subsequently processed using SAMtools (V1.9) [80] to convert the SAM format to BAM format, sort the BAM file, and obtain a consensus genotype for each locus. BCFtools (V1.18) [81] was then employed to convert the BAM format to VCF format. High-quality SNPs and INDELs (QUAL ≥ 30, DP ≥ 2, MQ ≥ 30) were used for subsequent mutation analysis. Additionally, SVs in the mutants were predicted using DELLY (V0.8.7) [82].
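The quality thresholds above can be illustrated with a short Python sketch that checks a single VCF data line. This is not the pipeline used in the study. It assumes that QUAL is taken from the sixth VCF column and that DP and MQ are the site-level values in the INFO column. The function names are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string into a dict; flag-only keys map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def passes_thresholds(vcf_line, min_qual=30.0, min_dp=2, min_mq=30.0):
    """Return True if a VCF record meets QUAL >= 30, DP >= 2 and MQ >= 30."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    if qual == ".":  # missing QUAL cannot satisfy the threshold
        return False
    info = parse_info(fields[7])
    try:
        return (float(qual) >= min_qual
                and int(info["DP"]) >= min_dp
                and float(info["MQ"]) >= min_mq)
    except (KeyError, ValueError):
        # Records lacking DP or MQ, or with malformed values, are rejected.
        return False


# Hypothetical record for illustration.
record = "chr1\t100\t.\tG\tA\t45.0\t.\tDP=12;MQ=60\tGT\t1/1"
print(passes_thresholds(record))
```

An equivalent filter can normally be applied directly with the filtering expressions of the variant-calling toolkit. The sketch only makes the three thresholds explicit.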
Mutant materials and growth conditions
A yellow-green leaf mutant, ES299, was obtained by EMS mutagenesis of the green leaf inbred line S36. Genetic analysis and gene mapping were carried out using the distant green leaf advanced inbred line G35 and the mutant ES299 as parents. All of the above cucumber plants were grown in a plastic greenhouse at 28–32°C under natural light conditions at the Tianjin Kerun Cucumber Research Institute. The VIGS experiment was carried out using the North China type cucumber XTMC. The infected seedlings were cultivated in an artificial climate chamber with a 16-h light (22°C) and 8-h dark (18°C) cycle.
Determination of growth indexes
The vertical distance from the cotyledon node to the apical bud (plant height) and the distance between two adjacent nodes (node length) of wild-type S36 and mutant ES299 plants grown for 3 months were measured with a ruler. The stem diameter at the cotyledon node was measured with a Vernier caliper. The total number of leaves on plants grown for 3 months was counted. Each index was measured in at least six biological replicates.
Determination of pigment contents
The pigment contents were determined according to the standard method of Lichtenthaler [83]. Cucumber tissues (0.2 g) were chopped and extracted with 95% ethanol for 24 to 48 h until the samples no longer faded. Subsequently, the absorbance of the extracts was measured at 665, 649, and 470 nm by a microplate fluorometer, with each measurement repeated three times [84]. The contents of chlorophyll a, chlorophyll b, and carotenoids were then calculated from these absorbance values using the published formulas [85].
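The formulas themselves are not reproduced in the text. As an illustration only, the Python sketch below uses the coefficients commonly attributed to Lichtenthaler and Wellburn for 95% ethanol extracts read at 665, 649, and 470 nm. Whether these exact coefficients were applied in this study is an assumption, as are the function names, the extraction volume parameter, and the example absorbance values.

```python
def pigment_concentrations(a665, a649, a470):
    """Return (chl_a, chl_b, carotenoids) in micrograms per mL of extract.

    Coefficients are the widely quoted values for 95% ethanol extracts;
    they are an assumption here, not taken from the paper.
    """
    chl_a = 13.95 * a665 - 6.88 * a649
    chl_b = 24.96 * a649 - 7.32 * a665
    carotenoids = (1000 * a470 - 2.05 * chl_a - 114.8 * chl_b) / 245
    return chl_a, chl_b, carotenoids


def content_per_gram(concentration_ug_per_ml, extract_volume_ml, fresh_weight_g):
    """Convert an extract concentration to micrograms per gram fresh weight."""
    return concentration_ug_per_ml * extract_volume_ml / fresh_weight_g


# Hypothetical absorbance readings and a hypothetical 10 mL extract of 0.2 g tissue.
chl_a, chl_b, car = pigment_concentrations(a665=0.5, a649=0.3, a470=0.6)
print(content_per_gram(chl_a, extract_volume_ml=10, fresh_weight_g=0.2))
```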
Genetic mapping of candidate genes
G35 was crossed with ES299 to produce the F_1_ generation, and F_1_ plants were selfed and backcrossed to obtain F_2_ and BC_2_ populations, respectively. Subsequently, the numbers of green leaf and yellow-green leaf plants in the segregating progeny populations were counted, and the Chi-square (χ^2^) test was used to analyze the segregation ratio of the trait. Equal amounts of DNA from F_2_ plants with extreme green leaf and yellow-green leaf phenotypes were pooled to construct two bulked pools for sequencing, and parental DNA was used to construct the parental pools. The sequencing reads were aligned to the cucumber reference genome (9930v3). Subsequently, SAMtools [80] and GATK [86] were used to process the data and obtain high-quality SNPs. All SNPs were annotated and mapped to the seven chromosomes of cucumber. The SNP index and ∆SNP index were calculated to determine the chromosomal regions linked to the mutant phenotype and the possible mutation sites [87]. Based on the MutMap results, KASP genotyping was performed on the candidate SNPs to further narrow down the candidate genes. A total of 96 F_2_ plants were used for KASP genotyping. The KASP thermal cycling conditions were set as described by Xi et al. (2018) [88].
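Two calculations in this section lend themselves to a short worked sketch: the χ^2^ goodness-of-fit test on segregation counts and the SNP index and ∆SNP index of the bulked pools. The Python example below is illustrative only. It assumes a 3:1 green to yellow-green expectation (a single recessive gene), takes the SNP index as the fraction of reads carrying the alternate allele, and defines the ∆SNP index as mutant pool minus green pool. All names and counts are hypothetical.

```python
import math


def chi_square_fit(observed, expected_ratio):
    """Return (chi2, p) for observed counts against an expected ratio.

    The p-value formula used here is valid for two classes (1 degree of freedom).
    """
    if len(observed) != 2 or len(expected_ratio) != 2:
        raise ValueError("this sketch handles exactly two phenotype classes")
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def snp_index(alt_reads, total_reads):
    """Fraction of reads at a site that carry the alternate allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def delta_snp_index(mutant_pool_index, green_pool_index):
    """Difference in SNP index between the two bulked pools."""
    return mutant_pool_index - green_pool_index


# Hypothetical F2 counts: 73 green and 23 yellow-green plants, tested against 3:1.
chi2, p = chi_square_fit([73, 23], [3, 1])
print(round(chi2, 4), p > 0.05)

# Hypothetical read counts at one candidate site.
print(delta_snp_index(snp_index(18, 20), snp_index(5, 20)))
```

A ∆SNP index close to 1 at a site marks an allele that is nearly fixed in the mutant pool and rare in the green pool, which is the signal used to delimit candidate regions.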
Tobacco ringspot virus-based VIGS system in cucumber
A specific CDS fragment of CsCAO was inserted into the pTRSV2 vector, which was then transformed into Agrobacterium tumefaciens GV3101. The Agrobacterium solutions containing the pTRSV1 and pTRSV2 vectors were mixed in equal volumes and incubated for 3 h at 28°C. Germinated XTMC cucumber seeds (seed buds) were infiltrated with the mixed bacterial solution for 20 min under a vacuum of −900 kPa. Subsequently, the seeds were placed at 25°C in the dark for 3 days. Finally, the cocultured seeds were planted in a light incubator (22°C/16-h light, 18°C/8-h dark) and grown for 1 month. The expression level of CsCAO was measured by quantitative real-time polymerase chain reaction (qRT-PCR), and plants with a 60% to 80% decrease in CsCAO expression were selected for subsequent experiments [89, 90].
Conserved domain prediction and three-dimensional structure analysis
The conserved domains of the CsCAO protein were predicted using the NCBI Conserved Domain Database (CDD) search tool (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). The three-dimensional structure of CsCAO was modeled and downloaded in PDB format. Structural visualization and comparative analysis of the wild-type and mutant proteins were performed using PyMOL, with a focus on alterations in secondary structure, spatial conformation, and potential active sites.
Supplementary Material
Web_Material_uhaf284
References
- 1. Guan J, Miao H, Zhang Z, et al. A near-complete cucumber reference genome assembly and Cucumber-DB, a multi-omics database. Mol Plant. 2024;17:1178–82. PMID: 38907525. doi: 10.1016/j.molp.2024.06.012
- 2. Lin T, Zhu G, Zhang J, et al. Genomic analyses provide insights into the history of tomato breeding. Nat Genet. 2014;46:1220–6. PMID: 25305757. doi: 10.1038/ng.3117
- 3. Zhao G, Lian Q, Zhang Z, et al. A comprehensive genome variation map of melon identifies multiple domestication events and loci influencing agronomic traits. Nat Genet. 2019;51:1607–15. PMID: 31676864. doi: 10.1038/s41588-019-0522-8
- 4. Xie D, Xu Y, Wang J, et al. The wax gourd genomes offer insights into the genetic diversity and ancestral cucurbit karyotype. Nat Commun. 2019;10:5158. PMID: 31727887. PMCID: PMC6856369. doi: 10.1038/s41467-019-13185-3
- 5. Lyu X, Xia Y, Wang C, et al. Pan-genome analysis sheds light on structural variation-based dissection of agronomic traits in melon crops. Plant Physiol. 2023;193:1330–48. PMID: 37477947. doi: 10.1093/plphys/kiad405
- 6. Liu Y, Du H, Li P, et al. Pan-genome of wild and cultivated soybeans. Cell. 2020;182:162–176.e13. PMID: 32553274. doi: 10.1016/j.cell.2020.05.023
- 7. Cai X, Chang L, Zhang T, et al. Impacts of allopolyploidization and structural variation on intraspecific diversification in Brassica rapa. Genome Biol. 2021;22:166. PMID: 34059118. PMCID: PMC8166115. doi: 10.1186/s13059-021-02383-2
- 8. Li H, Wang S, Chai S, et al. Graph-based pan-genome reveals structural and sequence variations related to agronomic traits and domestication in cucumber. Nat Commun. 2022;13:682. PMID: 35115520. PMCID: PMC8813957. doi: 10.1038/s41467-022-28362-0
