Chromosome-scale genome assembly of the fire blight resistant Malus fusca accession MAL0045, donor of FB_Mfu10
Ofere Francis Emeriewen, Thomas Wolfgang Wöhner, Henryk Flachowsky, Andreas Peil

TL;DR
Scientists assembled the genome of a wild apple variety resistant to fire blight, helping to study its disease resistance.
Contribution
A chromosome-scale genome assembly of a fire blight resistant wild apple variety is presented.
Findings
A genome assembly of MAL0045 was created with 99.75% of sequences anchored onto 17 chromosomes.
The assembly has a high BUSCO score of 97.46% and predicted 47,388 genes.
The assembly will aid in functional and comparative genomics studies of fire blight resistance.
Abstract
The wild apple, Malus fusca accession MAL0045, is highly resistant to fire blight disease, caused by the bacterial pathogen, Erwinia amylovora. A major resistance locus, FB_Mfu10 was identified on chromosome 10 of MAL0045 including other contributory loci on chromosomes 16, 4, and 15. Here, we report a chromosome-scale genome assembly of MAL0045 to facilitate the studies of its fire blight resistance. PacBio sequencing and Illumina sequencing for Hi-C contig anchorage were employed to obtain the genome. A total of 669.46 Mb sequences were anchored onto 17 chromosomes, taking up 99.75% of total contig length. Contigs anchored onto chromosomes were further ordered and orientated, where a total of 637.67 Mb sequences were anchored onto chromosomes in proper order and orientation, resulting in a final anchoring ratio of 95.25%. The BUSCO score of this assembly is 97.46%. Further, a total of…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4- —https://doi.org/10.13039/501100001659Deutsche Forschungsgemeinschaft (German Research Foundation)
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPlant-Microbe Interactions and Immunity · Plant Disease Resistance and Genetics · Plant Pathogens and Fungal Diseases
Background & Summary
The genus Malus Mill. of the Rosaceae family comprises the domesticated apple (Malus domestica Borkh.) and its wild relatives^1,2^. Members of this genus, especially the domesticated apple, are highly susceptible to fire blight, which is the most destructive bacterial disease, caused by the gram-negative bacterium Erwinia amylovora^3^ that plague the genus. However, some wild Malus genotypes are resistant and are donors of fire blight resistance loci and genes that have been identified and mapped to the apple genome^4,5^. One such wild genotype is the Oregon crab apple, Malus fusca accession MAL0045 of the genebank of the Julius Kühn Institute in Germany, which of all the M. fusca accessions, was found to be the most resistant to E. amylovora following artificial shoot inoculations with different strains of the bacterium^6–8^. Following chromosome walking studies involving the development of tightly linked markers using published apple genome sequences^9,10^, and sequencing of bacterial artificial chromosome (BAC) clones that span the region of resistance, the locus, FB_Mfu10, was precisely mapped using 1,888 F_1_ individuals and candidate genes were proposed^11,12^.
Furthermore, due to the strong resistance of MAL0045 to highly virulent North American strains of the pathogen e.g. Ea3049, but a large ratio of susceptibility in its F_1_ progeny to Ea3049, it was hypothesized that homozygous loci are present in the MAL0045 genome contributing to its fire blight resistance, which were not detected in the F_1_ progeny. Two resistance loci regions on chromosome 4 and 15 of MAL0045, not detected in the F_1_ progeny, were subsequently identified using F_2_ progeny that contribute to the resistance of MAL0045^13^. The FB_Mfu10 locus and associated candidate genes were analysed using the recently published genome of another M. fusca accession, PI 589975, of the United States Department of Agriculture (USDA) Plant Genetic Resources Unit (PGRU) Malus collection. It was therein hypothesized by the authors of the genome of PI 589975^14^ that copy number variation (CNV) of the G-type lectin S-receptor-like serine/threonine protein kinase genes, which were previously proposed^11^, potentially contributes to the resistance of M. fusca. However, it is important to note that PI 589975 is not the donor of FB_Mfu10, and there is variability of fire blight resistance and susceptibility even amongst different accessions of the same genotype. Therefore, it is imperative to sequence and annotate the genome of MAL0045 and make it available to the scientific community.
Here, we report a chromosome-scale genome sequence using PacBio sequel II platform and Hi-C anchoring of MAL0045 (MfuMAL45).
Methods
Plant material and DNA isolation
Young leaves and buds of Malus fusca accession MAL0045 were harvested from the orchard in spring, immediately frozen in liquid nitrogen, and thereafter stored in −80 °C until required for DNA extraction. High-quality genomic DNA was extracted from leaves using a modified CTAB method according to Almakarem et al.^15^. RNaseA was used to remove RNA contaminants. The quality of the DNA was checked by agarose gel electrophoresis before use in further analyses.
Genome survey, PacBio HiFi (CCS) library construction and sequencing
A short-read library of 350 bp was constructed for genome survey sequencing. Sequencing was performed on an Illumina NovaSeq platform (Illumina, CA, USA), resulting in 32.83 Gb of clean reads.
High molecular weight DNA was sheared into ~ 15 kb fragments using Megaruptor® 2, and SMRTbell library was constructed using the SMRTbell Express Template Prep kit 3.0 (Pacific Biosciences). Briefly, first enzymatic reaction removed single-stranded overhangs from 10 µg of the sheared DNA, which was subsequently treated with repair enzymes. Subsequently, ends of the double-stranded fragments were polished and tailed with an A-overhang at the 3′end. Ligation with T-overhang SMRTbell adapters was performed at 20 °C for 60 minutes. Following ligation, the SMRTbell library was digested by exonuclease and purified with 0.45X AMPure PB beads. The size distribution and concentration of the library were assessed using the FEMTO Pulse automated pulsed-field capillary electrophoresis instrument (Agilent Technologies, Wilmington, DE) and the Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA), respectively. Following library characterization, 3 µg was subjected to a size selection step using the Sage ELF system (Sage Science, Beverly, MA) to collect SMRTbells 15–18 kb. After size selection, the library was purified with 1X AMPure PB beads. Library size and quantity were assessed using the FEMTO Pulse and the Qubit dsDNA HS reagents Assay kit. Sequencing primer and Sequel II DNA Polymerase were annealed and bound, respectively, to the final SMRTbell library. The library was loaded at an on-plate concentration of 55 pM using diffusion loading. SMRT sequencing was performed using a single 8 M SMRT Cell on the Sequel II System with Sequel II Sequencing Kit, 1800-minute movies by Pacific Biosciences (USA). Low-quality reads and sequence adapters were filtered out leading to 28.80 Gb of CCS data. This is a sequencing depth of 49 X with an N50 value of 15.54 kb, and average read length of 15.06 Kb (Table 1). The distribution of read length is summarized in Table 2.Table 1. Summary of PacBio HiFi sequencing.Data_TypeReads_numReads_baseReads_LenN50Reads_LenMeanReads_LenMaxCCS1,913,13428,804,521,56815,54215,05650,161Reads_num: Counts of reads; Reads_bases(bp): Total base count; Reads_LenN50(bp): Sequence length of the shortest reads at 50% of total bases; Reads_LenMean(bp): Average read length; Reads_LenMax(bp): Longest read length. bp = base pair.Table 2. Read length distribution.LengthReadsNumberTotalLengthAverLength50020001,1522,059,5141787.77200040008,17126,752,1993274.044000600013,18965,837,5204991.856000800012,92490,657,2217014.6480001000046,128434,465,9439418.701000012000330,7553,685,290,28611142.051200014000434,8215,651,084,25012996.341400016000379,8845,682,619,94614958.821600018000278,8414,723,254,81716938.8818000407,2698,442,499,87220729.54Note: Length: Read length range; ReadsNumber: Number of reads within corresponding read length range; TotalLength: Total length of reads within corresponding length range; AverLength: Average length of reads within corresponding length range.
We used short reads that were generated from the Illumina platform for the estimation of the genome size, the level of heterozygosity and repeat content of the genome. Long reads from the PacBio platform were used for genome assembly.
Genome assembly by CCS data and features estimation from K-mer
We assembled the 28.80 Gb of CCS data using Hifiasm (v 0.16) software^16^ resulting in a genome containing 575 contigs. Short-reads from the Illumina platform were quality filtered by fastp^17^ using the parameters -q 10 -u 50 -y -g -Y 10 -e 20 -l 100 -b 150 -B 150. We counted the 21-kmers with Jellyfish software^18^ and calculated the genome characteristics using Genomescope software. The genome size of MfuMAL45 was estimated to be 587.60 Mb. The heterozygosity of the genome was 0.98%. A k-mer distribution map with k = 21 is shown in Fig. 1.Fig. 1. Frequency distribution of coverage and K-mer numbers.
Hi-C sequencing and Contigs anchoring
For Hi-C based contig anchoring, a fragment library from 300–700 bp insert size was constructed and sequenced through Illumina NovaSeq platform resulting in 70.77 Gb clean data following the removal of adapter sequences and low quality reads. We mapped the 70.77 Gb to the MfuMAL45 genome using BWA (bwa-0.7.17) with the default parameters. We successfully clustered 548 contigs into 17 groups with the agglomerative hierarchical clustering method in Lachesis^19^. In addition, clustered contigs were ordered using Lachesis^19^. Subsequently, 669.46 Mb sequences were anchored onto 17 chromosomes, which took up 99.75% of the total contig length (Table 3). We obtained the chromosomal-level high-quality assembly in anchored sequences of 637.67 Mb in confirmed order and orientation, which took up 95.25% of the total sequence length. Statistics on Hi-C contigs assembly is shown in Table 3. Hi-C interaction heatmap within the chromosomes is shown in Fig. 2.Table 3. Statistics of Hi-C contig anchorage.Linkage GroupCluster lengthOrder lengthLG154,374,75154,301,286LG246,601,24336,912,530LG344,894,74944,691,075LG441,986,52741,652,006LG546,350,87641,110,250LG640,583,97240,251,138LG738,508,17438,338,780LG837,398,22837,350,734LG937,340,03337,234,748LG1036,572,40636,489,418LG1135,784,90135,302,691LG1235,696,90335,641,590LG1347,364,43532,713,194LG1433,296,93933,196,904LG1532,289,06432,251,003LG1632,215,89532,137,758LG1728,205,52628,093,631Total669,464,622 (99.75%)637,668,736 (95.25%)Fig. 2. Hi-C interaction heatmap between the assembled 17 chromosomes of Malus fusca MfuMAL45 genome.
Annotation of repetitive sequences
The following workflows annotated transposon element (TE) and tandem repeat. We identified TEs by a combination of homology-based and de novo approaches. Firstly, we used RepeatModeler (http://www.repeatmasker.org/RepeatModeler/)^20^ to customize a de novo repeat library of the MfuMAL45 genome. Subsequently, full-length long terminal repeat retrotransposons (fl-LTR-RTs) were identified with LTRharvest (v1.5.9)^21^ and LTR_finder (v2.8)^22^. The fl-LTR-RTs and LTR library were thereafter produced by LTR_retriever^23^. A non-redundant species-specific TE library was constructed by combining the de novo TE sequences library using the Dfam (v3.5) database. Finally, TE sequences in the MfuMAL45 genome was identified and grouped by homology-based search in the library using RepeatMasker (v4.12)^24^. Tandem repeats were annotated with Tandem Repeats Finder (TRF v4.09)^25^ and the MIcroSAtellite Identification Tool (MISA v2.1)^26^. A total of 366,217,460 bp TE sequence was identified, which resulted in 54.57% of the entire genome. Detailed information on TE prediction is provided in Table 4.Table 4. Statistics of TE sequences.TypeNumberLengthRate (%)ClassI:Retroelement426,652301,667,10944.95ClassI/LINE29,7248,256,9751.23ClassI/LTR/Caulimovirus1,1111,440,7680.21ClassI/LTR/Copia85,52583,190,56512.40ClassI/LTR/ERV3,999326,0710.05ClassI/LTR/Gypsy83,581106,889,61315.93ClassI/LTR/Ngaro19312,9610.00ClassI/LTR/Pao2,082478,0320.07ClassI/LTR/Unknown201,06997,371,49214.51ClassI/SINE19,3683,700,6320.55ClassII:DNA transposon256,68764,548,6649.62ClassII/Academ2800.00ClassII/CACTA2,3851,399,8360.21ClassII/Crypton176210.00ClassII/Dada20711,2340.00ClassII/Ginger221,2520.00ClassII/Helitron130,40633,638,3655.01ClassII/IS3EU1468,0730.00ClassII/Kolobok20920,3630.00ClassII/Maverick1036,1490.00ClassII/Merlin904,6850.00ClassII/Mutator46631,6220.00ClassII/P592,8930.00ClassII/PIF-Harbinger4,5082,124,2970.32ClassII/PiggyBac281,3310.00ClassII/Tc1-Mariner21012,1380.00ClassII/Unknown112,82724,736,7393.69ClassII/Zisupton883,9780.00ClassII/hAT4,9142,545,0080.38Unknown311,6870.00Total683,370366,217,46054.57
Annotation of protein coding genes
We integrated three approaches, namely, de novo prediction, homology search, and transcript-based assembly, to annotate protein-coding genes in the genome. The de novo gene models were predicted using two ab initio gene-prediction software tools, Augustus (v3.1.0)^27^ and SNAP(2006-07-28). For the homolog-based approach, GeMoMa (v1.7) software^28^ was performed by using reference gene models from the other 4 species/genome versions (Arabidopsis thaliana v10, Malus domestica GDDH13/ HFTH1/Mfusca_hap1/hap2). For the transcript-based prediction, RNA-sequencing data was mapped to the reference genome using Hisat (v2.1.0)^29^ and assembled using Stringtie (v2.1.4)^30^. GeneMarkS-T (v5.1) was used to predict genes based on the assembled transcripts. We used PASA (v2.4.1) software to predict genes based on the unigenes. Full-length transcripts from the PacBio were assembled using Trinity (v2.11)^31^. The gene models identified from both approaches were combined using the EVM software (v1.1.1) and updated by PASA. In total, 47,174 protein-coding genes were predicted in the assembled MfuMAL45 genome using the criteria above (Fig. 3).Fig. 3. Predicted protein-coding genes from the assembled MfuMAL45 genome following ab initio (blue), homology-based (green) and transcriptome-based (yellow) analyses. Figure also shows genes the number of genes that were predicted in common by respective analysis methods.
Functional annotation of protein coding genes
We determined gene functions by aligning protein sequences to the National Center for Biotechnology Information (NCBI) Non-Redundant (NR), EggNOG^32^, KOG, TrEMBL^33^, InterPro^34^ and Swiss-Prot^33^ protein databases using diamond blastp (diamond v2.0.4.142) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database^35^. The protein domains were annotated using InterProScan (v5.34–73.0)^36^. The PFAM database^37^ was used to identify motifs and gene domains. Gene Ontology (GO) IDs for each gene were obtained from TrEMBL, InterPro and EggNOG. In total, approximately 46,440 (about 98.44%) of the predicted protein-coding genes of MfuMAL45 could be functionally annotated with known genes, conserved domains, and Gene Ontology terms, shown in Fig. 4 and Table 5.Fig. 4. Functional annotation of predicted protein-coding genes from the assembled MfuMAL45 genome using EggNOG (a) up, KOG (b) middle, and GO (c) down.Table 5. Statistics of Gene function annotation.Anno_DatabaseAnnotated_NumberAnnotated_RatioGO_Annotation35,53775.33KEGG_Annotation30,37864.4KOG_Annotation21,81846.25Pfam_Annotation34,12172.33Swissprot_Annotation30,91065.52TrEMBL_Annotation46,36398.28eggNOG_Annotation36,31076.97nr_Annotation45,54096.54All_Annotated46,44098.44
Annotation of Non-coding RNA genes
We used tRNAscan-SE (v1.3.1)^38^ algorithms with default parameters to identify the genes associated with tRNA. For rRNA identification, we used barrnap (v0.9) with default parameters to identify the genes associated with rRNA. MiRNAs and snRNAs were identified by Infernal (v1.1)^39^ software against the Rfam (v14.5)^37^ database with default parameters. A total of 3,476 tRNA, 7,845 rRNA, 123 miRNA, 109 snRNA and 104 snoRNA were identified.
Pseudogene prediction
Pseudogenes were identified by scanning the genome using GenBlastA (v1.0.4) program^40^ following the prediction of functional genes. Putative candidates were then analyzed by searching for non-mature mutations and frame-shift mutations using GeneWise (v2.4.1)^41^ leading to the identification of 262 pseudogenes with a total length of 808, 235 bp and an average length of 3,084.87 bp.
Data Records
The raw data are deposited in the European Nucleotide Archive (ENA) database under study accession PRJEB77885^42^ with accession numbers of Illumina DNA short reads (ERR14104108), RNA (ERR14104109), Hi-C short reads (ERR14104110) and PacBio reads (ERR14104112). The final assembly, repetitive sequences, structural and functional gene annotations are available from figshare^43^ and the European Nucleotide Archive (ENA)^44^.
Technical validation (genome quality evaluation)
The assembled genome was also subjected to BUSCO v5.2.2 with the OrthoDB to evaluate the completeness of the genome^45^. BUSCO score of this assembly is 97.46%. Overall, 98.08% complete and 0.74% partial of the BUSCOs were identified in the assembled genome. In addition, we aligned short reads from the Illumina platform to the genome, which resulted in a high alignment ratio that demonstrated the high quality of contig assembly. The Cegma (v2.5) was used to evaluate the integrity of the final genome assembly. We also used bwa to align short reads (Illumina) with current genome assembly to assess the completeness and distribution of these reads on current assembly based on mapping ratio, genome coverage and sequencing depth distribution. The statistics is shown in Table 6. Minimap2^46^ was applied to align Third-generation sequencing (TGS) data, Hifi reads, back to the assembly to assess its completeness and evenness of sequencing data. Statistic on TGS data alignment showed that of 1,913,134 total reads, 1,907,844 were mapped representing 99.72%. For gene prediction, embryophyta database in BUSCO^45^ containing 1,614 core genes was employed. Using BUSCO v4.0 to assess the integrity of gene prediction, 98.14% of BUSCO core genes were identified in predicted gene list, which indicates a high integrity.Table 6. Statistics on NGS data alignment and Sequencing depth and coverage.Total_readsMapped_readsMapped (%)Properly_mapped_readsProperly_mapped (%)238,059,580237,036,44599.57223,841,73494.03**Average depthCoverageCoverage(≥5X)****Coverage(≥10X)****Coverage(≥20X)**4999.9299.7199.2895.83
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Hanke, M. V. et al. Malus × domestica apple. In Biotechnology of fruit and nut crops (pp. 440–473). Wallingford UK: CAB International (2020).
- 2Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol, 29, 644.10.1038/nbt.1883 PMC 357171221572440 · doi ↗ · pubmed ↗
