Chromosome-level assembly of the Isodon lophanthoides genome
Yubang Gao

Abstract
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2- —Nanyang Normal University 10.13039/501100004943
- —Natural Science Foundation of Henan Province 10.13039/501100006407
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Phylogenetic Studies · Myxozoan Parasites in Aquatic Species · Genetic diversity and population structure
Introduction
1
Isodon lophanthoides (Figure 1A) is a perennial herb of the Lamiaceae family distributed across China, India, Myanmar, Nepal, and Vietnam (Wen et al., 2011; Zhang et al., 2022). I. lophanthoides contains a variety of bioactive compounds, such as terpenoids, flavonoids, phenolics, and polysaccharides (Lin et al., 2008; Wen et al., 2011; Zhou et al., 2014). I. lophanthoides is traditionally used to alleviate symptoms of acute jaundice hepatitis, arthritis, cholecystitis, enteritis, pharyngitis, ascariasis, and leprosy (Jiang et al., 2000). This herb is utilized in the preparation of therapeutic teas and instant granules. Additionally, it is used as an ingredient in soups and cooking. This plant plays a significant role in traditional Chinese medicine. It is cultivated extensively as a commercial raw material for the medicinal product “Xihuangcao”. The absence of genomic resources for I. lophanthoides has severely limited its genetic improvement and research on its active components. In this study, we assembled the first chromosome-level genome of I. lophanthoides and identified key genes involved in terpene biosynthesis. This work provides a valuable foundation for genetic improvement and exploring its active compounds’ biosynthetic pathways.
Chromosome-scale assembly of the I. lophanthoides genome. (A) Contact map of I. lophanthoides genome. (B) Circos plot displaying the 12 chromosomes in the I. lophanthoides genome. a. Length of each pseudochromosome (Mb). b. Distribution of repetitive sequences. c. Distribution of gene density. d. Distribution of the GC content. e. The phenotype of I. lophanthoides (The flower pot size was 15 cm).
Materials and methods
2
Material collection and genome sequencing
2.1
Young leaves of I. lophanthoides, cultivated at the Artemisia Engineering Technology Center of Nanyang Normal University, were collected to extract high-quality DNA for genome sequencing. After DNA extraction, ultrasonic shearing was applied. The sequencing library was prepared through end-repair, adapter ligation, and amplification, followed by sequencing on the DNB-Seq T7 platform. For long reads, the sequencing library was prepared using the Oxford Nanopore ligation sequencing kit (SQK-LSK109). Sequencing was then performed on an R9 flow cell on the PromethION platform. For Hi-C reads, DNA was fixed in a 4% formaldehyde solution. Digestion was performed with the MboI enzyme, and digested fragments were labeled with biotin-14-dCTP. The crosslinked fragments were then blunt-end repaired and sequenced on the DNB-Seq T7 platform.
Genome survey
2.2
The k-mer method was used to estimate genome size and heterozygosity before genome assembly. The k-mer distribution was calculated from short reads using Jellyfish (Marcais and Kingsford, 2012) with k-mer length set to 21. The genome size and heterozygosity rate was estimated using the GenomeScope2 (Ranallo-Benavidez et al., 2020).
Genome assembly and gene annotation
2.3
Genome assembly was conducted using NextDenovo (Hu et al., 2024) with the overlap-layout-consensus algorithm and default parameters. NextPolish (Hu et al., 2020) was used to polish the genome assembly, applying two rounds of long-read and four rounds of short-read data correction. Hi-C reads were aligned to contigs using Juicer (Durand et al., 2016) and BWA (Jung and Han, 2022), after which the 3D-DNA pipeline (Dudchenko et al., 2017) corrected misassemblies and ordered contigs, integrating them into scaffolds. Manual inspection of scaffolds was then performed using Juicebox Assembly Tools. The final chromosome-length scaffolds were constructed using the 3D-DNA pipeline, with all computational tools run using default parameters. Misassemblies were identified and corrected based on irregular contact patterns in Hi-C data.
Repeat elements in genomes were identified using RepeatModeler (Flynn et al., 2020), and the repeat library was then processed with RepeatMasker (Tarailo-Graovac and Chen, 2009) to annotate repeats across the genome. Transposable elements (TEs) were classified using TEsorter (Zhang et al., 2022). Simple sequence repeat (SSR) markers were predicted using MISA (Beier et al., 2017). Protein-coding genes in the I. lophanthoides genome were identified using an integrative strategy that combined ab initio prediction, protein homology searches, and RNA sequencing data. For ab initio prediction, we used Augustus (Stanke et al., 2006), SNAP (Korf, 2004), GlimmerHMM (Majoros et al., 2004), and GeneMark-ET (Brůna et al., 2020) to identify gene structures in the repeat-masked genome. For protein homology prediction, protein data from sequenced Lamiaceae species were downloaded from the NCBI database and aligned for homology assessment. Additionally, HISAT2 (Kim et al., 2019) was used to map RNA-seq data (PRJNA679679) from various tissues to the genome. PASA was used to predict open reading frames. EVidenceModeler (Haas et al., 2008) integrated results from the three methods, enabling a unified gene prediction. Functional annotation was performed using BLAST (Ye et al., 2006) against NR, SwissProt, eggNOG, InterPro, GO, and KEGG databases. Functional annotations for protein-coding genes were integrated using the above methods.
Phylogenetic analysis
2.4
Protein sequences of A. trichopoda, O. sativa, V. vinifera, T. cacao, A. thaliana, S. lycopersicum, C. canephora, T. grandis, L. japonicus, S. miltiorrhiza, I. rubescens, and A. decumbens were downloaded for subsequent analyses. OrthoVenn3 (Sun et al., 2023) was used for orthology, phylogenetic, and gene family analyses. Pairwise sequence similarity was determined using BLASTP and OrthoMCL (Li et al., 2003) Markov clustering. Phylogenetic trees were constructed using FastTree2 (Price et al., 2010) with the maximum likelihood method and the JTT+CAT model, with node reliability assessed by the SH test. A divergence tree was constructed using single-copy genes and fossil evidence. Divergence times between A. thaliana and T. cacao, S. lycopersicum and C. canephora, A. thaliana and V. vinifera, A. trichopoda and V. vinifera, and L. japonicus and T. grandis were estimated using r8s (Sanderson, 2003). CAFE (Mendes et al., 2020) was used to compare cluster size differences between ancestors and each species to determine gene family expansions and contractions. A random birth-and-death model was applied to assess gene family changes across lineages in the phylogenetic tree. Conditional likelihood was used as the test statistic, with p-values of ≤ 0.01 considered significant.
Duplicated gene analysis
2.5
I. lophanthoides protein sequences were compared to identify homologous blocks. The MCScanX (Wang et al., 2012) pipeline was applied with default settings to map homologous blocks within species. The YN model in KaKs_Calculator 2.0 (Wang et al., 2010) was used to calculate nonsynonymous (Ka) and synonymous (Ks) substitution rates, as well as their ratio (Ka/Ks), for duplicate gene pairs.
Data
3
Genome assembly
3.1
DNA was isolated from I. lophanthoides samples cultivated in the laboratory. Genome size and heterozygosity were estimated using DNB short-read sequencing data. The estimated genome size from short reads was 365,686,342 bp, with a heterozygosity rate of 0.64% (k-mer length = 21). DNA from the same plant was used to assemble the I. lophanthoides genome with a combination of Nanopore and Hi-C technologies (Supplementary Table S1). Assembly with Nanopore long reads produced a genome with a total length of 379,974,750 bp, containing 70 contigs (N50 = 17,265,197 bp). After Hi-C scaffolding, 378,710,417 bp (99.67%) of the sequence was placed into 12 linkage groups (Figure 1A). These linkage groups corresponded to the 12 chromosomes of I. lophanthoides (N50 = 32,786,395 bp). BUSCO assessment showed that the assembly covered 98% of the single-copy orthologs in the embryophyta_odb10 database (1,614 genes; Supplementary Table S2). The consensus quality value (QV) was 35.77, indicating that the genome is highly accurate. The genome’s LAI value is 13.78, reaching the level of the reference genome.
Gene prediction and gene annotation
3.2
50.52% of the genome assembly consisted of repetitive elements, with half of this proportion (30% of the genome) being retrotransposons. This retrotransposon content is similar to that in I. rubescens. In the I. lophanthoides genome, 9.38% of the copies were identified as Copia elements, and 9.93% as Gypsy elements. We further classified transposable elements (TEs) using Tesort (Zhang et al., 2022), identifying 5,880 Helitrons, 4,015 LINEs, 94,428 LTRs, and 13,042 TIRs. Additionally, 153,599 SSR markers were predicted using MISA (Beier et al., 2017).
EVidenceModeler was used to integrate outputs from transcriptome data, ab initio predictions, and homology-based predictions. A total of 30,641 genes were identified, of which 28,541 were protein-coding (Figure 1B). These genes contained an average of 4.8 exons, with an average coding sequence (CDS) length of 1,112 bp (Supplementary Table S3). Functional annotation of 26,492 protein-coding genes (92.8%) was achieved using GO, NR, KEGG, TAIR, and InterProScan databases. A total of 40 genes were associated with terpene metabolism, including 12 genes in the MEA pathway and 28 in the MEP pathway (Supplementary Table S4). Non-coding RNA prediction identified 297 rRNAs, 541 tRNAs, 101 miRNAs, and 341 snRNAs.
Comparative genomic analysis of I. lophanthoides with other plants
3.3
To determine the evolutionary relationships between I. lophanthoides, I. rubescens, and other plant species, a phylogenetic tree was constructed using a total of 427,238 proteins from 12 plant species (Supplementary Table S5). These proteins were clustered into 35,165 orthogroups, of which 282 were single-copy genes (Supplementary Table S6). With known divergence times added, the phylogenetic tree indicated that the common ancestor of I. lophanthoides and I. rubescens diverged approximately 12.988 million years ago (MYA) (Figure 2A). In I. lophanthoides, 48 gene families showed significant expansion and 208 showed significant contraction. The number of expanded gene families was smaller than in I. rubescens. Compared with other Lamiaceae species, I. lophanthoides had the fewest unique gene families (Figure 2B). A transposon burst occurred in I. rubescens gene families around 1 MYA (Figure 2C). The Ks method was used to analyze orthologous gene pairs, revealing no lineage-specific whole-genome duplication events other than the shared peak in Lamiaceae (Figure 2D). Further analysis of selection-affected genes identified 323 genes under positive selection and 2,832 under negative selection (Figure 2E). Genes under positive selection were enriched in processes such as “response to salicylic acid” (Supplementary Figure S3).
Evolutionary analysis of the I. lophanthoides genome. (A) A phylogenetic tree based on shared single-copy gene families, gene family expansions, and contractions among I. lophanthoides and ten other species. The bar chart on the right displays gene family clustering in I. lophanthoides and ten other plant species. (B) Venn Diagram Representation of Gene Family Overlaps and Specificities Among I. lophanthoides, I. rubescens L. japonicus, T. grandis, and S. miltiorrhiza in Labiatae. (C) Density plot showing the burst of LTR-RTs in I. lophanthoides. (D) Ks value distribution plot for orthologous gene sets of I. lophanthoides. (E) Ka/Ks value distribution plot for orthologous gene sets of I. lophanthoides.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Beier S. Thiel T. Münch T. Scholz U. Mascher M. J. B. (2017). MISA-web: a web server for microsatellite prediction. Bioinformatics. 33 (16), 2583–2585. doi: 10.1093/bioinformatics/btx 198 28398459 PMC 5870701 · doi ↗ · pubmed ↗
- 2Brůna T. Lomsadze A. Borodovsky M. (2020). Gene Mark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genomics Bioinf. 2, lqaa 026. doi: 10.1093/nargab/lqaa 026 PMC 722222632440658 · doi ↗ · pubmed ↗
- 3Dudchenko O. Batra S. S. Omer A. D. Nyquist S. K. Hoeger M. Durand N. C. . (2017). De novo assembly of the Aedes a Egypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95. doi: 10.1126/science.aal 3327 28336562 PMC 5635820 · doi ↗ · pubmed ↗
- 4Durand N. C. Shamim M. S. Machol I. Rao S. S. Huntley M. H. Lander E. S. . (2016). Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98. doi: 10.1016/j.cels.2016.07.002 27467249 PMC 5846465 · doi ↗ · pubmed ↗
- 5Flynn J. M. Hubley R. Goubert C. Rosen J. Clark A. G. Feschotte C. . (2020). Repeat Modeler 2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. 117, 9451–9457. doi: 10.1073/pnas.1921046117 32300014 PMC 7196820 · doi ↗ · pubmed ↗
- 6Haas B. J. Salzberg S. L. Zhu W. Pertea M. Allen J. E. Orvis J. . (2008). Automated eukaryotic gene structure annotation using E Vidence Modeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, 1–22. doi: 10.1186/gb-2008-9-1-r 7 PMC 239524418190707 · doi ↗ · pubmed ↗
- 7Hu J. Fan J. Sun Z. Liu S. (2020). Next Polish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255. doi: 10.1093/bioinformatics/btz 891 31778144 · doi ↗ · pubmed ↗
- 8Hu J. Wang Z. Sun Z. Hu B. Ayoola A. O. Liang F. . (2024). Next Denovo: an efficient error correction and accurate assembly tool for noisy long reads. Genome Biol. 25, 1–19. doi: 10.1186/s 13059-024-03252-4 38671502 PMC 11046930 · doi ↗ · pubmed ↗
