CDMMM: a comprehensive platform of traditional Indian medicinal plant DNA barcodes and metabolite fingerprints database
Chigateri M. Vinay, Akshay Pramod Ware, Kannath U. Sanjay, Debyani Samantray, Reju R. Krishnan, Keyur Raval, Mahendran Sekar, Yarappa Lakshmikanth Ramachandra, Bobby Paul, Padmalatha S. Rai

TL;DR
CDMMM is a new database for Indian medicinal plants that includes DNA barcodes and metabolite data to help identify and authenticate herbal medicines.
Contribution
CDMMM provides a comprehensive, experimentally validated database for Indian medicinal plants with DNA barcodes and metabolite fingerprints.
Findings
CDMMM includes 89 DNA barcode accessions from 67 plant species.
The database contains 3033 annotated metabolites and 1414 therapeutic targets linked to 441 diseases.
CDMMM supports taxonomy, species identification, and drug discovery for traditional Indian medicinal plants.
Abstract
Herbal medicines, derived from medicinal plants, are in high demand due to global population growth and the increasing prevalence of chronic diseases; however, the use of substitutes or adulterants can compromise the quality of these medicines. DNA barcoding and metabolite fingerprinting are used to identify plants and ensure the safety of drugs. The effectiveness of authentication methods depends on the availability and coverage of the reference library. However, reference DNA barcodes and metabolite fingerprint libraries for traditional Indian medicinal plants are lacking, which hinders the authentication of herbal drugs and the elucidation of the therapeutic effects of secondary metabolites. In the present study, we developed a user-friendly ‘Comprehensive Database of Medicinal Plants, Molecular Markers, and Metabolite Fingerprinting (CDMMM)’ that provides extensive details on…
Manipal Academy of Higher Education, Manipal
Taxonomy
Topics: Traditional Chinese Medicine Analysis · Biological and pharmacological studies of plants · Berberine and alkaloids research
Introduction
Herbal medicines rely on medicinal plants to treat a variety of ailments^1^. Population growth and the increasing prevalence of chronic diseases are driving the growth of the herbal medicine market. The global herbal medicine market was valued at USD 165.13 billion in 2023 and is projected to reach USD 386.07 billion by 2032, while the herbal medicine market in India was valued at USD 60.52 billion in 2023 and is projected to grow to USD 110.08 billion by 2032^2,3^. India plays a significant role in the global pharmaceutical industry, with plant-based compounds forming the foundation for a substantial portion of its drug market. Approximately 25% of all prescription drugs worldwide, and 70% of modern medications within India itself, originate from plant sources^4^. This wealth of botanical material has made India the second-largest exporter of medicinal plants and raw materials globally^5^. The National Medicinal Plant Board of India lists 960 medicinal plants in trade, of which 178 are consumed in quantities exceeding 100 tons every year^6^. In India, the demand for herbal medicines has increased sharply since the COVID-19 pandemic^7^. The increasing global demand for herbal medicines necessitates robust quality control measures to ensure their efficacy and safety, especially given the use of substitutes and adulterants^8^.
Traditional methods of identification include morphological, organoleptic, microscopic, and chemical-based methods. However, these techniques lead to the misidentification of closely related species and lack sensitivity in identifying the original plant materials in herbal drugs^9^. With advanced sequencing technologies, DNA barcodes are more efficient at accurately authenticating medicinal plants and raw drugs than conventional authentication methods^10^. DNA barcodes are short DNA sequences from the genome used in taxonomic research^11^. Some candidate markers, such as ITS, psbA-trnH, rbcL, trnL-trnF, and matK, are widely used in plant species identification^12^. Several public databases, such as GenBank^13^ and the Barcode of Life Data System (BOLD)^14^, serve as critical resources for genomic authentication. However, the effectiveness and accuracy of DNA barcode-based species identification depend on the coverage of the amplicon by the reference sequence and the base quality^10^. Some specialized databases, such as the MMDBD (Medicinal Materials DNA Barcode Database), provide a dedicated platform for Traditional Chinese Medicine (TCM) plant DNA barcode sequences, thereby enhancing the identification and authentication of raw drugs^15^. However, a comprehensive digital platform of reference DNA barcode sequences for Indian medicinal plants is not available for reliable identification and authentication of raw drugs.
Medicinal plants produce various metabolites that are essential for plant physiology through signaling pathways and biochemical reactions^16^. Currently, several broad-spectrum metabolite databases exist for metabolome research, including MassBank^17^, the Human Metabolome Database (HMDB)^18^, Metlin^19^, and the Metabolomics Workbench^20^. Plant-specific metabolome databases also exist, such as the Plant Metabolic Network^21^, the Golm Metabolome Database^22^, RefMetaPlant^23^, and the ReSpect database^24^. However, even these resources face challenges in terms of their comprehensiveness and applicability to a wider range of plant species, specifically Indian medicinal plants listed in the Ayurvedic Pharmacopoeia of India and their substitutes. The integration of high-performance liquid chromatography and high-resolution mass spectrometry has enabled large-scale metabolite identification with increased sensitivity, resolution, and accuracy^25^.
Several plant secondary metabolites can serve as therapeutic agents for treating diseases^26^. For example, plant bioactive compounds such as Taxol and digoxin are used to treat cancer and heart disease, respectively^27^. Recently, various databases, such as the Bioinformatics Analysis Tool for Molecular Mechanism of TCM, Traditional Chinese Medicine Systems Pharmacology (TCMSP), TCMAnalyzer, TCM@Taiwan, and TCM-Suite databases, have provided information on therapeutic targets and diseases related to TCM plants^28–32^. However, the exploration of bioactive compounds from traditional Indian medicinal plants and their therapeutic targets associated with human diseases is lacking. An integrated network pharmacology approach can help elucidate the therapeutic potential of bioactive compounds in treating various human diseases^33^. Furthermore, an online platform to decipher the biological role of secondary metabolites that can act as lead molecules will enhance the drug discovery process and aid in the treatment of life-threatening diseases.
To address these research gaps, we developed a user-friendly, expandable data resource, the CDMMM (Comprehensive Database of Medicinal Plants, Molecular Markers, and Metabolite Fingerprinting), which provides information on traditional Indian medicinal plants. Advanced search and browsing options allow users to explore information on medicinal plants, DNA barcode sequences, annotated plant metabolites, and potential therapeutic targets associated with various diseases.
Results
CDMMM database contents and structure
The CDMMM is an open-access, expandable online platform comprising a medicinal plant species module, a reference DNA barcode module, a plant metabolome module, and a molecular target-associated human disease module. All these data modules were integrated into an interactive CDMMM database built with Hypertext Markup Language (HTML), Hypertext Preprocessor (PHP), JavaScript, and MySQL. A schematic representation of the CDMMM database architecture is shown (Fig. 1).
Fig. 1. Schematic workflow of the methodology employed in this study.
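The four-module layout described above could be mirrored in a relational schema along the following lines. This is a minimal sketch and not the actual CDMMM schema: it uses Python's built-in `sqlite3` in place of the MySQL backend, and every table name, column name, and helper function (`build_schema`, `compounds_for_disease`) is hypothetical. It also assumes, for simplicity, that each metabolite record belongs to a single plant. The rows in the demo are placeholders, not database entries.

```python
import sqlite3

# Hypothetical schema: one table per CDMMM module, linked by foreign keys.
SCHEMA = """
CREATE TABLE plant (
    plant_id TEXT PRIMARY KEY,          -- identifier in the style 'PL0021'
    botanical_name TEXT NOT NULL
);
CREATE TABLE barcode (
    accession TEXT PRIMARY KEY,         -- GenBank accession number
    plant_id TEXT NOT NULL REFERENCES plant(plant_id),
    marker TEXT NOT NULL,               -- nrITS, psbA-trnH, rbcL, trnL-trnF, matK
    sequence TEXT NOT NULL
);
CREATE TABLE metabolite (
    compound_id TEXT PRIMARY KEY,       -- identifier in the style 'CDMMM01764'
    plant_id TEXT NOT NULL REFERENCES plant(plant_id),
    compound_name TEXT
);
CREATE TABLE target_disease (
    compound_id TEXT NOT NULL REFERENCES metabolite(compound_id),
    gene_name TEXT NOT NULL,
    disease_name TEXT NOT NULL
);
"""


def build_schema() -> sqlite3.Connection:
    """Create the sketch schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def compounds_for_disease(conn: sqlite3.Connection, disease: str) -> list:
    """Return compound IDs with at least one target linked to the disease."""
    rows = conn.execute(
        "SELECT DISTINCT m.compound_id FROM metabolite m "
        "JOIN target_disease t ON t.compound_id = m.compound_id "
        "WHERE t.disease_name = ? ORDER BY m.compound_id",
        (disease,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_schema()
    # Placeholder rows for illustration only.
    conn.execute("INSERT INTO plant VALUES ('PL9999', 'Placeholder species')")
    conn.execute(
        "INSERT INTO metabolite VALUES ('CDMMM99999', 'PL9999', 'placeholder compound')"
    )
    conn.execute(
        "INSERT INTO target_disease VALUES ('CDMMM99999', 'GLP1R', 'diabetes mellitus')"
    )
    print(compounds_for_disease(conn, "diabetes mellitus"))
```

Keeping the disease link in its own table reflects the many-to-many relationship between compounds and targets that the disease module describes.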
The medicinal plant species module comprises information on 67 highly traded Indian medicinal plant species and their substitutes used in drug formulations, which were collected from different geographical locations (Fig. 2, Supplementary Table S1). The species page provides the taxonomical information of each plant species obtained from the NCBI Taxonomy database, together with vernacular names, medicinal parts used in the preparation of herbal formulations, plant descriptions, distributions, and medicinal uses drawn from the Ayurvedic Pharmacopoeia of India and other research journals (Supplementary Table S1).
Fig. 2. Geographical sampling sites of selected medicinal plants from Karnataka, Gujarat, and Tamil Nadu states in India. The sampling map was created in QGIS v3.34 [https://qgis.org/].
The DNA barcode module provides the sequences of the barcode regions amplified from the 67 plant species. The DNA barcode sequence database consists of 76 nrITS barcodes from 54 plant species that were successfully amplified and sequenced, 8 psbA-trnH barcodes from eight plant species, 2 rbcL barcodes from two species, 2 trnL-trnF barcodes from two species, and 1 matK barcode from Syzygium cumini (Supplementary Figs. S1–S5, Supplementary Table S2). The phylogenetic relationships of the 89 universal DNA barcodes are illustrated (Fig. 3). The phylogenetic analysis revealed that species from different markers were placed in the appropriate clades, with strong support (bootstrap support ≥ 70%). Currently, the DNA barcode reference library module provides the following information: i) specimen ID as the GenBank accession number, ii) specimen voucher, iii) georeferenced data, iv) universal marker used, and v) DNA barcode sequence of each sample.
Fig. 3. Phylogenetic reconstruction of DNA barcodes from universal markers with RAxML v8.0 software. The DNA barcodes with the nrITS marker are represented in the green clade, the rbcL marker is represented in yellow, the trnL-trnF marker is represented in blue, matK is represented in red, and the psbA-trnH marker is represented in pink. The outside labels were color-coded according to the DNA barcode family.
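As a quick consistency check, the per-marker counts quoted above can be summed and compared with the reported totals of 89 barcodes and 67 species. The sketch below uses only figures from the text; the species total matches only under the assumption that no species is counted under more than one marker, which this check does not verify.

```python
# Barcode and species counts per marker, as reported in the text.
BARCODES = {"nrITS": 76, "psbA-trnH": 8, "rbcL": 2, "trnL-trnF": 2, "matK": 1}
SPECIES = {"nrITS": 54, "psbA-trnH": 8, "rbcL": 2, "trnL-trnF": 2, "matK": 1}


def totals():
    """Return (total barcodes, total species) summed over all markers."""
    return sum(BARCODES.values()), sum(SPECIES.values())


total_barcodes, total_species = totals()
print(total_barcodes, total_species)  # 89 67
```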
The reference plant metabolome was constructed from 20 highly traded plant species (ten authentic and ten substitutes) via mass spectral data obtained from Agilent HPLC-QTOF-MS/MS and Waters UPLC-QTOF-MS^E^ platforms. The raw spectral data from both systems for the plant species were preprocessed and filtered into high-quality features via MS-DIAL v4.9 [https://systemsomicslab.github.io/compms/msdial/main.html]. The unique MS2 spectral features obtained in the ESI positive mode of ionization were annotated with compound spectral libraries present in MS-Finder v3.60 [https://systemsomicslab.github.io/compms/msfinder/main.html]. A total of 3033 unique annotated metabolites were obtained from 20 plant species (Supplementary Table S3). Furthermore, these annotated metabolites were classified into 14 superclasses and 181 compound classes via the ClassyFire webserver [https://classyfire.wishartlab.com/] (Fig. 4). The majority of annotated metabolites are prenol lipids, flavonoids, organooxygen compounds, carboxylic acids and derivatives, fatty acyls, and steroids and steroid derivatives. Currently, the plant reference metabolome module provides mass spectrometry data, identifiers, and compound classification data for each annotated metabolite from the corresponding plant species. The module is also cross-linked to external databases, such as HMDB [https://www.hmdb.ca/], KNApSACK [https://www.knapsackfamily.com/knapsack_core/top.php], FooDB [https://foodb.ca/], DrugBank [https://go.drugbank.com/], PubChem [https://pubchem.ncbi.nlm.nih.gov/], PlantCyc v16.0.3 [https://pmn.plantcyc.org/cpd-search.shtml], and CoconutDB [https://coconut.naturalproducts.net/], for easy navigation and collection of related information. ADMET properties were calculated via the QikProp module for 3,028 compounds; the remaining five could not be computed because of technical errors in the software or errors in the input structures. Among the 3,028 compounds, 77.67% of the pharmacokinetic properties fell within the 95% acceptable range for recognized drugs, whereas 1114 compounds presented #stars = 0 (Supplementary Fig. S6).
Fig. 4. Classification of tentatively annotated metabolites into major superclasses and classes. The inner ring represents the superclass, and the outer ring represents the classes identified via the ClassyFire webserver.
Human protein targets were predicted for the annotated metabolites using target prediction resources such as SwissTargetPrediction [https://www.swisstargetprediction.ch/], the Similarity Ensemble Approach prediction tool, and the Therapeutic Target Database. A total of 1414 predicted targets were obtained for 2685 tentative metabolites. These predicted targets were categorized as clinical trial (434), successful (408), literature-reported (407), patent-recorded (123), and preclinical (21) targets, with 21 listed in other target categories (Supplementary Table S4). Furthermore, the analysis identified 931 predicted targets associated with 441 diseases that belong to 25 ICD-11 disease codes (Supplementary Table S5). Currently, the target-associated human disease module provides the following information: (i) target name, (ii) UniProt ID, (iii) gene name, (iv) disease name, and (v) Therapeutic Target Database ID.
Database functionalities and web interface
The CDMMM database provides a user-friendly interface for users to retrieve relevant information via search or browsing options (Fig. 5). The search option in the CDMMM interface allows users to search three fields: i) medicinal plants (botanical names), ii) plant metabolites (compound names), and iii) therapeutic targets (human diseases). Additionally, a quick-browsing option is provided in the interface with navigation tabs named ‘Medicinal Plants’, ‘Compounds’, ‘Therapeutic targets’, and ‘BLAST’.
Fig. 5. The graphical user interface of the CDMMM database for searching and browsing functions.
The navigation menu ‘Medicinal Plants’ allows users to select the medicinal plants being used in drug formulation, and details of selected medicinal plants, such as taxonomy, synonyms, vernacular names, habits, habitat, morphology, distribution, and medicinal uses, are shown (Fig. 6A). The users can also access the plant–metabolite associations and DNA barcode sequence for each plant species by clicking the relevant tab options on the medicinal information page (Fig. 6C). The analytical method information of the selected plant species is listed on the ‘analytical method’ page under the plant–metabolite association tab. The ‘compounds’ page provides the annotated metabolites of reference plant metabolomes in tabular format. Furthermore, the pharmacokinetic properties of each annotated metabolite are provided on the information page of the compound. The user can access the detailed information of each annotated metabolite by browsing the hyperlink provided (Fig. 6B). Users can also view the details of the targets associated with each annotated metabolite and human disease by clicking the relevant tab options on the compound’s information page (Fig. 6D and Fig. 6E). The page to browse ‘therapeutic targets’ provides a list of therapeutic targets of the reference metabolome and associated diseases in tabular format. The ‘BLAST’ option is provided to perform a similarity search of nucleotide sequences against the reference DNA barcode library in the CDMMM database. This helps users accurately identify the medicinal plants used in drug formulations. We performed several case studies to demonstrate the accurate usage and application areas of the online platform.
Fig. 6. CDMMM database output with selected medicinal plants: (A) medicinal plant information, (B) phytocompound information, (C) DNA barcode information, (D) compound-target information, and (E) disease-associated target information.
Database utility
Case study 1: Validation of raw drugs via the CDMMM reference DNA barcode library
A total of 47 raw drugs from selected medicinal plants were collected from different marketplaces in India (Supplementary Fig. S7). DNA was isolated from 44 raw materials, and the psbA-trnH region was sequenced from three samples. These sequences were then aligned with the reference sequence data provided by the database. The analysis revealed that 24 samples shared significant sequence homology with authentic medicinal plants. However, 8 samples presented sequence homology with the substitutes/adulterants, and the remaining 15 raw drugs could not be amplified or sequenced with universal markers. All the DNA barcode regions amplified and sequenced from the raw drugs were submitted to GenBank. The accession numbers of the barcode regions submitted to GenBank are listed (Supplementary Table S6). The comprehensive platform can be effectively utilized to differentiate medicinal plants from their substitutes/adulterants via sequence homology measures.
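The decision logic behind this case study can be illustrated with a small sketch. It is an assumption-laden simplification, not the authors' procedure: the function `classify_sample`, its labels, and the placeholder species names are hypothetical, and the best reference hit is taken as given rather than computed. Only the outcome counts (24, 8, and 15 of 47) come from the text.

```python
from collections import Counter


def classify_sample(best_hit, authentic, substitutes):
    """Label one raw-drug sample from its best reference hit (None = no sequence)."""
    if best_hit is None:
        return "not amplified/sequenced"
    if best_hit == authentic:
        return "authentic"
    if best_hit in substitutes:
        return "substitute/adulterant"
    return "unresolved"


# Placeholder samples: (best hit, expected authentic species, known substitutes).
samples = [
    ("Species A", "Species A", {"Species B"}),
    ("Species B", "Species A", {"Species B"}),
    (None, "Species A", {"Species B"}),
]
tally = Counter(classify_sample(*sample) for sample in samples)
print(dict(tally))

# Outcome counts reported in the text account for all 47 raw drugs.
REPORTED = {"authentic": 24, "substitute/adulterant": 8, "not amplified/sequenced": 15}
print(sum(REPORTED.values()))  # 47
```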
Case study 2: Network pharmacology analysis of diabetes mellitus via CDMMM therapeutic targets and a metabolome library
To explore the potential use of the CDMMM reference metabolome and therapeutic target database, we performed network analysis of diabetes mellitus. Through network pharmacology analysis, we identified potential lead compounds that can target the hub proteins of diabetes mellitus. In the therapeutic target disease module, 54 potential targets were identified for diabetes mellitus. These 54 potential receptors were targeted by 1219 compounds from the CDMMM reference metabolome library (Supplementary Table S7). Using the stringApp plugin in Cytoscape v3.10, a protein‒protein interaction network was constructed, and four candidate gene targets, namely, GCG, DPP4, GHRL, and GLP1R, were selected on the basis of degree (Supplementary Fig. S8). KEGG pathway enrichment analysis of the identified hub genes was performed with the enrichR package [https://CRAN.R-project.org/package=enrichR] and the KEGG_Human pathway library^34–36^. The analysis revealed that the neuroactive ligand‒receptor interaction, insulin secretion, and cAMP signalling pathways were enriched (Supplementary Fig. S9).
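Selecting hub genes "on the basis of degree" amounts to counting each node's distinct interaction partners and keeping the highest-ranked nodes. The sketch below shows that idea in plain Python rather than Cytoscape; the edge list uses placeholder node names (`P1` to `P6`) and is not the STRING network from the study, and the function names and alphabetical tie-break are assumptions made for illustration.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count distinct neighbours per node in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}


def top_hubs(edges, k):
    """Return the k highest-degree nodes, ties broken alphabetically."""
    degrees = node_degrees(edges)
    return sorted(degrees, key=lambda node: (-degrees[node], node))[:k]


# Toy network with placeholder node names; not data from the study.
toy_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"),
    ("P2", "P3"), ("P2", "P4"), ("P3", "P6"),
]
print(top_hubs(toy_edges, 2))  # ['P1', 'P2']
```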
Furthermore, molecular docking analysis was performed via the Schrödinger suite 2023-3 to identify potential lead compounds against diabetes mellitus-related candidate gene targets. Molecular docking analysis revealed that the compounds (CDMMM02038, CDMMM01764, and CDMMM01482) obtained from herbs (Curcuma oligantha (PL0021) and Cinnamomum tamala (PL0035)) on the basis of the entries in our database had binding affinities of −15.2261 kcal/mol, −12.8619 kcal/mol, and −11.3532 kcal/mol, respectively, against the GLP1R gene target compared with other hub targets (DPP4, GCG, and GHRL). Similarly, the compounds (CDMMM02435 and CDMMM01934) obtained from herbs (Curcuma longa (PL0020) and Asparagus gonoclados (PL0040)) presented relatively high binding affinities of approximately −13.4863 kcal/mol and −11.27 kcal/mol, respectively, against the target DPP4 (Supplementary Table S8). By combining molecular docking and MM-GBSA analysis results, the compound Protoprimulagenin A 3-[rhamnosyl-(1->4)-rhamnosyl-(1->4)-[rhamnosyl-(1->2)]-glucosyl-(1->?)-glucuronide] (CDMMM01764) had a relatively high dock score (−12.8619 kcal/mol) and binding free energy (dG Bind: −61.52 kcal/mol) against the GLP1R target. Similarly, the compound 1-O-(E)-Caffeoyl-4,6-(S)-HHDP-beta-D-glucopyranose (CDMMM01934) had a relatively high dock score (−11.27 kcal/mol) and binding free energy (dG Bind: −53.39 kcal/mol) against the target DPP4. The 3D representations of the ligand‒protein complexes (Complex 1: CDMMM01764 + GLP1R and Complex 2: CDMMM01934 + DPP4) are shown (Supplementary Fig. S10A and Supplementary Fig. S10B). Similarly, the 2D representations of complex 1 and complex 2 are shown (Supplementary Fig. S10C and Supplementary Fig. S10D). Molecular dynamics (MD) simulation analysis was performed for these ligand‒protein complexes for 100 ns via the Desmond module of the Schrödinger suite 2023-3 (Schrödinger 2023-3, LLC, New York). Both protein‒ligand complexes showed stable ligand binding to the protein Cα backbone with few fluctuations, as shown (Supplementary Fig. S10E and Supplementary Fig. S10F).
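The docking scores quoted above can be ordered per target with a few lines of Python, treating a more negative score as stronger predicted binding. This is only an illustrative sketch: the function name `rank_by_target` is hypothetical, and ranking by dock score alone does not reproduce the final selection, which the text bases on docking combined with MM-GBSA results.

```python
# (compound ID, target, dock score in kcal/mol), as quoted in the text.
DOCKING = [
    ("CDMMM02038", "GLP1R", -15.2261),
    ("CDMMM01764", "GLP1R", -12.8619),
    ("CDMMM01482", "GLP1R", -11.3532),
    ("CDMMM02435", "DPP4", -13.4863),
    ("CDMMM01934", "DPP4", -11.27),
]


def rank_by_target(records):
    """Group compounds by target, most negative dock score first."""
    ranked = {}
    for compound, target, score in sorted(records, key=lambda rec: rec[2]):
        ranked.setdefault(target, []).append((compound, score))
    return ranked


for target, compounds in rank_by_target(DOCKING).items():
    print(target, compounds)
```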
Discussion
In the present study, 67 Indian medicinal plants and their commonly used substitutes, which are traded for the treatment of various diseases, were collected for analysis. Molecular authentication was carried out via DNA barcoding with universal markers to accurately identify the selected medicinal plants at the species level. Comparative metabolite fingerprinting analysis was performed on the top 10 authentic and substitute plant species to identify unique metabolite fingerprints and facilitate downstream network pharmacology analysis. We also developed a comprehensive database of medicinal plants, molecular markers, and metabolite fingerprints (CDMMM) to share comprehensive information on DNA barcodes and metabolite fingerprints. The database is intended to support taxonomy and systematics, where it can aid species identification and help resolve taxonomic uncertainties, and the pharmaceutical industry, where it can help identify novel compounds with potential therapeutic properties during drug discovery. To the best of our knowledge, this is the first comprehensive digital platform providing a database of DNA barcodes and metabolite fingerprints for Indian medicinal plants.
Several credible databases exist, such as the Indian medicinal plant database^37^, India Biodiversity Portal^38^, OSADHI^39^, and eFloraofIndia^40^, which provide information on vernacular names, taxonomic classification, synonyms, medicinal uses, plant collection sites, and herbarium images of medicinal plants from India. The CDMMM database provides, for 67 medicinal plants collected from the Karnataka, Tamil Nadu, and Gujarat states of India, information on habit, habitat, plant part used, substitute plant used, synonyms, vernacular names, taxonomical classification, plant collection sites, morphological description, and medicinal uses, together with images of the plant species and their herbarium voucher specimens, to enhance the traditional knowledge of medicinal plants growing in India.
Accurate species-level identification of medicinal plants is urgently needed to ensure the safety and efficacy of raw herbal drugs. The use of plant DNA barcoding has emerged as a reliable and widely accepted method for the authentication of herbal materials and the identification of medicinal plant species^41,42^. Among the various barcode loci evaluated, studies have consistently demonstrated that the internal transcribed spacer (ITS) region has greater species-level discriminatory power than commonly used plastid markers such as rbcL and matK^43,44^. Several countries have made significant progress in integrating DNA barcoding into regulatory frameworks. Reference plant DNA barcode libraries for Chinese Materia Medica were successfully established and adopted in the Chinese pharmacopoeia^45^. DNA barcode testing methods have been adopted in the Korean herbal pharmacopoeia^46^ and the Japanese and British Pharmacopoeias^47^. At present, for Indian medicinal plants, the Ayurvedic Pharmacopoeia of India reference DNA barcode library (API-RDBL) was developed for 374 medicinal plants with rbcL sequences^48^. The BRM DNA barcode library was developed for 187 Indian medicinal plants with rbcL and ITS2 barcode sequences^49^. However, there is no single platform that covers common loci for the identification of all medicinal plants^50^. In the present study, molecular authentication of 67 medicinal plant species was performed primarily via the ITS marker (54 species) because of its relatively high variability and species-level resolution, with plastid markers used for the remaining 13 species. We also developed a BLAST interface in the CDMMM database for nucleotide sequence similarity searches against reference DNA barcode sequences.
The advancement of high-throughput analytical techniques has made LC‒MS/MS-based untargeted metabolomics a common approach for metabolite identification, leveraging public spectral libraries and databases^51^. While comprehensive databases exist for the human metabolome, the number of dedicated databases specifically focused on plant-specialized metabolites is significantly limited^52^. Several databases, such as Phytochemica^53^, IMPAAT^54^, and IMPDB^55^, provide curated metabolite information on Indian medicinal plants on the basis of published literature sources. The use of in silico spectral annotation methods is increasing because few reference spectra are publicly available^56^. Recent databases, such as RefMetaPlant and PMHub 1.0, exemplify this trend by incorporating in silico prediction tools to expand plant-specific metabolome coverage and improve metabolite annotation^23,57^. In the present study, LC‒MS/MS-based untargeted metabolite fingerprinting was performed for 20 top-traded Indian medicinal plants, and in silico spectral annotation was performed via MS-DIAL integrated with MS-Finder software. These annotated metabolites can be useful for identifying biological targets and treating diseases.
To identify the biological ingredients of traditional Chinese medicine (TCM) and perform downstream network pharmacology analysis, the Watson–Suite database was integrated into the TCM-Suite platform^32^. In the present study, we predicted the associations between annotated plant metabolites and disease targets on the basis of information retrieved from the Therapeutic Target Database. The drug-like and ADME properties of the annotated metabolites are also provided in the CDMMM database. This information may be useful for identifying lead compounds via a network pharmacology approach in the drug discovery process.
In summary, the CDMMM database is a user-friendly, expandable, comprehensive database that provides information on the traditional medicinal plants used in drug formulations. This study generated DNA barcodes and accessions of authentic medicinal plants and their substitutes and provided an option for searching for sequence similarity. The established regional DNA barcode reference library for 67 medicinal plants can act as a platform for accurate medicinal plant identification. The developed plant reference metabolome library for the top 20 traded Indian medicinal plants and their substitutes provides an invaluable resource for plant-derived secondary metabolite-based therapy. Pharmacokinetic evaluation indicated that these metabolites have drug-likeness potential against human disease-associated targets, supporting further identification of suitable lead‒target complexes. The CDMMM database can also aid in identifying and validating novel lead compounds for drug discovery and repurposing. The generated data and database facilitate the identification and translation of medicinal plants into therapeutic drugs. However, the CDMMM database should be further expanded to incorporate additional closely related substitute species, biosynthetic pathways associated with metabolites, and functional enrichment tools for user-defined bioactive compounds and predicted therapeutic targets. These enhancements would enable a deeper understanding of the pharmacological mechanisms of action.
Materials and methods
Selection and collection of medicinal plants
The Indian medicinal plants listed in part ‘B’ of the Ayurvedic Formulary of India for plant-based drugs were selected. The original and substitute source information for drug formulation was retrieved from Ayurvedic classical texts and other published literature from PubMed. The selected medicinal plants were collected from the states of Karnataka, Tamil Nadu, and Gujarat in India. Formal identification of the selected medicinal plants was performed with the help of renowned plant taxonomists, the late Dr. K. G. Bhat and Dr. Radhakrishna Rao, on the basis of morphological and floral characteristics as described in the ‘Flora of Udupi’ and ‘Flora of South Canara’ literature and the ‘World Flora Online’ database^58,59^. Permission was obtained for the collection of plant material. Herbarium voucher specimens were prepared and authenticated with the help of taxonomists. These voucher specimens were verified by the Scientific Officer, Pilikula Herbarium, Pilikula Development Authority, Mangaluru, Karnataka, India, which is recognized by Index Herbariorum (acronym: PND), and a herbarium voucher specimen number was assigned to each of the selected medicinal plants (Supplementary_material-3). The fresh roots, stem bark, seeds, and flower buds for each sample were immersed in liquid nitrogen and stored frozen for long-term preservation. The germplasm was preserved in a greenhouse at the Manipal School of Life Sciences, MAHE, for future use. Information such as plant parts, habits, habitats, synonyms, vernacular names, morphological descriptions, distributions, and therapeutic uses was retrieved from scientific journals, ayurvedic books, and credible databases^60^.
DNA barcoding
Fresh tissue from each plant species was pulverized with liquid nitrogen. The genomic DNA of the fresh leaf tissue was isolated via a modified CTAB protocol^61^. The universal nuclear internal transcribed spacer region was amplified in 54 plant species, and 13 plant species were amplified with other chloroplast markers (psbA-trnH, rbcL, matK, and trnL-trnF). PCRs were conducted in a 15 µl reaction volume. The reaction mixture consisted of 5 µl of 2X PCR master mix (Maxome Labsciences, India), 1 µl of each forward and reverse primer (10 µM), and 1 µl of plant DNA template (50 ng/µl), with sterile deionized water added to a final volume of 15 µl. The primer information and PCR conditions for DNA amplification with different universal markers are listed (Supplementary Table S9). After amplification, the PCR products were examined via 1.2% agarose gel-based electrophoresis. The PCR products were cleaned via a HighPrep PCR Clean-up Kit (MagBio Genomics, USA). The purified products were then sequenced via an automated Genetic Analyser 3500 (Applied Biosystems, USA). The generated DNA barcodes were aligned against the NCBI nucleotide database via the MegaBLAST tool^62^. For each species, the hit with the highest sequence similarity was selected, requiring greater than 97% query coverage and percent identity. We submitted all the DNA barcode regions amplified in this study to the GenBank database. For the phylogenetic analysis, these reference DNA barcode sequences were aligned with the MUSCLE algorithm in MEGA 11^63^, and a maximum likelihood phylogeny was inferred with the GTR + G substitution model in RAxML v8.0^64^.
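The hit-selection rule described above (greater than 97% query coverage and percent identity, then the most similar hit) can be sketched as a small filter. This is an illustration under stated assumptions rather than the authors' script: the `Hit` record, the `best_hit` function, and the use of query coverage as a tie-break are hypothetical, and the example hits are placeholders rather than MegaBLAST output from the study.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Hit:
    """One alignment hit; field names are hypothetical."""
    species: str
    percent_identity: float
    query_coverage: float


def best_hit(hits: Sequence[Hit], threshold: float = 97.0) -> Optional[Hit]:
    """Return the most similar hit above both thresholds, or None."""
    passing = [
        hit for hit in hits
        if hit.percent_identity > threshold and hit.query_coverage > threshold
    ]
    if not passing:
        return None
    return max(passing, key=lambda hit: (hit.percent_identity, hit.query_coverage))


# Placeholder hits for illustration only.
hits = [
    Hit("Species A", 99.5, 100.0),
    Hit("Species B", 98.1, 99.0),
    Hit("Species C", 99.9, 80.0),  # high identity but insufficient coverage
]
print(best_hit(hits))
```

Requiring both thresholds prevents a short, high-identity local match (such as the third placeholder hit) from being reported as the species assignment.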
Comparative metabolite fingerprinting of authentic and substitute plants
Sample preparation: The top 10 traded authentic medicinal plants and their 10 substitute plants were ground with liquid nitrogen. Then, 100 mg of the frozen powder was weighed and mixed with 1 mL of extraction buffer solution (75% aqueous methanol and 0.1% formic acid), which had been prechilled on ice. The samples were vortexed and subsequently sonicated in a water bath maintained at ambient temperature for 15 min. The sonication was conducted at a maximum frequency of 40 kHz. Subsequently, centrifugation was performed at a rotational speed of 12,000 RPM for 15 min. The supernatant was filtered through a 0.22 µm sterile filter (Merck, India). The filtrate was then lyophilized, and the yield was measured. The extract was dissolved in 1 ml of 100% methanol for LC‒MS/MS analysis.
Metabolite fingerprinting via mass spectrometry techniques: LC‒MS analysis of the plant species was performed on two analytical platforms: (i) an Agilent 6520 quadrupole time-of-flight (QTOF) liquid chromatography‒mass spectrometry (LC‒MS) system coupled with a 1200 HPLC system, and (ii) a Waters Acquity UPLC H-Class Plus Bio system coupled with a Xevo G2-XS QTOF system. Both platforms used an electrospray ionization (ESI) source and the same mobile phases, 0.1% formic acid in deionized water (A) and 0.1% formic acid in acetonitrile (B), with a gradient program for separation (Supplementary Table S10). The flow rate was maintained at 0.5 ml/min, and LC‒MS and MS‒MS analyses were performed in positive ESI mode over a mass range of m/z 50–1500. The ion source parameters for the Agilent 6520 LC-ESI-QTOF system were as follows: nitrogen gas temperature, 350 °C; nebulizer pressure, 35 psig; capillary voltage, 3 kV; and skimmer voltage, 65 V. Those for the Waters UPLC-HRMS-MS^E^ system were as follows: capillary voltage, 3 kV; cone voltage, 40 V; cone gas flow, 50 L/hr; and desolvation gas flow, 1000 L/hr. On the Agilent 6520 system, the collision energy was ramped from 10 to 60 eV for data-dependent acquisition in the auto MS/MS mode. On the Waters UPLC-HRMS-MS^E^ system, data-independent acquisition used 6 eV for the low collision energy channel and 20–40 eV for the high collision energy channel. Lock mass correction was performed with leucine enkephalin (m/z 556.2771) in positive ionization mode. The mass spectrometry data were acquired via Agilent MassHunter B.07.00 and MassLynx v4.1 workstations for the Agilent 6520 and Waters UPLC-HRMS-MS^E^ systems, respectively.
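To make the lock mass step concrete, the following Python sketch computes the mass error in ppm against the leucine enkephalin reference (m/z 556.2771) and applies a simple single-point rescaling. It is an illustrative simplification only: the acquisition software performs its own correction internally, and the multiplicative rescaling shown here is an assumption rather than the vendor's algorithm. The observed value is hypothetical.

```python
# Reference [M+H]+ m/z of leucine enkephalin, as quoted in the text.
LEU_ENK_MZ = 556.2771


def ppm_error(observed_mz, reference_mz=LEU_ENK_MZ):
    """Signed mass error of an observed m/z, in parts per million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def apply_lock_mass(mz_values, observed_lock_mz, reference_mz=LEU_ENK_MZ):
    """Rescale m/z values so the observed lock mass maps onto the reference.

    Single-point multiplicative correction; a simplifying assumption.
    """
    factor = reference_mz / observed_lock_mz
    return [mz * factor for mz in mz_values]


# Hypothetical drifted reading of the lock mass ion.
observed = 556.2799
drift_ppm = ppm_error(observed)
corrected = apply_lock_mass([observed, 303.0500], observed)
```

After correction, the lock mass ion itself reads back as the reference value, and every other m/z is shifted by the same relative amount.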
Data processing and prediction-based metabolite annotation: The Agilent .d files and Waters MS^E^ raw fragmented channel data files were converted to .mzML format via the MSConvert tool in ProteoWizard [https://proteowizard.sourceforge.io/download.html]. The generated .mzML files were processed via MS-DIAL v4.9.2^65^ with the following project settings: for data-dependent MS/MS data, soft ionization, chromatography, conventional LC/MS, profile data, and positive ion mode; for data-independent MS^E^ data, soft ionization, chromatography, SWATH-MS or the conventional all-ions method with an experiment file (Supplementary Table S11), centroid data, and positive ion mode. The data processing parameters, including mass accuracy, peak detection, deconvolution, identification, adduct ion settings, and alignment parameter settings, are described in Supplementary Table S11. The peak lists obtained from MS-DIAL v4.9.2 were exported to MS-FINDER v3.60^66^ for metabolite annotation. Molecular formula and structure prediction were performed by matching experimental spectra with in silico predicted spectra from the natural product databases integrated in MS-FINDER v3.60 (HMDB, LIPID MAPS, FooDB, PlantCyc, NANPDB, COCONUT, KNApSAcK, PubChem, and UNPD). The precursor ion mass tolerance and MS/MS fragment mass tolerance were both set to 0.05 Da. Structures with MS/MS similarity scores greater than 5 were considered accurately identified. Automated chemical classification of the annotated metabolites was performed via the ClassyFire web server^67^.
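The two acceptance criteria stated above (a 0.05 Da mass tolerance and a similarity score greater than 5) amount to a simple filter over candidate structures. The Python sketch below shows that logic in isolation; it does not reproduce MS-FINDER's scoring, and the `Candidate` record, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass

MASS_TOLERANCE_DA = 0.05  # precursor tolerance quoted in the text
MIN_SCORE = 5.0           # candidates must score strictly above this


@dataclass
class Candidate:
    """Hypothetical record for one candidate structure."""
    name: str
    theoretical_mz: float
    score: float


def within_tolerance(observed_mz, theoretical_mz, tol=MASS_TOLERANCE_DA):
    """True if the absolute mass difference is within the tolerance."""
    return abs(observed_mz - theoretical_mz) <= tol


def accept_candidates(observed_mz, candidates):
    """Keep candidates that pass both criteria, best score first."""
    kept = [
        c for c in candidates
        if within_tolerance(observed_mz, c.theoretical_mz) and c.score > MIN_SCORE
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Placeholder candidates for a precursor observed at m/z 303.0500.
candidates = [
    Candidate("candidate_A", 303.0499, 7.2),  # passes both criteria
    Candidate("candidate_B", 303.2000, 8.0),  # outside mass tolerance
    Candidate("candidate_C", 303.0600, 4.1),  # score too low
]
accepted = accept_candidates(303.0500, candidates)
```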
Molecular target prediction for annotated metabolites
The targets of metabolites from 20 medicinal plants (10 authentic and 10 substitute) were predicted via SwissTargetPrediction^68^ with a probability score greater than 0.1 and via the Similarity Ensemble Approach (SEA) web server^69^ with a p value < 0.05. The disease and ICD-11 code information for each protein target was retrieved from the Therapeutic Target Database^70^. Human oral bioavailability and ADMET properties for each tentative metabolite were predicted with the QikProp module in the Schrödinger suite 2023-3 (Schrödinger, LLC, New York).
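As an illustration of the two cutoffs above, the following Python sketch filters exported prediction tables and records which tool supports each retained target. The input layout (pairs of target name and score) and the placeholder target names are assumptions. The text does not say whether the two tools' results were merged or intersected, so the sketch keeps the supporting tools visible and leaves that choice to the reader.

```python
def filter_targets(swiss_rows, sea_rows, min_probability=0.1, max_p_value=0.05):
    """Map each retained target to the set of tools that support it.

    swiss_rows: iterable of (target, probability) pairs
    sea_rows:   iterable of (target, p_value) pairs
    """
    support = {}
    for target, probability in swiss_rows:
        if probability > min_probability:
            support.setdefault(target, set()).add("SwissTargetPrediction")
    for target, p_value in sea_rows:
        if p_value < max_p_value:
            support.setdefault(target, set()).add("SEA")
    return support


# Placeholder predictions for one metabolite.
swiss = [("TARGET_A", 0.45), ("TARGET_B", 0.05)]
sea = [("TARGET_A", 0.01), ("TARGET_C", 0.20), ("TARGET_D", 0.001)]
supported = filter_targets(swiss, sea)

# Targets supported by both tools, if a stricter intersection is wanted.
consensus = sorted(t for t, tools in supported.items() if len(tools) == 2)
```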
Database and web interface development
The CDMMM database was developed on the LAMP stack, a Linux-based open-source software stack comprising Apache v2.4, MySQL v5.7, and PHP v7.2. The interactive web interface was developed using HTML, Bootstrap, CSS, PHP, and JavaScript, and lets users access entries by searching for or browsing botanical names, metabolite names, and target-associated diseases. The plant–compound and compound–target interaction networks were constructed via Cytoscape.js, an open-source JavaScript-based graph library. NCBI BLAST+ v2.2.29 was incorporated into the CDMMM database to search newly sequenced DNA barcode sequences for homology against the 89 reference barcode sequences provided in this database.
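Cytoscape.js takes a graph as a list of elements, where nodes carry a `data.id` and edges carry `data.source` and `data.target`. The Python sketch below shows one way such a list could be assembled from plant–compound and compound–target pairs and serialized to JSON for the browser. It is not the CDMMM implementation (which uses PHP on the server side); the function name, the `kind` field, and the placeholder entries are hypothetical.

```python
import json


def build_elements(plant_compound_pairs, compound_target_pairs):
    """Build a Cytoscape.js-style elements list (nodes first, then edges)."""
    nodes = {}
    edges = []

    def add_node(node_id, kind):
        # "kind" is a hypothetical data field that a stylesheet could select on.
        nodes.setdefault(node_id, {"data": {"id": node_id, "kind": kind}})

    def add_edge(source, target):
        edge_id = f"{source}__{target}"
        edges.append({"data": {"id": edge_id, "source": source, "target": target}})

    for plant, compound in plant_compound_pairs:
        add_node(plant, "plant")
        add_node(compound, "compound")
        add_edge(plant, compound)
    for compound, target in compound_target_pairs:
        add_node(compound, "compound")
        add_node(target, "target")
        add_edge(compound, target)
    return list(nodes.values()) + edges


# Placeholder network: one plant, two compounds, one target.
elements = build_elements(
    [("Plant A", "Compound 1"), ("Plant A", "Compound 2")],
    [("Compound 1", "Target X")],
)
elements_json = json.dumps(elements, indent=2)
```

Deduplicating nodes by identifier keeps a compound that appears in both pair lists as a single node, so the plant–compound and compound–target layers join into one network.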
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary Material 1
Supplementary Material 2
Supplementary Material 3
References
- 1. Facts & Factors. 2024. India Herbal Products Market Size, Share, Growth Analysis Report, 2024–2032. FNF Research. Available from: https://www.fnfresearch.com/india-herbal-products-market.
- 2. Ratnasingham, S. & Hebert, P. D. N. BOLD: The Barcode of Life Data System (www.barcodinglife.org). Mol. Ecol. Notes 7, 355–364, 10.1111/j.1471-8286.2007.01678.x (2007). PMCID PMC1890991, PMID 18784790.
- 3. Bhat, K. Flora of Udupi, Indian Naturalist (Regd.). Vol. 106 (2003).
- 4. Bhat Gopalakrishna, K. Flora of South Canara. Vol. 1 (2014).
