# Metabolomic analysis of Yunnan cigar tobacco leaves: impact of geography and climate on flavor characteristics and machine learning-based origin traceability

**Authors:** Yuping Wu, Guijuan Zhao, Yi Li, Guifeng Li, Wenyuan Wang, Lei Yang, Zhonglong Lin, Heng Yao, Fangchan Jiao, Gaokun Zhao, Yongping Li, Guanghai Zhang, Meiwei Zhao, Tao Zhang, Jin Wang

PMC · DOI: 10.3389/fpls.2025.1703429 · Frontiers in Plant Science · 2026-02-18

## TL;DR

This study uses non-targeted metabolomics on 71 cigar tobacco leaf samples to show how Yunnan's geography and climate shape flavor-related metabolites, and applies machine learning to 12 selected biomarkers to trace sample origin.

## Contribution

The paper introduces a machine learning-based method for origin traceability of cigar tobacco using metabolomic biomarkers.

## Key findings

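- Yunnan's high altitude, large diurnal temperature variation, intense ultraviolet radiation, and relative dryness are associated with characteristic metabolic profiles, including enrichment of flavone and flavonol biosynthesis and betalain biosynthesis.
- Machine learning models (LDA, LR, GMM, KNN, SVM) discriminated origin with high accuracy at both the national scale (Yunnan vs. Dominican Republic/Indonesia) and the regional scale (Lincang, Pu’er, Yuxi).
- Twelve key biomarkers, selected with MUVR, supported origin traceability with a median false classification rate of 0.1 over 100 iterations and an AUC close to 1.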

## Abstract

To investigate how Yunnan's distinctive geographical and climatic conditions shape the unique metabolic profile of its cigar tobacco leaves (CTLs), and to establish a reliable method for origin traceability using machine learning, a non-targeted metabolomics analysis was conducted on 71 CTL samples collected from the Dominican Republic, Indonesia, and Yunnan, including Lincang, Pu’er, and Yuxi within Yunnan. A total of 778 highly reliable metabolites were identified. Influenced by Yunnan's high altitude, large diurnal temperature variation, intense ultraviolet radiation, and relative dryness, its CTLs exhibited characteristic metabolic profiles, with significant enrichment in pathways such as flavone and flavonol biosynthesis and betalain biosynthesis. Elevated levels of polyphenols, indoles, jasmonates, carotenoids, and other compounds were linked to Yunnan CTLs' distinct woody, roasted, and astringent flavor profile. Twelve key biomarkers were selected using Multivariate methods with unbiased variable selection in R (MUVR). Machine learning algorithms (LDA, LR, GMM, KNN, and SVM) were applied to these biomarkers, achieving highly accurate origin discrimination across national (Yunnan vs. Dominican Republic/Indonesia) and regional (Lincang, Pu’er, Yuxi) scales. Validation results showed a median false classification rate of 0.1 over 100 iterations and an AUC close to 1, confirming the model's high accuracy and robustness for CTL origin traceability.
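The validation metric quoted in the abstract, a median false classification rate over 100 iterations, can be illustrated with a minimal sketch. This is not the authors' pipeline (they used MUVR in R): it assumes a plain repeated random holdout, a hand-written KNN classifier, and purely synthetic data. The function names, the 30% test fraction, `k=3`, and the class shift are all hypothetical choices made for illustration.

```python
import random
import statistics


def make_synthetic_samples(n_per_class=24, n_features=12, shift=1.5, seed=0):
    """Hypothetical stand-in data: two origins, 12 'biomarker' values per sample."""
    rng = random.Random(seed)
    samples = []
    for label in (0, 1):
        for _ in range(n_per_class):
            # Class 1 is shifted away from class 0 on every feature (assumption).
            features = [rng.gauss(label * shift, 1.0) for _ in range(n_features)]
            samples.append((features, label))
    return samples


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query)),
    )[:k]
    votes = [label for _, label in nearest]
    # With two classes and odd k there are no ties.
    return max(set(votes), key=votes.count)


def misclassification_rates(samples, n_iterations=100, test_fraction=0.3, k=3, seed=1):
    """Repeat a random train/test split and record the test error of each split."""
    rng = random.Random(seed)
    n_test = max(1, int(len(samples) * test_fraction))
    rates = []
    for _ in range(n_iterations):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        errors = sum(knn_predict(train, x, k) != y for x, y in test)
        rates.append(errors / len(test))
    return rates


samples = make_synthetic_samples()
rates = misclassification_rates(samples)
median_rate = statistics.median(rates)
print(f"median misclassification rate over {len(rates)} splits: {median_rate:.3f}")
```

Reporting the median across repeated splits, rather than a single split, reduces the dependence of the estimate on one lucky or unlucky partition, which matters with a sample set as small as the 71 CTLs described above.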

## Linked entities

- **Chemicals:** flavone (PubChem CID 10680), flavonol (PubChem CID 11349), betalain (PubChem CID 56841626), indoles (PubChem CID 139191468), carotenoids (PubChem CID 11227325)

## Full-text entities

- **Genes:** LOC107827378 (protoporphyrinogen oxidase, chloroplastic) [NCBI Gene 107827378] {aka NtPPOX1, PPO, ppxI}
- **Diseases:** metabolic diseases (MESH:D008659), PCA (MESH:C566443), smoking (MESH:D015208), hypoxic (MESH:D002534)
- **Chemicals:** Hydrocinnamic acid (MESH:C035253), Flavonol (MESH:C041477), Sinapinic acid (MESH:C073734), branched chain amino acid (MESH:D000597), alkaloid (MESH:D000470), phenolic acids (MESH:C017616), Fe (MESH:D007501), Ferulic acid (MESH:C004999), Indoles (MESH:D007211), trans-Cinnamic acid (MESH:C029010), Methylcinnamate (MESH:C025385), Kaempferol (MESH:C006552), nicotinamide (MESH:D009536), Carotenoids (MESH:D002338), 1-Acetylaspidoalbidine (MESH:C549511), Benzoin (MESH:D001573), Rutin (MESH:D012431), flavonols (MESH:D044948), carbon (MESH:D002244), 3-Methoxysalicylic acid (MESH:C523953), nitrogen (MESH:D009584), Flavone (MESH:C043562), sugar (MESH:D000073893), 3-Indolepropionic acid (MESH:C095446), alpha-Ionone (MESH:C011879), Gallic acid (MESH:D005707), Zn (MESH:D015032), Monobutyl phthalate (MESH:C028577), quinones (MESH:D011809), Trans-3-Indoleacrylic acid (MESH:C001446), Linoleic acid (MESH:D019787), indole (MESH:C030374), Mg (MESH:D008274), Flavonoids (MESH:D005419), JA (MESH:C011006), Eicosapentaenoic acid ethyl ester (MESH:C035276), Ca (MESH:D002118), NAD (MESH:D009243), Matairesinol (MESH:C068935), Caffeic acid (MESH:C040048), sucrose (MESH:D013395), lipids (MESH:D008055), lignin (MESH:D008031), Polyphenol (MESH:D059808), starch (MESH:D013213), phenols (MESH:D010636), carbohydrate (MESH:D002241), nicotine (MESH:D009538), 3-Hydroxyanthranilic acid (MESH:D015095), Nicotinate (MESH:D009525), isoflavones (MESH:D007529), guaiacol (MESH:D006139), NADP (MESH:D009249), urea (MESH:D014508), alpha-ketoglutarate (MESH:D007656), polyunsaturated fatty acid (MESH:D005231), Anacardic acid (MESH:C088115), Betalain (MESH:D050858), Indole-3-lactic acid (MESH:C024139)
- **Species:** Nicotiana tabacum (American tobacco, species) [taxon 4097], Malus hupehensis (species) [taxon 106556], Chenopodium quinoa (quinoa, species) [taxon 63459], Triticum aestivum (bread wheat, species) [taxon 4565], Vanilla (genus) [taxon 51238], Arabidopsis thaliana (mouse-ear cress, species) [taxon 3702]
- **Mutations:** C-28 C, C-32 C

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12957283/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12957283/full.md

## References

77 references — full list in the complete paper: https://tomesphere.com/paper/PMC12957283/full.md

---
Source: https://tomesphere.com/paper/PMC12957283