# Exploratory analysis of exhaled volatile organic compounds for binary discrimination between lung cancer, pneumonia, and healthy controls using machine learning

**Authors:** Jing Wang, Haitian Li, Jianshen Yue, Yamei Song, Ning Wang, Wei Guo, Zhigang Cai

PMC · DOI: 10.3389/fmed.2026.1741424 · Frontiers in Medicine · 2026-02-23

## TL;DR

This exploratory study tests whether exhaled volatile organic compound (VOC) profiles, analyzed with machine learning, can separate lung cancer, pneumonia, and healthy controls in pairwise binary comparisons, and reports AUC values of 0.956–0.983.

## Contribution

The study introduces a machine learning-based approach to differentiate lung cancer and pneumonia using exhaled volatile organic compounds.

## Key findings

- Exhaled VOCs showed statistically significant differences between lung cancer, pneumonia, and healthy controls.
- Machine learning models achieved high AUC values (0.956–0.983) in pairwise binary comparisons among the three groups.
- Seven VOCs showed lower concentrations in lung cancer compared to pneumonia in pairwise comparisons.

## Abstract

Lung cancer remains a major cause of cancer-related mortality worldwide, while pneumonia is one of the most prevalent infectious diseases, with acute pneumonia being highly common globally. Despite continuous advancements in diagnostic technology and the successive launch of new anti-infective drugs, the incidence and mortality rates of pneumonia remain high. Exhaled breath volatile organic compounds (VOCs) have been proposed as non-invasive indicators of disease-related metabolic and pathophysiological alterations. Lung cancer and pneumonia often present with similar nodules or consolidation shadows on chest imaging, leading to frequent diagnostic overlap and delays. This uncertainty can cause lung cancer patients to miss the optimal treatment window or result in unnecessary invasive examinations for pneumonia patients. The current gold standard for definitive diagnosis relies on invasive methods, but it has drawbacks such as operational risks, patient discomfort, radiation exposure, and high costs. Therefore, this study was designed as an exploratory, proof-of-concept investigation to examine whether VOC profiles exhibit distinguishable patterns between lung cancer, pneumonia, and healthy individuals using pairwise binary analytical approaches.

Exhaled breath samples were collected from participants with lung cancer (N = 180), pneumonia (N = 228), and healthy controls (N = 180). Samples were analyzed using a micro gas chromatography system coupled with a mass spectrometry detector (micro-GC–MSD). Univariate statistical analyses were performed to screen for VOCs showing differential abundance between groups. Multivariate analyses were subsequently conducted using five machine learning algorithms to evaluate the discriminative performance of VOC-based models in pairwise binary comparisons between lung cancer and healthy controls, pneumonia and healthy controls, and lung cancer and pneumonia.
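The pairwise design described above can be illustrated with a minimal sketch. It assumes a single hypothetical VOC feature with synthetic values (not data from the study) and scores each of the three group pairs with a rank-based AUC. The names `rank_auc`, `groups`, and `pairwise_auc` are illustrative only. The abstract does not name the five algorithms or the univariate test used, so this shows the comparison layout rather than the authors' pipeline.

```python
from itertools import combinations
import random


def rank_auc(pos, neg):
    """AUC as the fraction of (pos, neg) pairs where pos > neg; ties count 0.5."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic values for one hypothetical VOC feature; NOT data from the study.
# Only the group sizes (180, 228, 180) are taken from the abstract.
rng = random.Random(0)
groups = {
    "lung_cancer": [rng.gauss(1.0, 1.0) for _ in range(180)],
    "pneumonia": [rng.gauss(2.0, 1.0) for _ in range(228)],
    "healthy": [rng.gauss(1.5, 1.0) for _ in range(180)],
}

# Three groups yield three pairwise binary comparisons.
pairwise_auc = {
    (a, b): rank_auc(groups[a], groups[b])
    for a, b in combinations(groups, 2)
}

for (a, b), auc in pairwise_auc.items():
    print(f"{a} vs {b}: AUC = {auc:.3f}")
```

An AUC near 0.5 means the feature barely separates the pair, while values near 0 or 1 indicate strong separation in one direction or the other. A multivariate model would replace the single feature with a predicted score per sample, but the pairwise evaluation structure stays the same.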

Multiple VOCs demonstrated statistically significant differences between groups, although substantial overlap in distributions was observed. Compared with healthy controls, three VOCs (heptane; propane, 1-(methylthio)-; and styrene) showed lower levels and two VOCs (2-hexanone, 6-hydroxy-; and o-xylene) showed higher levels in the lung cancer group. In the pneumonia group, six VOCs (1,4-pentadiene, toluene, butyl acetate, p-xylene, D-limonene, and isobutyl nonyl carbonate) were elevated, while one VOC (heptane, 2,2,4,6,6-pentamethyl-) was reduced compared with healthy controls. In pairwise comparisons between lung cancer and pneumonia, seven VOCs showed lower concentrations in the lung cancer group. The machine learning models achieved area under the receiver operating characteristic curve (AUC) values of 0.980 for lung cancer versus healthy controls, 0.956 for pneumonia versus healthy controls, and 0.983 for lung cancer versus pneumonia.

This exploratory study demonstrates that exhaled breath VOC profiles, analyzed via machine learning, yield statistically distinguishable signals in pairwise comparisons between lung cancer, pneumonia, and healthy individuals. These results provide preliminary evidence that breath analysis could address the critical clinical challenge of differentiating radiographically similar conditions non-invasively. The presented methodology and dataset establish a foundational framework for characterizing disease-specific metabolic signatures. However, the findings remain hypothesis-generating. Definitive evaluation of clinical utility necessitates subsequent studies employing multiclass modeling, validation in independent and prospective cohorts, and direct assessment of diagnostic impact in real-world triage scenarios.

## Linked entities

- **Chemicals:** heptane (PubChem CID 8900), propane, 1-(methylthio)- (PubChem CID 19754), styrene (PubChem CID 7501), 2-hexanone, 6-hydroxy- (PubChem CID 89077), o-xylene (PubChem CID 7237), 1,4-pentadiene (PubChem CID 11587), toluene (PubChem CID 1140), butyl acetate (PubChem CID 31272), p-xylene (PubChem CID 7809), D-limonene (PubChem CID 440917), isobutyl nonyl carbonate (PubChem CID 6420744), heptane, 2,2,4,6,6-pentamethyl- (PubChem CID 26058)
- **Diseases:** lung cancer (MONDO:0005138), pneumonia (MONDO:0005249)

## Full-text entities

- **Genes:** VIP (vasoactive intestinal peptide) [NCBI Gene 7432] {aka PHM27}
- **Diseases:** small cell lung cancer (MESH:D055752), respiratory diseases (MESH:D012140), deaths (MESH:D003643), Wegener's granulomatosis (MESH:D014890), inflammation (MESH:D007249), pulmonary infection (MESH:D012141), HL (MESH:C538324), acute lung injury (MESH:D055371), cough (MESH:D003371), anxiety (MESH:D001007), chest discomfort (MESH:D013898), asthma (MESH:D001249), dyspnea (MESH:D004417), Lung cancer (MESH:D008175), infection (MESH:D007239), pulmonary fungal infections (MESH:D008172), Cancer (MESH:D009369), infectious lung diseases (MESH:D008171), COVID-19 (MESH:D000086382), sarcoidosis (MESH:D012507), chronic obstructive pulmonary disease (MESH:D029424), Pneumonia (MESH:D011014), autoimmune or vascular diseases (MESH:D001327), heart failure (MESH:D006333), non-small cell lung cancer (MESH:D002289), systemic diseases (MESH:D034721), infectious (MESH:D003141), pulmonary dysfunction (MESH:D011660), impaired ventilation (MESH:D053717)
- **Chemicals:** PUFAs (MESH:D005231), caffeine (MESH:D002110), 1,4-pentadiene (-), sulfur (MESH:D013455), ketones (MESH:D007659), (R)-(+)-limonene (MESH:D000077222), Aromatic hydrocarbons (MESH:D006841), n-heptane (MESH:C028618), esters (MESH:D004952), hydrocarbons (MESH:D006838), fatty acids (MESH:D005227), 2-methylbutane (MESH:C067038), toluene (MESH:D014050), nitrogen (MESH:D009584), Butyl acetate (MESH:C006848), lipid (MESH:D008055), p-xylene (MESH:C031286), terpenes (MESH:D013729), pentadiene (MESH:D000466), alkenes (MESH:D000475), heptane (MESH:D006536), phospholipids (MESH:D010743), Water (MESH:D014867), benzene (MESH:D001554), n-nonane (MESH:C017573), propane (MESH:D011407), alpha-pinene (MESH:C005451), styrene (MESH:D020058), ketone bodies (MESH:D007657), VOC (MESH:D055549), organic compounds (MESH:D009930), ROS (MESH:D017382), 2-hexanone (MESH:D008742), alkanes (MESH:D000473), n-decane (MESH:C012867), isoprene (MESH:C005059), alcohol (MESH:D000438)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12968303/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12968303/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12968303/full.md

---
Source: https://tomesphere.com/paper/PMC12968303