# Evaluating few-shot prompting for spectrogram-based lung sound classification using a multimodal language model

**Authors:** Nicholas Dietrich, David McShannon, Mark F. Rzepka

PMC · DOI: 10.1371/journal.pdig.0001179 · 2026-01-07

## TL;DR

This study explores using a multimodal AI model, GPT-4o, to classify lung sounds from spectrograms, finding that providing a few examples improves performance slightly.

## Contribution

Demonstrates that few-shot prompting improves lung sound classification performance using a general-purpose multimodal LLM.

## Key findings

- Few-shot prompting improved accuracy (0.363 vs. 0.320) and other metrics over zero-shot prompting.
- Model repeatability was high (κ = 0.76–0.88), indicating strong consistency.
- Performance gains were statistically significant (p < 0.001) but insufficient for clinical use.

## Abstract

Traditional deep learning models for lung sound analysis require large, labeled datasets, whereas multimodal large language models (LLMs) may offer a flexible, prompt-based alternative. This study aimed to evaluate the utility of a general-purpose multimodal LLM, GPT-4o, for lung sound classification from mel-spectrograms and assess whether a few-shot prompt approach improves performance over zero-shot prompting. Using the ICBHI 2017 Respiratory Sound Database, 6898 annotated respiratory cycles were converted into mel-spectrograms. GPT-4o was prompted to classify each spectrogram using both zero-shot and few-shot strategies. Model outputs were evaluated against ground truth labels using performance metrics including accuracy, precision, recall, and F1-score. Few-shot prompting improved overall accuracy (0.363 vs. 0.320) and yielded modest gains in precision (0.316 vs. 0.283), recall (0.300 vs. 0.287), and F1-score (0.308 vs. 0.285) across labels. McNemar’s test indicated a statistically significant difference in performance between prompting strategies (p < 0.001). Model repeatability analysis demonstrated high agreement (κ = 0.76–0.88; agreement: 89–96%), indicating excellent consistency. GPT-4o demonstrated limited but statistically significant performance gains using few-shot prompting for lung sound classification. While current performance remains insufficient for clinical deployment, this prompt-based approach provides a baseline for spectrogram-based multimodal tasks and a foundation for future exploration of prompt-based multimodal inference.

Lung sounds, such as wheezes and crackles, can offer important clues about respiratory health. Traditionally, doctors use a stethoscope to listen for these sounds, but interpreting them accurately can be difficult and often requires experience. We wanted to explore whether a new type of artificial intelligence (AI), called a multimodal language model, could help identify different lung sounds by looking at visual representations of those sounds known as spectrograms. Specifically, we used a model called GPT-4o, which is designed to understand both images and text, and tested whether giving it a few examples of labeled lung sounds would help it perform better. We found that this few-example or “few-shot” prompting approach led to modest but meaningful improvements in how accurately the model could identify different types of lung sounds compared to giving it no examples at all. While the model’s current performance is insufficient for clinical deployment, our findings establish a foundational baseline, demonstrating that general-purpose AI tools can exhibit in-context learning to improve lung sound classification. This provides a direction for developing flexible and accessible AI support in resource-limited settings.

## Full-text entities

- **Diseases:** LLMs (MESH:D007806), hallucinations (MESH:D006212), Crackles (MESH:D012135)
- **Chemicals:** GPT-4o (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12779135/full.md

---
Source: https://tomesphere.com/paper/PMC12779135