CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning

Justin Luong; Hao Xue; Flora D. Salim

arXiv:2508.03764·cs.SD·August 7, 2025


TL;DR

CoughViT is a self-supervised vision transformer framework designed to learn general cough sound representations, improving diagnostic accuracy in respiratory disease detection with limited labeled data.

Contribution

The paper introduces a self-supervised pre-training method for cough audio based on masked data modeling, which improves on supervised methods when labeled data is scarce.
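To make the idea of masked data modeling concrete, here is a minimal sketch of the generic recipe applied to a spectrogram: split the input into patches, hide a random subset, and score a reconstruction only on the hidden patches. This is not the authors' implementation. The patch size, the 0.75 mask ratio, the random stand-in spectrogram, and the helper names `patchify`, `random_mask`, and `masked_mse` are all assumptions chosen for illustration, and the all-zeros "prediction" stands in for the output of an encoder and decoder that are not shown.

```python
import numpy as np

def patchify(spec, patch):
    """Split a (freq, time) array into non-overlapping patch x patch tiles,
    returned as rows of a (num_patches, patch * patch) matrix."""
    f, t = spec.shape
    assert f % patch == 0 and t % patch == 0, "dimensions must divide evenly"
    tiles = spec.reshape(f // patch, patch, t // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True marking patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared reconstruction error over the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 96))      # stand-in for a log-mel spectrogram
patches = patchify(spec, patch=16)        # 4 x 6 = 24 patches of 256 values
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[~mask]                  # the only patches an encoder would see

pred = np.zeros_like(patches)             # placeholder for a decoder's output
loss = masked_mse(pred, patches, mask)
print(patches.shape, int(mask.sum()), visible.shape)
```

Because the loss is computed from the audio itself rather than from diagnostic labels, this kind of objective can be trained on unlabeled cough recordings, which is the property the contribution relies on.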

Findings

1. Outperforms existing supervised audio representations in cough classification tasks.

2. Effective in scenarios with limited labeled data.

3. Achieves state-of-the-art results on multiple respiratory sound datasets.

Abstract

Physicians routinely assess respiratory sounds during the diagnostic process, providing insight into the condition of a patient's airways. In recent years, AI-based diagnostic systems operating on respiratory sounds have demonstrated success in respiratory disease detection. These systems represent a crucial advancement in early and accessible diagnosis, which is essential for timely treatment. However, label and data scarcity remain key challenges, especially for conditions beyond COVID-19, limiting diagnostic performance and reliable evaluation. In this paper, we propose CoughViT, a novel pre-training framework for learning general-purpose cough sound representations, to enhance diagnostic performance in tasks with limited data. To address label scarcity, we employ masked data modelling to train a feature encoder in a self-supervised learning manner. We evaluate our approach against…
