CoughViT: A Self-Supervised Vision Transformer for Cough Audio Representation Learning
Justin Luong, Hao Xue, Flora D. Salim

TL;DR
CoughViT is a self-supervised vision transformer framework designed to learn general cough sound representations, improving diagnostic accuracy in respiratory disease detection with limited labeled data.
Contribution
It introduces a novel self-supervised pre-training method that applies masked data modelling to cough audio, improving on supervised methods in low-data scenarios.
Findings
Outperforms existing supervised audio representations in cough classification tasks.
Effective in scenarios with limited labeled data.
Achieves state-of-the-art results on multiple respiratory sound datasets.
Abstract
Physicians routinely assess respiratory sounds during the diagnostic process, providing insight into the condition of a patient's airways. In recent years, AI-based diagnostic systems operating on respiratory sounds have demonstrated success in respiratory disease detection. These systems represent a crucial advancement in early and accessible diagnosis, which is essential for timely treatment. However, label and data scarcity remain key challenges, especially for conditions beyond COVID-19, limiting diagnostic performance and reliable evaluation. In this paper, we propose CoughViT, a novel pre-training framework for learning general-purpose cough sound representations, to enhance diagnostic performance in tasks with limited data. To address label scarcity, we employ masked data modelling to train a feature encoder in a self-supervised learning manner. We evaluate our approach against…
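The masked data modelling objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the patch size (16), mask ratio (0.75), spectrogram shape, and the helper names patchify, random_mask, and masked_mse are all assumptions chosen for illustration, and the encoder and decoder are replaced by a placeholder prediction. It only shows the general idea of hiding spectrogram patches and scoring reconstruction on the hidden ones.

```python
import numpy as np

def patchify(spec, patch):
    """Split a (freq, time) spectrogram into flattened non-overlapping patches."""
    f, t = spec.shape
    assert f % patch == 0 and t % patch == 0, "dimensions must be divisible by patch size"
    x = spec.reshape(f // patch, patch, t // patch, patch)
    x = x.transpose(0, 2, 1, 3)  # (freq_blocks, time_blocks, patch, patch)
    return x.reshape(-1, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector where True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared reconstruction error, averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 128))   # stand-in for a log-mel spectrogram (assumed shape)
patches = patchify(spec, patch=16)      # hypothetical patch size
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # hypothetical mask ratio
visible = patches[~mask]                # only these patches would reach the encoder
pred = np.zeros_like(patches)           # placeholder for a decoder's reconstruction
loss = masked_mse(pred, patches, mask)
```

In a real pre-training loop, a transformer encoder would process the visible patches and a decoder would produce pred; no labels are needed because the loss is computed against the input itself.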
