TL;DR
This paper introduces a semi-supervised learning (SSL) framework for detecting medical conditions from speech dialogues. It exploits unlabeled recordings by modeling audio at the frame, segment, and session levels, and reaches 90% of fully supervised performance with only 11 labeled samples.
Contribution
The paper presents a novel hierarchical semi-supervised learning (SSL) approach that jointly models frame-level, segment-level, and session-level representations, improving disease detection from speech when labeled data is limited.
Findings
Achieves 90% of fully-supervised performance with only 11 labeled samples.
Framework is model-agnostic and robust across languages and conditions.
Effectively utilizes unlabeled clinical dialogues through pseudo-labeling.
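The pseudo-labeling idea in the last finding can be illustrated with a generic confidence-threshold sketch. This is not the paper's procedure, which this summary does not detail. The function name `select_pseudo_labels`, the 0.9 threshold, and the toy probabilities are all assumptions made for illustration.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Keep unlabeled sessions whose top class probability meets the threshold.

    probs: array of shape (n_sessions, n_classes) with rows summing to 1.
    Returns (indices, labels) for the retained sessions.
    NOTE: hypothetical helper; the threshold value is an assumption.
    """
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)      # highest class probability per session
    labels = probs.argmax(axis=1)       # predicted class per session
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]

# Toy predictions for three unlabeled sessions (made-up numbers).
probs = np.array([
    [0.95, 0.05],   # confident, class 0
    [0.60, 0.40],   # too uncertain, discarded
    [0.08, 0.92],   # confident, class 1
])
idx, labels = select_pseudo_labels(probs)
```

Only the retained sessions would be added to the training set with their predicted labels; the uncertain session stays unlabeled.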
Abstract
Detecting medical conditions from speech acoustics is fundamentally a weakly-supervised learning problem: a single, often noisy, session-level label must be linked to nuanced patterns within a long, complex audio recording. This task is further hampered by severe data scarcity and the subjective nature of clinical annotations. While semi-supervised learning (SSL) offers a viable path to leverage unlabeled data, existing audio methods often fail to address the core challenge that pathological traits are not uniformly expressed in a patient's speech. We propose a novel, audio-only SSL framework that explicitly models this hierarchy by jointly learning from frame-level, segment-level, and session-level representations within unsegmented clinical dialogues. Our end-to-end approach dynamically aggregates these multi-granularity features and generates high-quality pseudo-labels to efficiently…
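As a rough illustration of the frame, segment, and session hierarchy described in the abstract, the sketch below pools frame features into segments and then combines segments into one session vector with softmax weights. It is a minimal sketch under stated assumptions, not the authors' architecture: the function names, the fixed segment length, the mean pooling, and the random `scorer` vector (standing in for a learned scoring layer) are all hypothetical.

```python
import numpy as np

def frames_to_segments(frames, segment_len):
    """Mean-pool consecutive frames into fixed-length segments.

    frames: array of shape (n_frames, dim). Trailing frames that do not
    fill a whole segment are dropped (an assumption of this sketch).
    """
    frames = np.asarray(frames, dtype=float)
    n_segments = frames.shape[0] // segment_len
    trimmed = frames[: n_segments * segment_len]
    return trimmed.reshape(n_segments, segment_len, frames.shape[1]).mean(axis=1)

def segments_to_session(segments, scorer):
    """Softmax-weighted pooling of segment vectors into one session vector.

    scorer: vector of shape (dim,), a placeholder for a learned scoring
    layer. Uneven weights let a few segments dominate, reflecting the idea
    that pathological traits are not uniformly expressed across a session.
    """
    scores = segments @ scorer
    scores = scores - scores.max()      # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ segments, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(103, 8))      # 103 frames of 8-dim features (toy data)
scorer = rng.normal(size=8)             # stand-in for learned parameters
segments = frames_to_segments(frames, segment_len=25)
session, weights = segments_to_session(segments, scorer)
```

A session-level classifier would then operate on `session`, so a single session-level label can still supervise the frame- and segment-level representations beneath it.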
