Silence is Sweeter Than Speech: Self-Supervised Model Using Silence to Store Speaker Information
Chi-Luen Feng, Po-chun Hsu, Hung-yi Lee

TL;DR
This paper shows that self-supervised speech models such as HuBERT store speaker information primarily in the silent segments of speech, and that adding silence can improve speaker identification accuracy.
Contribution
The study shows that HuBERT encodes speaker information in representations aligned with silent segments, and demonstrates that adding silence improves speaker identification performance.
Findings
Silence segments contain significant speaker information.
Utterances with more silent parts yield higher Speaker Identification (SID) accuracy.
Adding silence to speech improves HuBERT's SID performance by nearly 2%.
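The two measurable ideas in these findings, how much of an utterance is silent and what it means to add silence, can be sketched as follows. This is a minimal illustration, not the authors' procedure: the energy-threshold silence detector, the function names `silence_fraction` and `append_silence`, the 16 kHz sample rate, the 20 ms frame size, the -40 dB threshold, and the choice to append digital zeros at the end of the waveform are all assumptions made for the example.

```python
import numpy as np

def silence_fraction(wave, sample_rate=16000, frame_ms=20, threshold_db=-40.0):
    """Fraction of frames whose RMS level is below threshold_db relative to the loudest frame.

    Assumes `wave` is a 1-D floating-point array. The threshold rule is a
    hypothetical stand-in for whatever silence detector a real study would use.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return 0.0
    # Drop the trailing partial frame and split into fixed-size frames.
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return 1.0  # the whole signal is digital silence
    level_db = 20 * np.log10(np.maximum(rms, 1e-12) / peak)
    return float(np.mean(level_db < threshold_db))

def append_silence(wave, sample_rate=16000, seconds=1.0):
    """Return a copy of `wave` with `seconds` of zeros appended (assumed placement)."""
    pad = np.zeros(int(sample_rate * seconds), dtype=wave.dtype)
    return np.concatenate([wave, pad])

# Usage: a synthetic 1-second tone stands in for speech.
t = np.arange(16000) / 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
padded = append_silence(tone, seconds=1.0)
print(silence_fraction(tone), silence_fraction(padded))
```

In an experiment of the kind summarized here, the padded waveform would then be passed to the SSL model and the downstream SID classifier; that step is omitted because this page does not specify the setup.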
Abstract
Self-Supervised Learning (SSL) has made great strides recently. SSL speech models achieve decent performance on a wide range of downstream tasks, suggesting that they extract different aspects of information from speech. However, how SSL models store various information in hidden representations without interfering is still poorly understood. Taking the recently successful SSL model, HuBERT, as an example, we explore how the SSL model processes and stores speaker information in the representation. We found that HuBERT stores speaker information in representations whose positions correspond to silences in a waveform. There are several pieces of evidence. (1) We find that the utterances with more silent parts in the waveforms have better Speaker Identification (SID) accuracy. (2) If we use the whole utterances for SID, the silence part always contributes more to the SID task. (3) If we…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
