Silence is Sweeter Than Speech: Self-Supervised Model Using Silence to Store Speaker Information

Chi-Luen Feng; Po-chun Hsu; Hung-yi Lee

arXiv:2205.03759·cs.LG·May 10, 2022·6 cites

Open Access

TL;DR

This paper reveals that self-supervised speech models like HuBERT store speaker information primarily in silent segments of speech, and leveraging silence can enhance speaker identification accuracy.

Contribution

The study shows that HuBERT encodes speaker information in the representations aligned with silent segments, and demonstrates that adding silence to the input waveform improves speaker identification performance.

Findings

01

Silence segments contain significant speaker information.

02

Utterances with more silent parts yield higher Speaker Identification (SID) accuracy.

03

Adding silence to speech improves HuBERT's SID performance by nearly 2%.
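
The third finding can be illustrated with a minimal sketch of what "adding silence" to a waveform might look like. This summary does not say where the silence is inserted, how long it is, or how it is generated, so the function name `add_silence`, its `position` option, and the use of all-zero samples are assumptions made for illustration, not the authors' procedure.

```python
def add_silence(waveform, sample_rate, seconds, position="end"):
    """Return a copy of `waveform` with `seconds` of digital silence added.

    Hypothetical helper: silence is modelled as zero-valued samples, and the
    placement options are illustrative rather than taken from the paper.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    if position not in ("start", "end", "both"):
        raise ValueError("position must be 'start', 'end', or 'both'")

    samples = list(waveform)
    pad = [0.0] * int(round(seconds * sample_rate))

    if position == "start":
        return pad + samples
    if position == "end":
        return samples + pad
    # "both": split the padding between the two ends of the utterance
    half = len(pad) // 2
    return pad[:half] + samples + pad[half:]
```

The padded waveform would then be passed to the speech model in place of the original. Whether zeros, recorded room noise, or some other signal best plays the role of silence is left open here.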

Abstract

Self-Supervised Learning (SSL) has made great strides recently. SSL speech models achieve decent performance on a wide range of downstream tasks, suggesting that they extract different aspects of information from speech. However, how SSL models store various information in hidden representations without interfering is still poorly understood. Taking the recently successful SSL model, HuBERT, as an example, we explore how the SSL model processes and stores speaker information in the representation. We found that HuBERT stores speaker information in representations whose positions correspond to silences in a waveform. There are several pieces of evidence. (1) We find that the utterances with more silent parts in the waveforms have better Speaker Identification (SID) accuracy. (2) If we use the whole utterances for SID, the silence part always contributes more to the SID task. (3) If we…
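
Evidence (1) compares utterances by how much silence they contain. One rough way to quantify that is sketched below: the fraction of fixed-length frames whose RMS level falls under a threshold. The frame length, the -40 dB threshold, the assumption that samples lie in [-1, 1], and the function name `silence_ratio` are all illustrative choices; the abstract excerpt does not state how the authors identify silent segments.

```python
import math


def silence_ratio(waveform, frame_length=400, threshold_db=-40.0):
    """Fraction of non-overlapping frames whose RMS level is below `threshold_db`.

    Hypothetical helper: assumes samples are floats in [-1, 1], so levels are
    measured in dB relative to full scale. Trailing samples that do not fill
    a whole frame are ignored.
    """
    if frame_length <= 0:
        raise ValueError("frame_length must be positive")

    samples = list(waveform)
    n_frames = len(samples) // frame_length
    if n_frames == 0:
        return 0.0

    silent = 0
    for i in range(n_frames):
        frame = samples[i * frame_length:(i + 1) * frame_length]
        rms = math.sqrt(sum(x * x for x in frame) / frame_length)
        # An all-zero frame has no finite dB level; treat it as silent.
        level_db = 20.0 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db < threshold_db:
            silent += 1
    return silent / n_frames
```

Sorting a dataset by this ratio and comparing SID accuracy across the resulting groups would be one way to probe the relationship the abstract describes.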

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling