deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

Dianwen Ng; Ruixi Zhang; Jia Qi Yip; Zhao Yang; Jinjie Ni; Chong Zhang; Yukun Ma; Chongjia Ni; Eng Siong Chng; Bin Ma

arXiv:2302.14597·cs.SD·March 1, 2023

TL;DR

deHuBERT is a novel self-supervised training framework that enhances speech recognition robustness by disentangling noise from speech representations, improving performance in noisy conditions without sacrificing accuracy on clean data.

Contribution

The paper introduces deHuBERT, a new training method that applies auxiliary losses to produce noise-agnostic speech embeddings, advancing robustness in noisy environments.

Findings

01. Improved speech recognition accuracy in noisy conditions.
02. Maintains performance on clean speech data.
03. Effective against unseen noise types.

Tables

Table 1: Experimental results on the synthesized noisy data for various noise types at SNRs of 0–20 dB, without an LM. Values are WER (%) under noisy (0–20 dB SNR) and clean conditions; lower is better.

Methods	Pre-train	Type-B noise				Type-A noise			Avg. (noisy)	Clean (subset)
		Babble	Airport/Station	AC/Vacuum	Cafe	Traffic	Metro	Car		
Fine-tuning: 10-hours labeled (with additive FreeSound noise)
HuBERT Base	Clean	33.71	26.85	23.82	20.19	19.05	18.26	12.91	22.11	13.5
HuBERT Base	FreeSound	27.93	22.33	20.77	17.58	17.08	17.30	13.05	19.43	13.7
deHuBERT (Ours)	FreeSound	26.58	21.23	20.14	16.83	16.05	15.74	11.95	18.36	12.8
Fine-tuning: 1-hour labeled (with additive FreeSound noise)
HuBERT Base	Clean	49.72	41.86	39.98	35.79	34.42	33.08	26.74	37.37	27.8
HuBERT Base	FreeSound	42.54	36.83	36.11	32.82	32.19	31.77	27.60	34.27	29.1
deHuBERT (Ours)	FreeSound	41.74	36.27	35.54	32.41	31.51	31.24	26.68	33.63	28.4
Fine-tuning: 10-mins labeled (with additive FreeSound noise)
HuBERT Base	Clean	70.25	63.62	61.89	57.68	55.41	54.66	47.95	58.78	48.4
HuBERT Base	FreeSound	60.53	56.31	56.00	52.92	53.16	52.58	49.56	54.44	50.7
deHuBERT (Ours)	FreeSound	58.59	53.82	53.88	50.66	49.67	49.71	45.80	51.73	47.1
Fine-tuning: 100-hours labeled (with additive FreeSound noise)
DEMUCS [15]	FreeSound	45.56	36.98	38.20	27.02	26.46	23.22	16.02	30.49	10.9
AvT [15]	No	43.42	35.32	36.62	27.06	27.88	24.28	17.76	30.33	13.1
Wav2vec 2.0 [11]	Clean	47.50	39.68	38.84	31.14	29.22	27.44	18.24	33.15	14.0
Wav2vec 2.0 [11]	FreeSound	39.56	32.50	34.94	25.22	24.52	22.48	16.24	27.92	13.5
EW2 [11]	FreeSound	33.88	27.36	27.94	22.08	20.94	19.84	14.88	23.85	12.3
HuBERT Base	FreeSound	22.52	16.91	15.94	12.79	12.43	12.20	8.39	14.45	9.4
deHuBERT (Ours)	FreeSound	21.25	16.02	14.93	11.94	11.66	11.21	7.62	13.52	8.6

Table 2: Results on various out-of-domain noisy conditions. Models are fine-tuned on 10 h of the respective dataset.

Testing set from the original data (Clean)
Models	FT Data (10hrs)	WER (%) of testing data $↓$
		LS (Test set)		TEDLIUM
		Clean	Other	Dev	Test
HuBERT Base	LibriSpeech	9.8	18.2	25.4	23.6
deHuBERT (Ours)	LibriSpeech	10.1	18.1	25.5	23.8
HuBERT Base	TEDLIUM	14.9	23.8	18.1	17.3
deHuBERT (Ours)	TEDLIUM	15.2	23.7	18.2	17.4
Testing set with additive FreeSound noise (0–20 dB)
HuBERT Base	LibriSpeech	20.3	36.4	35.8	36.4
deHuBERT (Ours)	LibriSpeech	13.4	26.0	30.1	30.3
HuBERT Base	TEDLIUM	23.5	38.8	26.4	27.8
deHuBERT (Ours)	TEDLIUM	19.3	32.8	22.7	22.8
Testing set with additive OOD, office noise (0–20 dB)
HuBERT Base	LibriSpeech	26.6	44.5	42.2	43.9
deHuBERT (Ours)	LibriSpeech	17.0	32.0	33.7	35.5
HuBERT Base	TEDLIUM	30.6	46.2	34.5	35.3
deHuBERT (Ours)	TEDLIUM	23.2	37.4	26.2	27.7

Equations

$$C_{ij}^{(cc)} \triangleq \frac{\sum_{n} y_{n,i}\,\tilde{y}_{n,j}}{\sqrt{\sum_{n} (y_{n,i})^{2}}\,\sqrt{\sum_{n} (\tilde{y}_{n,j})^{2}}} \tag{1}$$

$$\mathcal{L}_{CC} \triangleq \underbrace{\sum_{i} \left(1 - C_{ii}^{(cc)}\right)^{2}}_{\text{invariance term}} + \lambda \underbrace{\sum_{i} \sum_{j \neq i} \left(C_{ij}^{(cc)}\right)^{2}}_{\text{disentangling term}} \tag{2}$$

$$\underbrace{-\sum_{n} \frac{\langle y_{n}, \tilde{y}_{n} \rangle_{i}}{\tau\,\lVert y_{n} \rVert_{2}\,\lVert \tilde{y}_{n} \rVert_{2}}}_{\text{infoNCE's positive contrastive}} \qquad \underbrace{\sum_{i} \left(1 - \frac{\langle y_{\cdot,i}, \tilde{y}_{\cdot,i} \rangle_{n}}{\lVert y_{\cdot,i} \rVert_{2}\,\lVert \tilde{y}_{\cdot,i} \rVert_{2}}\right)^{2}}_{\text{proposed invariance term}} \tag{3}$$

$$\mathcal{L} = \mathcal{L}_{HB} + \alpha\,\mathcal{L}_{CC} + \beta\,\mathcal{L}_{SC} \tag{4}$$

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Full text

deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

Abstract

Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a single source, which tends to do poorly when noise is present during testing. Nonetheless, it is crucial to overcome the adverse influence of noise for real-world applications. In this work, we propose a novel training framework, called deHuBERT, for noise reduction encoding inspired by H. Barlow’s redundancy-reduction principle. The new framework improves the HuBERT training algorithm by introducing auxiliary losses that drive the self- and cross-correlation matrix between pairwise noise-distorted embeddings towards identity matrix. This encourages the model to produce noise-agnostic speech representations. With this method, we report improved robustness in noisy environments, including unseen noises, without impairing the performance on the clean set.

**Index Terms:** self-supervised learning, disentangling representations, noise robust automatic speech recognition

1 Introduction

Recently, self-supervised pre-training in speech has seized the limelight with numerous successes in building highly effective automatic speech recognition (ASR) systems [1, 2], especially for low-resource languages [3]. This success stems from leveraging large amounts of unannotated utterances to construct universal speech representations that benefit downstream ASR tasks. Such frameworks include contrastive predictive coding (CPC) [4], which learns by predicting future steps with a contrastive loss, and autoregressive predictive coding (APC) [5], which builds its speech representations by reconstructing future frames from the past sequence.

Most of these works focused on a single domain of relatively clean audio, e.g. LibriSpeech [6], that lacks domain variation. Nevertheless, speech in real-world environments usually contains background noise, reverberation and other non-linear distortions. [7] showed that many off-the-shelf universal speech models are vulnerable to this issue: the performance of downstream ASR systems degrades significantly when there is a domain shift from the pre-training data.

To improve noise robustness, [8] modified wav2vec 2.0 (w2v2) to include a contrastive loss that learns the cross-quantized targets between the original-noisy pair. Likewise, [9] employed a contrastive loss as a regularizer to achieve noise-reduced speech features. [10] provided another approach, using a teacher-student framework that resembles a siamese network to encode denoising representations from the perturbed data. In addition, [11] constructed an enhanced w2v2 that minimizes a consistency loss between noisy and clean features, and [12] introduced an auxiliary reconstruction task to improve the noise robustness of the learned representations. However, most of these approaches may be hard to reproduce and involve careful implementation details.

In this paper, we aim to improve the noise robustness of the self-supervised pre-trained HuBERT [2] model for noisy ASR. We achieve this by introducing a new pair of auxiliary loss functions that encourage noise invariance in HuBERT’s embedded contextual representations. To realize this, we propose a novel self-supervised training framework, disentangled HuBERT (deHuBERT), which regularizes HuBERT training using the recently proposed Barlow Twins [13], a method that reduces redundant information between the vector representations of images. We adapt this technique for sequential modelling and show that it is simple and highly effective in learning noise-invariant speech representations. The method computes the cross-correlation matrix between the embeddings of two identical networks fed with different noise-augmented samples and pushes it towards the identity matrix. For the diagonal elements of the cross-correlation matrix to approach 1, the network has to extract the features on which the two augmented utterances agree (i.e. speech content) while minimizing other factors of variation (i.e. background noise) in each dimension of the frame-level representations. Furthermore, decorrelating the off-diagonal elements creates the conditions for disentanglement. Experimental results show that our pre-trained model consistently exhibits better robustness in noisy environments, including unseen noises, without compromising the performance on the clean audio test set.

2 Methodology

2.1 HuBERT

The HuBERT model architecture follows w2v2, with a convolutional encoder, a BERT encoder, a projection layer and a code embedding layer. HuBERT adapts the BERT model from NLP to perform self-supervised speech representation learning. This allows the encoder to discover good high-level latent representations of both acoustic and language information from the continuous speech signal. During pre-training, it uses an offline clustering step (i.e., the K-Means algorithm) to generate aligned discrete target labels (codes) for computing the BERT-like prediction loss on the masked frames, following the SpanBERT masking strategy. HuBERT training starts with hidden units from $K=100$ clusters derived from the MFCC features of the raw audio data. In the subsequent iteration, the target codes are updated using hidden units from $K=500$ clusters, determined at the second iteration from the intermediate latent representations of the sixth layer of HuBERT’s transformer. However, the HuBERT training algorithm does not inherently disentangle representations for noise separation or reduction, making the encoder vulnerable to noise.
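The offline clustering step can be illustrated with a minimal sketch. It assumes plain Lloyd's k-means on random stand-in features; the function name `kmeans_codes`, the 39-dimensional input and the iteration count are illustrative choices, not details taken from the paper or from Fairseq.

```python
import numpy as np

def kmeans_codes(features, k, n_iter=20, seed=0):
    """Cluster frame features with Lloyd's k-means; return (codes, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct frames.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()

    def assign(cents):
        # Squared Euclidean distance from every frame to every centroid: (frames, k).
        dists = ((features[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        codes = assign(centroids)
        for c in range(k):
            members = features[codes == c]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    return assign(centroids), centroids

# Hypothetical stand-in for MFCC frames: 1000 frames, 39 dimensions.
rng = np.random.default_rng(0)
mfcc_like = rng.normal(size=(1000, 39)).astype(np.float32)
codes, centroids = kmeans_codes(mfcc_like, k=100)  # K=100 as in the first iteration
```

Each frame's code then serves as the discrete target for the masked-prediction loss; a later iteration would rerun the same procedure with `k=500` on transformer features instead of MFCCs.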

2.2 deHuBERT

To obtain disentangled noise-agnostic representations with the HuBERT model, our proposed deHuBERT training algorithm uses HuBERT to generate, in parallel, a second embedding from a differently noise-augmented version of the input using a shared CNN encoder, as shown in Fig. 1. Here, two sets of noise are randomly selected and added to the training data with SNRs ranging between 0 and 25 dB. We then collect the encoded feature representations, $X$ and $\tilde{X}$, from the intermediate outputs and pass them to a shared linear projection block to get $Y$ and $\tilde{Y}$, respectively. Finally, following the losses introduced by [13], we derive the empirical cross-correlation (CC) matrix by

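$$C_{ij}^{(cc)} \triangleq \frac{\sum_{n} y_{n,i}\,\tilde{y}_{n,j}}{\sqrt{\sum_{n} (y_{n,i})^{2}}\,\sqrt{\sum_{n} (\tilde{y}_{n,j})^{2}}} \tag{1}$$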

where the sum over $n$ runs over the frames used and $i$, $j$ refer to the dimensional position in the frame-level representations. Note that $C^{(cc)}$ is a $d \times d$ square matrix with entries in $[-1,1]$, where $d$ is the size of the projected output. We employ a CC loss that pushes the CC matrix towards the identity matrix. This loss function is defined by

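$$\mathcal{L}_{CC} \triangleq \underbrace{\sum_{i} \left(1 - C_{ii}^{(cc)}\right)^{2}}_{\text{invariance term}} + \lambda \underbrace{\sum_{i} \sum_{j \neq i} \left(C_{ij}^{(cc)}\right)^{2}}_{\text{disentangling term}} \tag{2}$$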

where $\lambda$ is a penalizing parameter that balances the trade-off between the first and second terms of the loss.
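A minimal NumPy sketch of Eq. 1 and Eq. 2 may help make the shapes concrete. It assumes the normalization in Eq. 1 is a per-dimension L2 norm over frames (so every entry is a cosine similarity) and applies no mean-centering; the function names, the `eps` guard and the toy data are illustrative and not taken from the authors' code.

```python
import numpy as np

def cross_correlation(y, y_tilde, eps=1e-8):
    """Eq. 1 sketch: y and y_tilde are (n, d); returns the (d, d) CC matrix."""
    num = y.T @ y_tilde                            # sum_n y[n, i] * y_tilde[n, j]
    norm_y = np.sqrt((y ** 2).sum(axis=0))         # (d,) norm over frames
    norm_t = np.sqrt((y_tilde ** 2).sum(axis=0))   # (d,) norm over frames
    return num / (norm_y[:, None] * norm_t[None, :] + eps)

def correlation_loss(c, lam=0.005):
    """Eq. 2 sketch: invariance term plus lam times the disentangling term."""
    diag = np.diag(c)
    invariance = ((1.0 - diag) ** 2).sum()
    off_diagonal = c - np.diag(diag)               # zero out the diagonal
    disentangling = (off_diagonal ** 2).sum()
    return invariance + lam * disentangling

# Toy data: two slightly different "noisy views" of the same features.
rng = np.random.default_rng(0)
y = rng.normal(size=(640, 8))
y_tilde = y + 0.1 * rng.normal(size=(640, 8))
c = cross_correlation(y, y_tilde)
loss = correlation_loss(c)
```

The loss is zero exactly when the matrix equals the identity, which is the target the training objective pushes towards.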

Since $Y$ and $\tilde{Y}$ are sequential features, ignoring the frame-level correlation tends to overestimate the variability. To account for this, we flatten the outputs and remove the zero-padded frames within each minibatch, then randomly sample $n$ frames, applying the same indices to both $Y$ and $\tilde{Y}$. This makes the sampled features more independent and gives us some control over the stability of the proposed framework.
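The flatten, unpad and sample procedure could look roughly like the sketch below. It assumes padded batches of shape (batch, time, d) with a per-utterance length vector; `sample_frames` and its arguments are hypothetical names, and sampling without replacement is an assumption.

```python
import numpy as np

def sample_frames(y, y_tilde, lengths, n, rng):
    """Drop padded frames, then draw the same n random frames from both views."""
    batch, time, _ = y.shape
    # True for real frames, False for padding: shape (batch, time).
    valid = np.arange(time)[None, :] < np.asarray(lengths)[:, None]
    flat_y = y[valid]              # (total_valid_frames, d)
    flat_t = y_tilde[valid]
    n = min(n, len(flat_y))
    idx = rng.choice(len(flat_y), size=n, replace=False)
    return flat_y[idx], flat_t[idx]   # identical indices for both views

rng = np.random.default_rng(0)
lengths = np.array([50, 30, 42, 10])
y = rng.normal(size=(4, 50, 16))
for b, length in enumerate(lengths):
    y[b, length:] = np.nan         # NaN marks padding so any leak would be visible
y_tilde = y + 1.0                  # toy second view with a known offset
y_s, t_s = sample_frames(y, y_tilde, lengths, n=64, rng=rng)
```

Because the same index vector is applied to both views, row `k` of `y_s` and row `k` of `t_s` always come from the same frame of the same utterance.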

To understand how the proposed CC loss can reduce noise to obtain invariant features, we compare it to the infoNCE [14] loss. Formally, the first term in Eq. 2 shares a close resemblance to the positive contrastive pair in infoNCE as presented in Eq. 3.

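$$\underbrace{-\sum_{n} \frac{\langle y_{n}, \tilde{y}_{n} \rangle_{i}}{\tau\,\lVert y_{n} \rVert_{2}\,\lVert \tilde{y}_{n} \rVert_{2}}}_{\text{infoNCE's positive contrastive}} \qquad \underbrace{\sum_{i} \left(1 - \frac{\langle y_{\cdot,i}, \tilde{y}_{\cdot,i} \rangle_{n}}{\lVert y_{\cdot,i} \rVert_{2}\,\lVert \tilde{y}_{\cdot,i} \rVert_{2}}\right)^{2}}_{\text{proposed invariance term}} \tag{3}$$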

Similar to the objective behind the positive contrastive loss, we try to maximize the agreeing speech content between the two distorted embeddings and reduce other variations (e.g. noise) by making each pair of corresponding feature dimensions perfectly correlated. Likewise, decorrelating the off-diagonal elements discourages information sharing across the feature components while encouraging disentangled representations.
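The two terms compared in Eq. 3 differ mainly in the axis along which the cosine similarity is taken: per frame across feature dimensions for infoNCE, per feature dimension across frames for the invariance term. The sketch below illustrates this under that reading of the equation; the function names and the temperature value `tau=0.1` are illustrative assumptions.

```python
import numpy as np

def infonce_positive(y, y_tilde, tau=0.1):
    """Positive-pair term: cosine similarity per frame (over features), summed over frames."""
    cos = (y * y_tilde).sum(axis=1) / (
        np.linalg.norm(y, axis=1) * np.linalg.norm(y_tilde, axis=1))
    return -(cos / tau).sum()

def invariance_term(y, y_tilde):
    """Invariance term: cosine similarity per feature (over frames), pushed towards 1."""
    cos = (y * y_tilde).sum(axis=0) / (
        np.linalg.norm(y, axis=0) * np.linalg.norm(y_tilde, axis=0))
    return ((1.0 - cos) ** 2).sum()

rng = np.random.default_rng(0)
y = rng.normal(size=(5, 3))          # 5 frames, 3 feature dimensions
pos_same = infonce_positive(y, y)    # identical views: every per-frame cosine is 1
inv_same = invariance_term(y, y)     # identical views: every per-dimension cosine is 1
```

With identical views both terms reach their optimum: the positive-pair term equals minus the number of frames divided by `tau`, and the invariance term is zero.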

To gain further disentanglement in the output representations, we build another linear projection block of the same structure that takes in the bottleneck representations $Z$ to compute the projected $P_{Z}$ for estimating the empirical self-correlation (SC). The estimation can be done by reusing the computation in Eq. 1 with random sampling, replacing the arguments with ($P_{Z}$, $P_{Z}$). Again, we compute the SC loss as in Eq. 2 but with the SC matrix. In practice, we believe that the CC loss alone may not be perfect in obtaining noise-invariant representations. Disentangling the bottleneck features and then using them to predict the hidden units (i.e. HuBERT’s codes) of the original clean training audio guides the encoder to detect residual noise information and eventually suppress it in the final contextual representations.
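One consequence of reusing Eq. 1 with identical arguments is worth spelling out. Writing $p_{n,i}$ for the entries of $P_{Z}$ (a notation introduced here only for illustration) and assuming the normalization in Eq. 1 makes each entry a cosine similarity, the diagonal of the SC matrix is identically 1, so the invariance term vanishes and only the decorrelation term contributes:

$$C^{(sc)}_{ii} = \frac{\sum_{n} p_{n,i}\,p_{n,i}}{\sqrt{\sum_{n} p_{n,i}^{2}}\,\sqrt{\sum_{n} p_{n,i}^{2}}} = 1 \quad\Rightarrow\quad \sum_{i} \left(1 - C^{(sc)}_{ii}\right)^{2} = 0, \qquad \mathcal{L}_{SC} = \lambda \sum_{i} \sum_{j \neq i} \left(C^{(sc)}_{ij}\right)^{2}$$

Under this reading, the SC loss acts purely as a redundancy-reduction penalty on the bottleneck projection. Whether the same value of $\lambda$ is used for both losses is not stated in the text.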

The complete optimization loss used in our pre-training framework is given by

$$\mathcal{L} = \mathcal{L}_{HB} + \alpha\,\mathcal{L}_{CC} + \beta\,\mathcal{L}_{SC} \tag{4}$$

where the three terms refer to the HuBERT loss, the cross-correlation loss and the self-correlation loss, respectively. Both $\alpha$ and $\beta$ are set to 0.5 in this work.

3 Experiment

3.1 Data Description

We set up our data environments following [15, 11] for performance comparisons. In our experiments, we use the full 960 h of LibriSpeech for pre-training and the dev-clean corpus as the validation set. The noise dataset used for training is obtained from FreeSound [16] and consists of 16 kHz noise recordings that can be categorized into stationary (Type A) and non-stationary (Type B). The Type A noises are Car, Metro and Traffic; the Type B noises are Babble, Airport/Station, Cafe and AC/Vacuum. Each type of noise has 10 and 8 different audio streams in the training and testing sets, respectively. The total duration of the noise data is around 2 h. During testing, 120 randomly chosen files from the test-clean set of LibriSpeech are used, as per the standard procedure for testing on this dataset. These files come pre-mixed with the noises at different SNRs between 0 and 20 dB, which makes up 4200 instances of noisy test data. The noise data and noisy test sets can be downloaded from https://github.com/archiki/Robust-E2E-ASR.
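For readers unfamiliar with SNR-controlled mixing, the sketch below shows one common way to add noise to speech at a target SNR. It is a generic illustration, not the procedure used to build the released test sets; `mix_at_snr`, the tiling of short noise clips and the synthetic signals are all assumptions. As a side note, the count of 4200 test instances would be consistent with 120 files × 7 noise types × 5 SNR levels, although the number of SNR levels is not stated in the text.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale the noise so that speech power / noise power matches snr_db, then add it."""
    # Repeat or crop the noise clip to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # After scaling, the noise power is p_speech / 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10) + eps))
    return speech + scale * noise

# Synthetic stand-ins: a 1 s tone at 16 kHz as "speech", white noise as "noise".
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 440.0 * t)
noise = rng.normal(size=4000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values produce louder noise relative to the speech; 0 dB means the two have equal power.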

3.2 Model Pre-training

We perform continual pre-training for 250k steps, starting from the weights provided by the Fairseq toolkit. In our implementation, the final projection blocks have an output dimension $d$ of 2048 for CC and 4096 for SC. In contrast to [13], we observed a concave trend in performance as the dimensionality of the projector network increases. Additionally, we sampled $n=640$ frames; we found that a smaller sample size benefits early-stage learning, as its slightly higher estimation error excites the network and allows the model to escape from local minima. However, this requires a smaller $\lambda=0.005$ to limit the adverse effect of the estimation error. Finally, we also found that a smaller learning rate of 7e-5 leads to better model pre-training.

3.3 Model Fine-tuning

We used the best checkpoint from pre-training and followed the typical base setup for the 100 h, 10 h, 1 h and 10 min fine-tuning conditions. ASR fine-tuning involves only the HuBERT component. Additionally, we employed multi-condition training with the training noise at SNRs of 0 to 20 dB. Finally, we selected the checkpoint with the best validation WER for the final evaluation.

4 Experimental Results

We compare our results, obtained without a language model, against an off-the-shelf HuBERT baseline to determine how effectively our model learns noise-robust ASR with limited fine-tuning data. We also include results from a HuBERT Base model that undergoes multi-condition pre-training for a more complete analysis. Table 1 shows the ASR performance in WER on the test-clean subset pre-mixed with the individual noise types at SNRs between 0 and 20 dB. We observe that pre-training HuBERT with noise improves the adaptability to noise in downstream ASR, but at the cost of degraded clean-speech performance. Nonetheless, deHuBERT outperforms the baseline HuBERT on both noisy and clean speech regardless of the pre-training condition. Additionally, the difference in performance becomes more apparent as fine-tuning resources become scarcer. Finally, we run the typical 100 h fine-tuning experiment to compare deHuBERT with existing models. On the complete test-clean and test-other sets, we achieve WERs of 6.3% and 13.2%, respectively. This score is comparable to the baseline performance despite using only noisy speech for fine-tuning. Additionally, deHuBERT achieves the lowest WER on the noisy data.

To visualize the noise-agnostic properties of the deHuBERT embeddings, we plot the t-SNE of the bottleneck features of both HuBERT and deHuBERT in Fig. 2. The features were obtained from 720 randomly selected audio samples of train-clean-100 mixed with Airport, Metro and Cafe noises at 0 dB SNR. Before applying the t-SNE algorithm, we performed global mean pooling over all the bottleneck features in a sequence to get one vector representation per sample. On the HuBERT Base plot (left), we can identify clusters consisting of samples with the same noise type, indicating the presence of noise information. In comparison, the deHuBERT plot (right) exhibits no clear clustering according to the type of noise.

4.1 Post-methodology Study

In this section, we stress-test our model to determine the robustness of its out-of-domain (OOD) performance. We use the TEDLIUM3 [17] dataset to explore the effect of domain shift on noisy ASR. Moreover, we introduce out-of-domain office noise from FSD50K [18] by selecting noise from the groups Whispering, Writing, Typing, Typewriter, Telephone, Conversation, Laughter, Computer Keyboard and Printer. We filter those that are less than 10m, which leaves 385 files. Table 2 presents the performance on the complete test sets after fine-tuning on the selected clean audio set (10 h), under three conditions: (1) in-domain (ID) clean test set, (2) ID pre-train noise but OOD fine-tuning, and (3) OOD pre-train noise and OOD fine-tuning. Firstly, our pre-trained model is comparable to the base model under condition (1). This is important, as it indicates that our model remains robust and is unaffected by noisy pre-training. Secondly, even with noise unseen during fine-tuning, deHuBERT performs consistently better than HuBERT Base in noisy environments under conditions (2) and (3). Lastly, although performance still degrades on ID and OOD noisy ASR, the percentage increase in WER is lower for deHuBERT than for HuBERT Base, especially under condition (3).
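The last point can be checked directly against Table 2. The short sketch below computes the relative WER increase from the clean condition for the models fine-tuned on 10 h of LibriSpeech, using the LS test-clean column; the dictionary layout and the helper name `relative_increase` are illustrative.

```python
# WER (%) copied from Table 2: fine-tuned on LibriSpeech (10 h), LS test-clean column.
wer = {
    "HuBERT Base": {"clean": 9.8, "freesound": 20.3, "office": 26.6},
    "deHuBERT": {"clean": 10.1, "freesound": 13.4, "office": 17.0},
}

def relative_increase(clean, noisy):
    """Percentage increase in WER when moving from the clean to a noisy condition."""
    return 100.0 * (noisy - clean) / clean

for model, row in wer.items():
    for condition in ("freesound", "office"):
        inc = relative_increase(row["clean"], row[condition])
        print(f"{model:12s} {condition:10s} +{inc:.1f}%")
```

For this column, the WER of HuBERT Base rises by roughly 107% with FreeSound noise and 171% with office noise, against roughly 33% and 68% for deHuBERT.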

5 Conclusion

In this paper, we proposed a novel pre-training framework that disentangles noise with self- and cross-correlation losses for more robust speech recognition. Our model handles noisy ASR environments better, including OOD noises, without compromising performance on the clean audio test set. The t-SNE plot of the contextual representations from deHuBERT offers a visual understanding of the improvement in noise robustness: the randomly scattered projections imply that little noise information is embedded.

Bibliography

[1] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[2] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[3] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” arXiv preprint arXiv:2111.09296, 2021.
[4] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[5] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass, “An unsupervised autoregressive model for speech representation learning,” Proc. Interspeech 2019, pp. 146–150, 2019.
[6] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[7] Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, et al., “Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training,” arXiv preprint arXiv:2104.01027, 2021.
[8] Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, and Yu Wu, “Wav2vec-switch: Contrastive learning from original-noisy speech pairs for robust speech recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7097–7101.