The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions
George Close, Thomas Hain, Stefan Goetze

TL;DR
This paper investigates how the language used to train self-supervised speech representations influences speech enhancement performance, finding that training data quantity impacts results more than language match.
Contribution
It systematically evaluates how the training language and training data quantity of a self-supervised representation affect speech enhancement models that use that representation in their loss function.
Findings
Language match has minor impact on performance.
Training data quantity significantly affects enhancement results.
Models trained with more data perform better across languages.
Abstract
Recent work in the field of speech enhancement (SE) has involved the use of self-supervised speech representations (SSSRs) as feature transformations in loss functions. However, in prior work, very little attention has been paid to the relationship between the language of the audio used to train the self-supervised representation and that used to train the SE system. Enhancement models trained using a loss function which incorporates a self-supervised representation that shares exactly the language of the noisy data used to train the SE system show better performance than those which do not match exactly. This may lead to enhancement systems which are language specific and as such do not generalise well to unseen languages, unlike models trained using traditional spectrogram or time domain loss functions. In this work, SE models are trained and tested on a number of different languages,…
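To make the idea of using an SSSR as a feature transformation in a loss function concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The encoder is a hypothetical stand-in (a fixed random projection followed by tanh) for a frozen pretrained self-supervised model, and the choice of mean squared error as the distance, along with the names make_toy_encoder and sssr_loss and the frame sizes, are assumptions made for illustration only.

```python
import math
import random


def frame_signal(signal, frame_len=400, hop=320):
    """Split a waveform (list of floats) into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def make_toy_encoder(frame_len=400, dim=8, seed=0):
    """Hypothetical stand-in for a frozen, pretrained SSSR encoder.

    A real system would load a pretrained self-supervised model here;
    this toy version only applies a fixed random projection and tanh.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(frame_len)
    weights = [[rng.gauss(0.0, scale) for _ in range(frame_len)]
               for _ in range(dim)]

    def encode(signal):
        # One feature vector of length `dim` per frame.
        return [[math.tanh(sum(w * x for w, x in zip(row, frame)))
                 for row in weights]
                for frame in frame_signal(signal, frame_len)]

    return encode


def sssr_loss(encode, enhanced, clean):
    """Mean squared distance between the encoder features of the
    enhanced signal and those of the clean reference (assumed metric)."""
    total, count = 0.0, 0
    for feat_enh, feat_ref in zip(encode(enhanced), encode(clean)):
        for a, b in zip(feat_enh, feat_ref):
            total += (a - b) ** 2
            count += 1
    return total / count


# Toy data: a clean sine wave and a noisy copy of it.
noise_rng = random.Random(1)
clean = [math.sin(2 * math.pi * 5 * n / 1600) for n in range(1600)]
noisy = [x + noise_rng.gauss(0.0, 0.3) for x in clean]

encode = make_toy_encoder()
loss_identical = sssr_loss(encode, clean, clean)  # no mismatch in feature space
loss_noisy = sssr_loss(encode, noisy, clean)      # penalises feature mismatch
```

In an actual training setup the enhanced signal would be the output of the SE model, and the gradient of this loss would flow back through the frozen encoder into the SE model's parameters. The question the paper raises is how the language of the data used to train that encoder interacts with the language of the SE training data.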
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Infant Health and Development
