MeWEHV: Mel and Wave Embeddings for Human Voice Tasks
Andrés Carofilis, Laura Fernández-Robles, Enrique Alegre, Eduardo Fidalgo

TL;DR
This paper introduces MeWEHV, a new speech embedding model combining waveform and MFCC features, which improves performance on speaker, language, and accent identification tasks with minimal extra computation.
Contribution
It proposes a novel pipeline that combines embeddings from a pre-trained waveform encoder with deep features extracted from MFCCs, producing robust speech embeddings that improve performance on several voice tasks.
Findings
Significant performance improvements on all tested datasets.
Introduction of the YouSpeakers204 dataset for balanced accent and speaker analysis.
Low additional computational cost compared to existing models.
Abstract
A recent trend in speech processing is the use of embeddings created through machine learning models trained on a specific task with large datasets. By leveraging the knowledge already acquired, these models can be reused in new tasks where the amount of available data is small. This paper proposes a pipeline to create a new model, called Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), capable of generating robust embeddings for speech processing. MeWEHV combines the embeddings generated by a pre-trained raw audio waveform encoder model, and deep features extracted from Mel Frequency Cepstral Coefficients (MFCCs) using Convolutional Neural Networks (CNNs). We evaluate the performance of MeWEHV on three tasks: speaker, language, and accent identification. For the first one, we use the VoxCeleb1 dataset and present YouSpeakers204, a new and publicly available dataset for English…
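The two-branch idea described in the abstract can be illustrated with a minimal sketch. This excerpt does not state how MeWEHV actually merges its branches, so the fusion shown here (mean pooling over time followed by concatenation) is an assumption, and the function names `mean_pool` and `fuse_embeddings` are hypothetical. The input lists are toy placeholders standing in for the outputs of a pre-trained waveform encoder and a CNN applied to MFCCs.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Average per-frame vectors into one utterance-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def fuse_embeddings(
    wave_frames: Sequence[Sequence[float]],
    mfcc_frames: Sequence[Sequence[float]],
) -> Vector:
    """Concatenate the pooled outputs of the two branches.

    ASSUMPTION: concatenation after mean pooling is an illustrative choice,
    not the fusion method confirmed by the paper.
    """
    return mean_pool(wave_frames) + mean_pool(mfcc_frames)


# Toy placeholders: 2 frames of 2-dim waveform-encoder output,
# 3 frames of 3-dim MFCC-CNN output (the branches may differ in frame rate).
wave_frames = [[1.0, 3.0], [3.0, 5.0]]
mfcc_frames = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0], [4.0, 6.0, 8.0]]

fused = fuse_embeddings(wave_frames, mfcc_frames)
print(fused)  # [2.0, 4.0, 2.0, 4.0, 6.0]
```

Pooling each branch separately before concatenation lets the two branches run at different frame rates; a downstream classifier for speaker, language, or accent identification would then take the fused vector as input.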
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
