MeWEHV: Mel and Wave Embeddings for Human Voice Tasks

Andrés Carofilis; Laura Fernández-Robles; Enrique Alegre; Eduardo Fidalgo

arXiv:2209.14078 · cs.SD · June 27, 2023

Open Access

TL;DR

This paper introduces MeWEHV, a new speech embedding model combining waveform and MFCC features, which improves performance on speaker, language, and accent identification tasks with minimal extra computation.

Contribution

It proposes a pipeline that combines embeddings from a pre-trained raw-waveform encoder with deep features extracted from MFCCs by CNNs, producing robust speech embeddings that improve performance on several voice tasks.

Findings

01. Significant performance improvements on all tested datasets.

02. Introduction of the YouSpeakers204 dataset for balanced accent and speaker analysis.

03. Low additional computational cost compared to existing models.

Abstract

A recent trend in speech processing is the use of embeddings created through machine learning models trained on a specific task with large datasets. By leveraging the knowledge already acquired, these models can be reused in new tasks where the amount of available data is small. This paper proposes a pipeline to create a new model, called Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), capable of generating robust embeddings for speech processing. MeWEHV combines the embeddings generated by a pre-trained raw audio waveform encoder model with deep features extracted from Mel Frequency Cepstral Coefficients (MFCCs) using Convolutional Neural Networks (CNNs). We evaluate the performance of MeWEHV on three tasks: speaker, language, and accent identification. For the first one, we use the VoxCeleb1 dataset and present YouSpeakers204, a new and publicly available dataset for English…
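The two-branch idea in the abstract can be illustrated with a small sketch. It assumes each branch has already produced a sequence of frame-level vectors (one from the waveform encoder, one from the CNN over MFCCs), and it fuses them by mean pooling over time followed by concatenation. That fusion rule, the function names `mean_pool` and `fuse_embeddings`, and the toy numbers are all assumptions made for illustration; the abstract does not state how MeWEHV actually merges the two branches.

```python
from statistics import fmean
from typing import List, Sequence

Frame = Sequence[float]


def mean_pool(frames: Sequence[Frame]) -> List[float]:
    """Average a (time x dim) sequence over time into one fixed-size vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must share the same dimension")
    return [fmean(frame[d] for frame in frames) for d in range(dim)]


def fuse_embeddings(wave_frames: Sequence[Frame],
                    mfcc_frames: Sequence[Frame]) -> List[float]:
    """Concatenate the pooled outputs of both branches.

    Hypothetical fusion rule, not taken from the paper: the two branches
    may have different frame rates and dimensions, so each is pooled
    over time before concatenation.
    """
    return mean_pool(wave_frames) + mean_pool(mfcc_frames)


# Toy stand-ins for the two branches' outputs (values are made up).
wave_frames = [[1.0, 3.0], [3.0, 5.0]]                   # 2 frames x 2 dims
mfcc_frames = [[0.0, 2.0, 4.0], [2.0, 2.0, 0.0],
               [4.0, 2.0, 2.0]]                          # 3 frames x 3 dims

utterance_embedding = fuse_embeddings(wave_frames, mfcc_frames)
print(utterance_embedding)  # [2.0, 4.0, 2.0, 2.0, 2.0]
```

The resulting fixed-length vector could then feed a classifier for speaker, language, or accent identification. Pooling before concatenation is chosen here only because it sidesteps aligning the two branches in time.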

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing