Federated Representation Learning for Automatic Speech Recognition
Guruprasad V Ramesh, Gopinath Chennupati, Milind Rao, Anit Kumar Sahu, Ariya Rastrow, Jasha Droppo

TL;DR
This paper combines federated learning and self-supervised learning to train speech recognition models on privacy-sensitive, unlabeled data distributed across devices, achieving performance comparable to centralized training and improving recognition accuracy.
Contribution
It introduces a federated self-supervised learning approach for speech recognition that respects data privacy constraints and improves WER by 12-15% compared to no pre-training.
Findings
Federated SSL pre-training matches centralized model performance.
Pre-training improves WER by 12-15% on English speech recognition.
Pre-training reduces WER by 20% on French speech recognition.
Abstract
Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data. Edge devices like Alexa and Siri are prospective sources of unlabeled audio data that can be tapped to learn robust audio representations. In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. We use the speaker and chapter information in the unlabeled speech dataset, Libri-Light, to simulate non-IID speaker-siloed data distributions and pre-train an LSTM encoder with the Contrastive Predictive Coding framework with FedSGD. We show that the pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. We further adapt the federated pre-trained models to a new…
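The FedSGD step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper trains an LSTM encoder with a Contrastive Predictive Coding loss, whereas this example substitutes a toy squared-error loss so that it stays self-contained. The names `fedsgd_round` and `toy_grad`, the speaker IDs, and the weighting of client gradients by local sample count are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]


def toy_grad(weights: Vector, data: List[float]) -> Vector:
    """Gradient of mean((w - x)^2) over one client's samples.

    Hypothetical stand-in for the gradient of a CPC/InfoNCE loss.
    """
    mean_x = sum(data) / len(data)
    return [2.0 * (weights[0] - mean_x)]


def fedsgd_round(
    weights: Vector,
    client_data: Dict[str, List[float]],
    grad_fn: Callable[[Vector, List[float]], Vector],
    lr: float,
) -> Vector:
    """One FedSGD round over speaker-siloed clients.

    Each client computes a single gradient on its own data and sends only
    that gradient to the server. The server averages the gradients, weighted
    by local sample count (an assumption), and takes one gradient step.
    """
    total = sum(len(samples) for samples in client_data.values())
    aggregate = [0.0] * len(weights)
    for samples in client_data.values():
        local_grad = grad_fn(weights, samples)  # raw data never leaves the client
        share = len(samples) / total
        for i, g in enumerate(local_grad):
            aggregate[i] += share * g
    return [w - lr * g for w, g in zip(weights, aggregate)]


if __name__ == "__main__":
    # Two hypothetical speakers with unequal, non-IID data.
    clients = {"speaker_a": [1.0, 1.0], "speaker_b": [4.0]}
    print(fedsgd_round([0.0], clients, toy_grad, lr=0.1))
```

With this weighting, the aggregated gradient of the toy loss equals the gradient that would be computed on the pooled data, which gives some intuition for why federated pre-training can track a centrally trained model.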
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Voice and Speech Disorders
Methods: Tanh Activation · InfoNCE · Sigmoid Activation · Long Short-Term Memory · Contrastive Predictive Coding
