Federated Self-supervised Speech Representations: Are We There Yet?
Yan Gao, Javier Fernandez-Marques, Titouan Parcollet, Abhinav Mehrotra, Nicholas D. Lane

TL;DR
This paper systematically examines the challenges of combining self-supervised learning and federated learning for speech models, revealing current limitations and future research directions to enable practical deployment.
Contribution
It provides the first comprehensive analysis of the feasibility, complexities, and bottlenecks of training speech SSL models with federated learning, highlighting key research opportunities.
Findings
Current system constraints hinder SSL and FL integration for speech
Hardware and algorithmic bottlenecks delay practical deployment until 2027
Identifies research directions to overcome existing limitations
Abstract
The ubiquity of microphone-enabled devices has led to large amounts of unlabelled audio data being produced at the edge. The integration of self-supervised learning (SSL) and federated learning (FL) into one coherent system can potentially offer data privacy guarantees while also advancing the quality and robustness of speech representations. In this paper, we provide a first-of-its-kind systematic study of the feasibility and complexities for training speech SSL models under FL scenarios from the perspective of algorithms, hardware, and systems limits. Despite the high potential of their combination, we find existing system constraints and algorithmic behaviour make SSL and FL systems nearly impossible to build today. Yet critically, our results indicate specific performance bottlenecks and research opportunities that would allow this situation to be reversed. While our analysis…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
