Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs
Wei-Cheng Tseng, David Harwath

TL;DR
Codec2Vec is a self-supervised speech representation learning method built on neural speech codecs. It achieves competitive performance on a range of speech processing tasks while significantly improving data efficiency and privacy.
Contribution
Codec2Vec is the first framework to rely exclusively on discrete audio codec units for self-supervised speech representation learning, which improves efficiency and privacy.
Findings
Achieves competitive results on the SUPERB benchmark.
Reduces storage needs by up to 16.5 times.
Speeds up training by 2.3 times.
Abstract
Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5x and training time…
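The masked prediction objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single stream of integer codec tokens per utterance, uses the original tokens as prediction targets (the paper explores several target derivation strategies that are not detailed here), and the names `MASK_ID`, `mask_spans`, `masked_targets`, `mask_prob`, and `span_len` are hypothetical, as are their default values.

```python
import random

MASK_ID = -1  # hypothetical id reserved for masked positions (assumption)


def mask_spans(tokens, mask_prob=0.08, span_len=10, seed=0):
    """Mask random spans of a discrete codec token sequence.

    Each position starts a masked span with probability `mask_prob`;
    a span covers up to `span_len` consecutive tokens.
    Returns the masked sequence and the sorted masked positions.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    positions = set()
    for start in range(len(tokens)):
        if rng.random() < mask_prob:
            end = min(start + span_len, len(tokens))
            for i in range(start, end):
                masked[i] = MASK_ID
                positions.add(i)
    return masked, sorted(positions)


def masked_targets(targets, positions):
    """Pair each masked position with the target the model should predict there."""
    return [(i, targets[i]) for i in positions]
```

In a training loop, a model would read the masked sequence and a loss would be computed only at the positions returned by `mask_spans`, comparing the model's predictions with the values from `masked_targets`.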
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
