TL;DR
DualCodec is a neural audio codec that combines self-supervised (SSL) semantic representations with waveform representations. It maintains high audio quality at a low frame rate, which makes LM-based speech generation more efficient.
Contribution
It introduces a dual-stream encoding framework that integrates SSL and waveform representations within an end-to-end codec, enhancing the semantic information in the first-layer tokens of a low-frame-rate codec.
Findings
Outperforms Mimi Codec, SpeechTokenizer, DAC, and Encodec in experiments.
Maintains high audio quality at low frame rates.
Enhances the semantic information carried by the first-layer codec tokens.
Abstract
Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised (SSL) representations into the first-layer codec tokens. This work proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec and enables the codec system to maintain high audio quality while operating at a low frame rate. Note that a low-frame-rate codec improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec…
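The efficiency argument in the abstract comes down to sequence length: a codec emits a fixed number of frames per second, so a lower frame rate gives the language model fewer tokens to generate. The sketch below works through that arithmetic. The frame rates (75 Hz and 12.5 Hz), the 10-second duration, and the codebook count are illustrative assumptions and are not taken from the paper; the function names are hypothetical. The quadratic cost ratio assumes a standard self-attention model, which the source does not specify.

```python
def tokens_per_utterance(duration_s: float, frame_rate_hz: float,
                         num_codebooks: int = 1) -> int:
    """Number of discrete tokens a codec emits for one utterance.

    Each frame yields one token per codebook layer.
    """
    frames = round(duration_s * frame_rate_hz)
    return frames * num_codebooks


def attention_cost_ratio(high_rate_hz: float, low_rate_hz: float) -> float:
    """Relative self-attention cost of the high-rate codec vs. the low-rate one.

    Assumes cost grows with the square of the sequence length.
    """
    return (high_rate_hz / low_rate_hz) ** 2


if __name__ == "__main__":
    duration_s = 10.0           # illustrative utterance length
    high_rate, low_rate = 75.0, 12.5  # illustrative frame rates in Hz

    print(tokens_per_utterance(duration_s, high_rate))   # first-layer tokens, high rate
    print(tokens_per_utterance(duration_s, low_rate))    # first-layer tokens, low rate
    print(tokens_per_utterance(duration_s, low_rate, num_codebooks=8))
    print(attention_cost_ratio(high_rate, low_rate))
```

Under these assumed numbers, the low-rate codec produces six times fewer frames for the same audio, so an attention-based LM over the first-layer tokens would do roughly 36 times less pairwise work.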
