TL;DR
SecoustiCodec is a novel cross-modal speech codec that effectively separates semantic and paralinguistic information, enabling high-quality streaming speech reconstruction at ultra-low bitrates with aligned speech and text representations.
Contribution
SecoustiCodec is a unified low-bitrate streaming speech codec that combines semantic completeness, disentanglement of semantic and paralinguistic information, and cross-modal alignment between speech and text representations.
Findings
Achieves state-of-the-art reconstruction quality (PESQ) at bitrates of 0.27 kbps and 1 kbps.
Effectively disentangles semantic and paralinguistic information.
Supports streaming speech reconstruction with high fidelity.
Abstract
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of…
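As a rough illustration of the FSQ idea named in the abstract, the sketch below shows generic finite scalar quantization: each latent dimension is squashed into a bounded range and rounded to one of a few integer levels, so the codebook is implicit in the grid of levels. This is not SecoustiCodec's implementation. The function names, the level counts, and the frame rate are hypothetical choices for illustration, and the straight-through gradient trick needed for training is omitted.

```python
import math


def fsq_quantize(z, levels):
    """Map each scalar in z to an integer code in [0, levels[i] - 1].

    Generic FSQ-style rounding (an assumption, not the paper's exact recipe):
    tanh bounds the value, then it is rescaled and rounded to the nearest level.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        bounded = (math.tanh(value) + 1.0) / 2.0  # squashed into [0, 1]
        codes.append(round(bounded * (num_levels - 1)))
    return codes


def codes_to_index(codes, levels):
    """Combine per-dimension codes into one token id (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + code
    return index


def bits_per_token(levels):
    """Bits carried by one token: log2 of the implicit codebook size."""
    return math.log2(math.prod(levels))


def bitrate_bps(levels, frames_per_second):
    """Bitrate of a single-codebook stream: tokens per second times bits per token."""
    return frames_per_second * bits_per_token(levels)


if __name__ == "__main__":
    levels = [4, 4, 4, 4]          # hypothetical level counts, 256 implicit codes
    latent = [0.3, -1.2, 2.0, 0.0]  # hypothetical encoder output for one frame
    codes = fsq_quantize(latent, levels)
    print(codes, codes_to_index(codes, levels))
    print(bitrate_bps(levels, frames_per_second=25))  # hypothetical frame rate
```

Because every combination of levels is a valid code, a quantizer of this kind has no learned codebook entries that can go unused, which is the usual motivation for choosing it over a learned vector-quantization codebook.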
