SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

Chunyu Qiang; Haoyu Wang; Cheng Gong; Tianrui Wang; Ruibo Fu; Tao Wang; Ruilong Chen; Jiangyan Yi; Zhengqi Wen; Chen Zhang; Longbiao Wang; Jianwu Dang; Jianhua Tao

arXiv:2508.02849·eess.AS·August 6, 2025

TL;DR

SecoustiCodec is a cross-modal aligned speech codec that disentangles semantic and paralinguistic information in a single-codebook space, enabling streaming speech reconstruction at ultra-low bitrates with aligned speech and text representations.

Contribution

The paper introduces a low-bitrate streaming speech codec that combines three properties in one model: semantic completeness, disentanglement of semantic and paralinguistic information (e.g., timbre, emotion), and a cross-modal alignment strategy between speech and text.

Findings

01. Achieves state-of-the-art PESQ scores at 0.27 and 1 kbps.

02. Effectively disentangles semantic and paralinguistic information.

03. Supports streaming speech reconstruction with high fidelity.

Abstract

Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of…
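The abstract names FSQ (Finite Scalar Quantization) as one ingredient of the quantizer. As a rough illustration of the general FSQ idea only, and not the authors' implementation, the sketch below bounds each latent dimension with tanh, rounds it to a small fixed set of integer levels, and maps the resulting digits to a single codebook index. The function names, the level configuration `[5, 5, 5]`, and the 10 Hz frame rate are all hypothetical choices made for this example; the VAE part and the straight-through gradient used in training are omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each latent dimension to one of `levels[d]` integer values.

    Assumes odd level counts so the grid is symmetric around zero
    (a simplification made for this sketch).
    """
    levels = np.asarray(levels)
    assert np.all(levels % 2 == 1), "this sketch only handles odd level counts"
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half          # squash into [-half, half] per dimension
    return np.round(bounded).astype(np.int64)

def codes_to_index(codes, levels):
    """Map per-dimension codes to one index in a single implicit codebook."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2   # shift to the range [0, L-1]
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))  # mixed-radix weights
    return (digits * basis).sum(axis=-1)

def bitrate_bps(levels, frame_rate_hz):
    """Bits per second for one index per frame: frame_rate * log2(codebook size)."""
    return float(frame_rate_hz * np.log2(np.prod(levels)))

if __name__ == "__main__":
    levels = [5, 5, 5]                   # hypothetical: 5 * 5 * 5 = 125 codes
    z = np.random.default_rng(0).normal(size=(4, 3))  # 4 frames, 3 latent dims
    codes = fsq_quantize(z, levels)
    print(codes_to_index(codes, levels))
    print(bitrate_bps(levels, frame_rate_hz=10.0))    # hypothetical frame rate
```

Because every dimension is quantized independently on a fixed grid, the codebook is implicit: its size is the product of the level counts, and the bitrate follows from that size and the frame rate.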

Code & Models

Models

🤗
qiangchunyu/SecoustiCodec