WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
Guanrou Yang, Tian Tan, Qian Chen, Zhikang Niu, Yakun Song, Ziyang Ma, Yushen Chen, Zeyu Xie, Tianrui Wang, Yifan Yang, Wenxi Chen, Qi Chen, Wenrui Liu, Shan Yang, and Xie Chen

TL;DR
WavCube is a compact continuous speech latent, obtained by compressing SSL features into a joint semantic-acoustic representation, that supports understanding, reconstruction, and generation with a single set of features.
Contribution
WavCube introduces a two-stage training scheme that produces a compact, unified speech representation supporting multiple tasks, and that is more tractable for generative modeling than raw SSL features.
Findings
Approaches WavLM performance on SUPERB despite 8x compression.
Achieves state-of-the-art zero-shot TTS with faster convergence.
Excels in speech enhancement, separation, and voice conversion.
Abstract
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss…
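As a rough illustration of the Stage 1 idea only, the sketch below passes frame-level SSL features through a narrow bottleneck and scores how well they can be recovered. Everything concrete here is an assumption made for illustration and is not taken from the paper: the linear projections, the MSE objective, the 1024-to-128 feature dimensions (chosen so the ratio matches the 8x figure in the findings, assuming the compression is along the feature axis), and the function names. Stage 2 and the semantic anchoring loss are not sketched, since the abstract is truncated before describing them.

```python
import numpy as np

def bottleneck_forward(feats, w_down, w_up):
    """Project SSL features to a compact latent, then back to SSL space.

    feats:  (T, D) frame-level features from a frozen SSL encoder (random here).
    w_down: (D, d) hypothetical down-projection, d < D.
    w_up:   (d, D) hypothetical up-projection.
    """
    latent = feats @ w_down   # (T, d) compact latent
    recon = latent @ w_up     # (T, D) reconstructed SSL features
    return latent, recon

def stage1_loss(feats, recon):
    """Mean squared error between original and reconstructed features (assumed objective)."""
    return float(np.mean((feats - recon) ** 2))

rng = np.random.default_rng(0)
T, D, d = 50, 1024, 128       # assumed sizes; D // d == 8
feats = rng.standard_normal((T, D))          # stand-in for real SSL features
w_down = rng.standard_normal((D, d)) / np.sqrt(D)
w_up = rng.standard_normal((d, D)) / np.sqrt(d)

latent, recon = bottleneck_forward(feats, w_down, w_up)
loss = stage1_loss(feats, recon)
print(latent.shape, recon.shape)
```

In an actual training loop the two projections would be learned by minimizing this loss; the point of the sketch is only that a downstream generative model would operate on `latent` rather than on the full-width SSL features.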
