WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

Guanrou Yang; Tian Tan; Qian Chen; Zhikang Niu; Yakun Song; Ziyang Ma; Yushen Chen; Zeyu Xie; Tianrui Wang; Yifan Yang; Wenxi Chen; Qi Chen; Wenrui Liu; Shan Yang; and Xie Chen

arXiv:2605.06407 · eess.AS · May 8, 2026

TL;DR

WavCube is a compact continuous latent, derived from an SSL speech encoder, that supports speech understanding, reconstruction, and generation in a single semantic-acoustic representation.

Contribution

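WavCube introduces a two-stage training scheme: Stage 1 trains a semantic bottleneck over SSL features, and Stage 2 injects fine-grained acoustic detail through end-to-end reconstruction. The result is a compact, unified speech representation that supports multiple tasks and is better suited to generative modeling than raw SSL features.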

Findings

01. Approaches WavLM performance on SUPERB despite 8x compression.
02. Achieves state-of-the-art zero-shot TTS with faster convergence.
03. Excels in speech enhancement, separation, and voice conversion.
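
To make the 8x compression figure concrete, here is a minimal shape-level sketch of a feature bottleneck. It is an illustration under stated assumptions, not the authors' implementation: the 1024-wide input (the hidden size of WavLM Large), the choice to compress along the feature dimension, the untrained linear maps `down` and `up`, and the mean-squared reconstruction objective are all hypothetical stand-ins.

```python
import random

# Illustrative sizes only (assumptions, not taken from the paper):
SSL_DIM = 1024            # hidden size of WavLM Large, used as a stand-in SSL feature width
COMPRESSION = 8           # "8x compression", assumed here to act on the feature dimension
LATENT_DIM = SSL_DIM // COMPRESSION


def make_matrix(rows, cols, rng):
    """Random (rows x cols) weight matrix with a simple 1/sqrt(cols) scale."""
    scale = 1.0 / cols ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def project(frames, weight):
    """Apply the linear map `weight` (out_dim x in_dim) to each frame vector."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]


def mse(a, b):
    """Mean squared error between two equally shaped lists of frames."""
    total = sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    count = sum(len(fa) for fa in a)
    return total / count


rng = random.Random(0)
num_frames = 4
# Stand-in for SSL encoder outputs: one SSL_DIM-wide vector per frame.
ssl_features = [[rng.gauss(0.0, 1.0) for _ in range(SSL_DIM)] for _ in range(num_frames)]

down = make_matrix(LATENT_DIM, SSL_DIM, rng)   # hypothetical bottleneck encoder
up = make_matrix(SSL_DIM, LATENT_DIM, rng)     # hypothetical bottleneck decoder

latent = project(ssl_features, down)           # (num_frames, LATENT_DIM)
reconstructed = project(latent, up)            # (num_frames, SSL_DIM)
loss = mse(reconstructed, ssl_features)        # assumed objective: feature reconstruction

print(len(latent), len(latent[0]), len(reconstructed[0]))
```

In this sketch the latent keeps one vector per frame but stores an eighth of the values, which is the sense in which downstream models would operate on a smaller representation. The weights here are random and untrained; a real bottleneck would be optimized.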

Abstract

Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss…

Code & Models

Repositories

yanghaha0908/WavCube (GitHub)

Models

yhaha/WavCube (Hugging Face model · 24 downloads · 1 like)
