SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

Wenxi Chen; Ziyang Ma; Ruiqi Yan; Yuzhe Liang; Xiquan Li; Ruiyang Xu; Zhikang Niu; Yanqiao Zhu; Yifan Yang; Zhanxun Liu; Kai Yu; Yuxuan Hu; Jinyu Li; Yan Lu; Shujie Liu; Xie Chen

arXiv:2412.15649·eess.AS·December 23, 2024·ACL

TL;DR

SLAM-Omni is a novel end-to-end voice interaction system that enables zero-shot timbre control and efficient multi-turn dialogue with single-stage training, outperforming prior models with limited data and training time.

Contribution

The paper introduces a single-stage training approach for a timbre-controllable spoken dialogue system, eliminating the need for pre-training on TTS or ASR tasks.

Findings

01. Outperforms prior models of similar scale.

02. Requires only 15 hours of training on 4 GPUs.

03. Achieves competitive performance with limited data.

Abstract

Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken…
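The grouped prediction idea in the abstract can be illustrated with a minimal sketch. It only shows how packing a flat semantic-token sequence into fixed-size groups shortens the number of decoding steps; it is not the authors' implementation. The function names, the group size of 3, and the padding id are all assumptions made for illustration.

```python
from typing import List


def group_tokens(tokens: List[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Pack a flat token sequence into groups of `group_size`.

    Hypothetical helper: the last group is right-padded with `pad_id`
    so that every group has the same length.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Number of padding tokens needed to reach a multiple of group_size.
    pad_len = (-len(tokens)) % group_size
    padded = tokens + [pad_id] * pad_len
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


def ungroup_tokens(groups: List[List[int]]) -> List[int]:
    """Flatten groups back into a single token sequence (padding included)."""
    return [token for group in groups for token in group]


# Ten made-up semantic token ids; one model step would emit one group.
tokens = list(range(1, 11))
groups = group_tokens(tokens, group_size=3)

# The model now needs len(groups) steps instead of len(tokens) steps.
steps_before = len(tokens)
steps_after = len(groups)
```

With a group size of g, the sequence the language model has to step through shrinks by roughly a factor of g, which is the source of the training and inference speed-up the abstract describes.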

Code & Models

Repositories

X-LANCE/SLAM-LLM
PyTorch · Official

Models

tutu0604/UltraVoice-SFT
model · 6 downloads

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis