WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

Yifu Chen; Shengpeng Ji; Qian Chen; Tianle Liang; Yangzhuo Li; Ziqing Wang; Wen Wang; Jingyu Lu; Haoxiao Wang; Xueyi Pu; Fan Zhuo; Zhou Zhao

arXiv:2604.14932·cs.AI·April 17, 2026

TL;DR

This paper introduces WavAlign, an adaptive post-training method for spoken dialogue models that combines reward modeling with dynamic regulation of preference updates to improve both semantic quality and speech expressiveness.

Contribution

The paper proposes a modality-aware adaptive post-training approach that makes reinforcement learning practical for spoken dialogue models, improving their semantic and expressive capabilities.

Findings

01. Consistent improvements in semantic quality across benchmarks.

02. Enhanced speech expressiveness demonstrated in experiments.

03. Effective regulation of preference gradients improves training stability.

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and…
