MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
Szu-Chi Chen, I-Ning Tsai, Yi-Cheng Lin, Sung-Feng Huang, Hung-yi Lee

TL;DR
This paper introduces MoVE, a novel speech-to-speech translation system that effectively preserves non-verbal vocalizations like laughter and crying, enhancing emotional and pragmatic communication in translated speech.
Contribution
MoVE employs a Mixture-of-LoRA-Experts architecture with specialized adapters and a soft-weighting router, enabling efficient and expressive preservation of non-verbal vocalizations in S2ST.
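The idea of blending several LoRA adapters with a soft-weighting router can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `MixtureOfLoRAExperts`, the linear softmax router, the number of experts, the rank, and the scaling factor are all assumptions made for illustration, since the summary does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MixtureOfLoRAExperts:
    """Hypothetical layer: frozen base weight plus softly mixed LoRA updates."""

    def __init__(self, d_in, d_out, num_experts=3, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base projection (stands in for a pretrained weight).
        self.W = rng.normal(0.0, 0.02, (d_in, d_out))
        # One low-rank pair (A_e, B_e) per expert; B starts at zero, as is
        # conventional for LoRA, so the layer initially matches the base model.
        self.A = rng.normal(0.0, 0.02, (num_experts, d_in, rank))
        self.B = np.zeros((num_experts, rank, d_out))
        # Assumed router: a single linear map followed by softmax.
        self.router = rng.normal(0.0, 0.02, (d_in, num_experts))
        self.scale = alpha / rank

    def gate(self, x):
        # Soft weights over experts, one distribution per input row.
        return softmax(x @ self.router, axis=-1)

    def forward(self, x):
        g = self.gate(x)                                   # (batch, experts)
        base = x @ self.W                                  # (batch, d_out)
        # Low-rank update of every expert: (experts, batch, d_out).
        delta = np.einsum("bi,eir,ero->ebo", x, self.A, self.B) * self.scale
        # Blend expert updates with the soft weights instead of picking one.
        mixed = np.einsum("be,ebo->bo", g, delta)
        return base + mixed

layer = MixtureOfLoRAExperts(d_in=16, d_out=8)
x = np.random.default_rng(1).normal(size=(5, 16))
y = layer.forward(x)
```

Because the router produces a full distribution rather than a hard top-1 choice, an input that sits between two expressive states receives a weighted combination of the corresponding adapters, which is the blending behavior the contribution statement describes.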
Findings
MoVE reproduces target NVs in 76% of cases.
MoVE achieves the highest human-rated naturalness and emotional fidelity among the compared systems.
MoVE requires only 30 minutes of curated data for strong performance.
Abstract
Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying, that convey pragmatic intent, which severely limits their real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome data scarcity. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, compared with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional…
