XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation

Tianlun Zuo; Jingbin Hu; Yuke Li; Xinfa Zhu; Hai Li; Ying Yan; Junhui Liu; Danming Xie; Lei Xie

arXiv:2508.07302·eess.AS·August 13, 2025

Open Access

TL;DR

XEmoRAG is a novel framework that enables zero-shot cross-lingual emotion transfer in speech synthesis, using retrieval-augmented generation to produce natural, expressive speech in a target language without parallel emotional data.

Contribution

The paper introduces XEmoRAG, a new method for zero-shot emotion transfer across languages that leverages language-agnostic embeddings and retrieval, avoiding the need for parallel emotional corpora.

Findings

01 Successfully transfers emotion from Chinese to Thai speech.

02 Produces natural and expressive Thai speech without explicit emotion labels.

03 Maintains speaker characteristics and emotional consistency.

Abstract

Zero-shot emotion transfer in cross-lingual speech synthesis means generating speech in a target language whose emotion follows reference speech from a different source language. The task remains challenging because parallel multilingual emotional corpora are scarce, foreign accent artifacts appear in the output, and emotion is difficult to separate from language-specific prosodic features. In this paper, we propose XEmoRAG, a novel framework that enables zero-shot emotion transfer from Chinese to Thai using a large language model (LLM)-based model, without relying on parallel emotional data. XEmoRAG extracts language-agnostic emotional embeddings from Chinese speech and retrieves emotionally matched Thai utterances from a curated emotional database, enabling controllable emotion transfer without explicit emotion labels. Additionally, a flow-matching…
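
The retrieval step mentioned in the abstract (matching a Chinese emotional embedding against a Thai emotional database) can be illustrated with a minimal sketch. This is not the paper's implementation: the similarity measure (cosine), the function names `cosine_similarity` and `retrieve_top_k`, the utterance IDs, and the three-dimensional toy embeddings are all assumptions made for illustration. In practice the embeddings would come from a speech emotion encoder, which is not modeled here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    database: List[Tuple[str, Sequence[float]]],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k database entries most similar to the query, best first."""
    scored = [(utt_id, cosine_similarity(query, emb)) for utt_id, emb in database]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: (utterance ID, emotional embedding) pairs standing in
# for a Thai emotional database. Real embeddings would be much larger.
thai_db = [
    ("thai_utt_001", [0.9, 0.1, 0.0]),
    ("thai_utt_002", [0.0, 1.0, 0.0]),
    ("thai_utt_003", [0.7, 0.7, 0.1]),
]

# Hypothetical embedding extracted from a Chinese reference utterance.
chinese_query = [1.0, 0.2, 0.0]

best = retrieve_top_k(chinese_query, thai_db, k=2)
```

Here `best` holds the two Thai entries whose embeddings point most nearly in the same direction as the Chinese query. Because no emotion label is consulted at any point, the match depends only on the embedding space, which is consistent with the abstract's claim of transfer without explicit emotion labels.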

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Phonetics and Phonology Research