JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis

Jun-Hyeok Cha; Seung-Bin Kim; Hyung-Seok Oh; Seong-Whan Lee

arXiv:2501.04904 · cs.CL · January 10, 2025

Open Access

TL;DR

JELLY is a novel framework that enhances conversational speech synthesis by integrating emotion recognition and context reasoning through fine-tuned LLMs with specialized modules, producing more natural and emotionally appropriate speech.

Contribution

JELLY introduces an Emotion-aware Q-former encoder and a staged fine-tuning approach that improve emotional context modeling in conversational speech synthesis and help address the scarcity of emotional conversational speech data.

Findings

01. JELLY outperforms baseline models in emotional speech synthesis.

02. The framework effectively aligns speech emotions with conversational context.

03. It mitigates the scarcity of emotional conversational speech datasets.

Abstract

Recently, there has been a growing demand for conversational speech synthesis (CSS) that generates more natural speech by considering the conversational context. To address this, we introduce JELLY, a novel CSS framework that integrates emotion recognition and context reasoning for generating appropriate speech in conversation by fine-tuning a large language model (LLM) with multiple partial LoRA modules. We propose an Emotion-aware Q-former encoder, which enables the LLM to perceive emotions in speech. The encoder is trained to align speech emotions with text, utilizing datasets of emotional speech. The entire model is then fine-tuned with conversational speech data to infer emotional context for generating emotionally appropriate speech in conversation. Our experimental results demonstrate that JELLY excels in emotional context modeling, synthesizing speech that naturally aligns with…
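The abstract does not say how the partial LoRA modules are wired into the LLM. As an illustration only, the sketch below takes one common reading of "partial": a frozen projection is applied to every token, while a low-rank update is added only at selected token positions (for example, positions holding speech features). The function names, shapes, mask, and `scale` parameter are all hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product: rows of a times columns of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def partial_lora_forward(
    x: Matrix,          # (tokens, d_in) input activations
    w0: Matrix,         # (d_in, d_out) frozen pretrained weight
    lora_a: Matrix,     # (d_in, r) trainable down-projection
    lora_b: Matrix,     # (r, d_out) trainable up-projection
    mask: List[bool],   # True where the low-rank update applies (assumed: speech tokens)
    scale: float = 1.0,
) -> Matrix:
    """Frozen projection for all tokens, plus a low-rank update on masked tokens only."""
    base = matmul(x, w0)
    delta = matmul(matmul(x, lora_a), lora_b)
    return [
        [b + scale * d if use else b for b, d in zip(base_row, delta_row)]
        for base_row, delta_row, use in zip(base, delta, mask)
    ]


# Two tokens: the first is treated as a speech token, the second as a text token.
x = [[1.0, 0.0], [0.0, 1.0]]
w0 = [[1.0, 0.0], [0.0, 1.0]]   # identity, so the frozen path returns x unchanged
lora_a = [[1.0], [1.0]]         # rank r = 1
lora_b = [[0.5, 0.5]]
out = partial_lora_forward(x, w0, lora_a, lora_b, mask=[True, False])
print(out)  # [[1.5, 0.5], [0.0, 1.0]]
```

Only the first row changes, which is the point of the masking: text tokens keep the pretrained behavior while the adapter specializes on the other modality. A real implementation would use a tensor library and learn `lora_a` and `lora_b` by gradient descent.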

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: ALIGN