Acoustic and Semantic Modeling of Emotion in Spoken Language

Soumya Dutta

arXiv:2603.09212·eess.AS·March 11, 2026

TL;DR

This thesis explores joint acoustic and semantic modeling of emotions in speech, proposing new methods for emotion-aware representation learning, recognition in conversations, and emotion style transfer to enhance AI understanding and synthesis of human emotions.

Contribution

The thesis introduces strategies for emotion-aware speech representation learning, hierarchical models for emotion recognition in conversations, and a textless speech-to-speech emotion transfer framework, advancing joint acoustic and semantic modeling of emotion.

Findings

01. Improved emotion transfer in speech synthesis.

02. Enhanced emotion recognition accuracy using proposed models.

03. Effective data augmentation through style-transferred speech.

Abstract

Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human emotions remains an important challenge. While emotional expression is inherently multimodal, this thesis focuses on emotions conveyed through spoken language and investigates how acoustic and semantic information can be jointly modeled to advance both emotion understanding and emotion synthesis from speech. The first part of the thesis studies emotion-aware representation learning through pre-training. We propose strategies that incorporate acoustic and semantic supervision to learn representations that better capture affective cues in speech. A speech-driven supervised pre-training framework is also introduced…

Taxonomy

Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis