Acoustic and Semantic Modeling of Emotion in Spoken Language
Soumya Dutta

TL;DR
This thesis explores joint acoustic and semantic modeling of emotions in speech, proposing new methods for emotion-aware representation learning, recognition in conversations, and emotion style transfer to enhance AI understanding and synthesis of human emotions.
Contribution
It introduces novel strategies for emotion-aware speech representation, hierarchical models for conversational emotion recognition, and a textless speech-to-speech emotion transfer framework, advancing multimodal emotion modeling in AI.
Findings
Improved emotion transfer in speech synthesis.
Enhanced emotion recognition accuracy using proposed models.
Effective data augmentation through style-transferred speech.
Abstract
Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human emotions remains an important challenge. While emotional expression is inherently multimodal, this thesis focuses on emotions conveyed through spoken language and investigates how acoustic and semantic information can be jointly modeled to advance both emotion understanding and emotion synthesis from speech. The first part of the thesis studies emotion-aware representation learning through pre-training. We propose strategies that incorporate acoustic and semantic supervision to learn representations that better capture affective cues in speech. A speech-driven supervised pre-training framework is also introduced…
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis
