LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
Soumya Dutta, Sriram Ganapathy

TL;DR
This paper introduces a multimodal emotion recognition approach that pretrains a text model on ASR transcripts of unlabeled speech, pseudo-labeled by an LLM, and combines the resulting text embeddings with speech embeddings through hierarchical training for emotion recognition in conversations.
Contribution
It proposes a novel hierarchical training method combining speech and text embeddings with LLM-guided pseudo-labeling for emotion recognition.
Findings
Achieves state-of-the-art results on the IEMOCAP and MELD datasets, improving emotion recognition accuracy over prior methods.
Effectively integrates speech and text modalities for conversational emotion analysis.
Abstract
Emotion recognition in conversations (ERC) is challenging due to the multimodal nature of the emotion expression. In this paper, we propose to pretrain a text-based recognition model from unsupervised speech transcripts with LLM guidance. These transcriptions are obtained from a raw speech dataset with a pre-trained ASR system. A text LLM model is queried to provide pseudo-labels for these transcripts, and these pseudo-labeled transcripts are subsequently used for learning an utterance level text-based emotion recognition model. We use the utterance level text embeddings for emotion recognition in conversations along with speech embeddings obtained from a recently proposed pre-trained model. A hierarchical way of training the speech-text model is proposed, keeping in mind the conversational nature of the dataset. We perform experiments on three established datasets, namely, IEMOCAP,…
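The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the label set `EMOTIONS`, the prompt wording, and the `query_llm` callable are all hypothetical placeholders, and the transcripts are assumed to have already been produced by an ASR system.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical label set; the classes used in the paper are not stated in this summary.
EMOTIONS = ("angry", "happy", "neutral", "sad")

# Hypothetical prompt template; the paper's actual prompt is not given here.
PROMPT = (
    "Classify the emotion of the utterance as one of: {labels}.\n"
    "Utterance: {text}\nEmotion:"
)


def parse_label(response: str) -> Optional[str]:
    """Return the first known emotion named in the LLM response, else None."""
    words = response.lower().replace(".", " ").replace(",", " ").split()
    for word in words:
        if word in EMOTIONS:
            return word
    return None


def pseudo_label(
    transcripts: Iterable[str],
    query_llm: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Attach an LLM-provided emotion label to each ASR transcript.

    `query_llm` is an assumed interface: any function that maps a prompt
    string to a response string. Transcripts whose response cannot be
    mapped to a known label are dropped.
    """
    labeled = []
    for text in transcripts:
        prompt = PROMPT.format(labels=", ".join(EMOTIONS), text=text)
        label = parse_label(query_llm(prompt))
        if label is not None:
            labeled.append((text, label))
    return labeled
```

The resulting `(transcript, label)` pairs would then serve as training data for an utterance-level text emotion classifier. Dropping unparseable responses is a design choice of this sketch, not something the abstract specifies.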
Taxonomy
Topics: Emotion and Mood Recognition
