Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models
Pavel Stepachev, Pinzhen Chen, Barry Haddow

TL;DR
This paper investigates how large language models can improve post-ASR speech emotion recognition through conversation context, ASR transcript ranking, and system output fusion. The best submission surpasses the provided baseline by 20% in absolute accuracy.
Contribution
It introduces novel prompting techniques and system fusion strategies for LLM-based emotion recognition, highlighting the importance of context selection and transcript ranking.
Findings
Conversation context has diminishing returns
The metric used to select the ASR transcript for prediction is crucial
Best submission surpasses the provided baseline by 20% in absolute accuracy
Abstract
Large language models (LLMs) have started to play a vital role in modelling speech and text. To explore the best use of context and multiple systems' outputs for post-ASR speech emotion prediction, we study LLM prompting on a recent task named GenSEC. Our techniques include ASR transcript ranking, variable conversation context, and system output fusion. We show that the conversation context has diminishing returns and the metric used to select the transcript for prediction is crucial. Finally, our best submission surpasses the provided baseline by 20% in absolute accuracy.
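The three techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt wording, the string-similarity consensus used to rank transcripts, and the majority vote used for fusion are all assumptions chosen for illustration, and the abstract does not say which ranking metric or fusion rule the paper uses.

```python
from collections import Counter
from difflib import SequenceMatcher


def select_transcript(hypotheses):
    """Pick the ASR hypothesis most similar, on average, to the others.

    Assumption: a consensus heuristic based on character-level similarity.
    The paper's actual selection metric is not given in the abstract.
    """
    if len(hypotheses) == 1:
        return hypotheses[0]

    def mean_similarity(i):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        total = sum(SequenceMatcher(None, hypotheses[i], o).ratio() for o in others)
        return total / len(others)

    best = max(range(len(hypotheses)), key=mean_similarity)
    return hypotheses[best]


def build_prompt(history, target, context_size, labels):
    """Build a hypothetical prompt with the last `context_size` turns."""
    # Guard against history[-0:], which would return the whole history.
    context = history[-context_size:] if context_size > 0 else []
    lines = [
        "Classify the emotion of the final utterance.",
        "Answer with one of: " + ", ".join(labels) + ".",
    ]
    if context:
        lines.append("Conversation so far:")
        lines.extend(context)
    lines.append("Final utterance: " + target)
    return "\n".join(lines)


def fuse_predictions(predictions):
    """Majority vote over system outputs; ties go to the label seen first."""
    counts = Counter(predictions)
    return max(counts, key=counts.get)
```

In this sketch, varying `context_size` is how one would probe the diminishing returns of conversation context, and swapping the scoring function inside `select_transcript` is how one would compare transcript selection metrics. The LLM call itself is omitted because the abstract does not specify a model or API.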
Taxonomy
Topics: Emotion and Mood Recognition
