Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models

Pavel Stepachev; Pinzhen Chen; Barry Haddow

arXiv:2410.03312·cs.CL·October 7, 2024

TL;DR

This paper investigates how large language models can improve post-ASR speech emotion recognition through conversation context, ASR transcript ranking, and system output fusion. The best submission surpasses the provided baseline by 20% in absolute accuracy.

Contribution

It studies LLM prompting techniques and system fusion strategies for emotion recognition on the GenSEC task, highlighting the importance of context selection and of the metric used to rank transcripts.

Findings

01 Conversation context has diminishing returns

02 The metric used to select the transcript for prediction is crucial

03 The best submission surpasses the provided baseline by 20% in absolute accuracy

Abstract

Large language models (LLMs) have started to play a vital role in modelling speech and text. To explore the best use of context and multiple systems' outputs for post-ASR speech emotion prediction, we study LLM prompting on a recent task named GenSEC. Our techniques include ASR transcript ranking, variable conversation context, and system output fusion. We show that the conversation context has diminishing returns and the metric used to select the transcript for prediction is crucial. Finally, our best submission surpasses the provided baseline by 20% in absolute accuracy.
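
The abstract names the techniques without detailing them, so the following is only a minimal sketch of how transcript selection and variable conversation context could fit together in a prompt. Everything in it is an assumption made for illustration and is not taken from the paper: the selection rule (pick the ASR hypothesis with the lowest mean word-level edit distance to the other hypotheses), the function names `select_transcript` and `build_prompt`, the prompt wording, and the label set.

```python
from typing import List, Tuple


def word_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance computed over word tokens."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def select_transcript(hypotheses: List[str]) -> str:
    """Hypothetical selection rule: choose the hypothesis that agrees most
    with the others (lowest mean word edit distance). The paper's actual
    ranking metric may differ."""
    tokenised = [h.lower().split() for h in hypotheses]

    def mean_distance(i: int) -> float:
        others = [
            word_edit_distance(tokenised[i], t)
            for j, t in enumerate(tokenised)
            if j != i
        ]
        return sum(others) / max(len(others), 1)

    best = min(range(len(hypotheses)), key=mean_distance)
    return hypotheses[best]


def build_prompt(
    history: List[Tuple[str, str]],
    target: Tuple[str, str],
    context_size: int,
    labels: List[str],
) -> str:
    """Assemble a prompt containing the last `context_size` turns of the
    conversation followed by the target utterance. Wording is illustrative."""
    # history[-0:] would return the whole list, so handle zero explicitly.
    context = history[-context_size:] if context_size > 0 else []
    lines = ["Conversation so far:"] if context else []
    lines += [f"{speaker}: {text}" for speaker, text in context]
    lines.append(f"Target utterance ({target[0]}): {target[1]}")
    lines.append("Answer with one emotion from: " + ", ".join(labels))
    return "\n".join(lines)


if __name__ == "__main__":
    hyps = ["i am so happy today", "i am so happy to day", "i am sew happy today"]
    chosen = select_transcript(hyps)
    history = [("A", "hi"), ("B", "hello"), ("A", "how are you")]
    labels = ["angry", "happy", "neutral", "sad"]  # assumed label set
    print(build_prompt(history, ("B", chosen), context_size=2, labels=labels))
```

Varying `context_size` is one way to probe the diminishing-returns finding, and swapping the distance function inside `select_transcript` is one way to test how sensitive predictions are to the selection metric.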

Code & Models

Repositories

rggdmonk/GenSEC-Task-3 (Official)

Taxonomy

Topics: Emotion and Mood Recognition