EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition
Hugo Thimonier, Antony Perzo, Renaud Seguier

TL;DR
This paper introduces EmoSLLM, a parameter-efficient method that fine-tunes a large language model on audio and text representations for speech emotion recognition, outperforming most existing Speech-Text LLMs on standard benchmarks with less than half the parameters of competing models.
Contribution
It presents a novel multimodal fine-tuning approach using LoRA for speech emotion recognition, combining audio features and transcripts in an LLM.
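As a rough illustration of why LoRA keeps the trainable parameter count small, the sketch below applies a low-rank update to one frozen weight matrix. It is a generic LoRA layer in NumPy, not the paper's code: the dimensions, the rank `r`, the scaling `alpha`, and the function name `lora_forward` are all assumptions chosen for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen linear layer plus a low-rank LoRA update.

    Computes x @ W.T + (alpha / r) * x @ A.T @ B.T, where only A and B
    would be trained and W stays frozen.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4                   # hypothetical sizes, not from the paper

W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-initialised

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha=8.0)

trainable = A.size + B.size                  # parameters updated during fine-tuning
frozen = W.size                              # parameters left untouched
print(f"trainable={trainable}, frozen={frozen}")
```

With `B` initialised to zero, the adapted layer starts out identical to the frozen one, and in this toy setting the adapter adds 512 trainable parameters next to 4096 frozen ones.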
Findings
Outperforms most existing Speech-Text LLMs on standard benchmarks.
Requires less than half the parameters of competing models.
Demonstrates effective multimodal emotion understanding.
Abstract
Emotion recognition from speech is a challenging task that requires capturing both linguistic and paralinguistic cues, with critical applications in human-computer interaction and mental health monitoring. Recent works have highlighted the ability of Large Language Models (LLMs) to perform tasks beyond natural language processing alone. In particular, recent approaches have investigated coupling LLMs with other data modalities by using pre-trained backbones and different fusion mechanisms. This work proposes a novel approach that fine-tunes an LLM with audio and text representations for emotion prediction. Our method first extracts audio features using an audio feature extractor, which are then mapped into the LLM's representation space via a learnable interfacing module. The LLM takes as input (1) the transformed audio features, (2) additional features in the form of natural language…
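The first two inputs named in the abstract can be sketched as follows: audio features are projected into the LLM's embedding dimension and placed alongside embedded natural-language tokens. This is a minimal sketch under stated assumptions. The interfacing module is modelled here as a single linear layer, which may differ from the paper's design, and all shapes and names (`project_audio`, `build_llm_input`, `W_proj`, `b_proj`) are hypothetical.

```python
import numpy as np

def project_audio(audio_feats, W_proj, b_proj):
    """Map audio features (frames x d_audio) into the LLM embedding space.

    Assumption: the learnable interface is one linear layer.
    """
    return audio_feats @ W_proj + b_proj

def build_llm_input(audio_feats, text_embeds, W_proj, b_proj):
    """Concatenate projected audio tokens with text token embeddings
    along the sequence axis."""
    audio_tokens = project_audio(audio_feats, W_proj, b_proj)
    return np.concatenate([audio_tokens, text_embeds], axis=0)

rng = np.random.default_rng(0)
d_audio, d_llm = 32, 48                        # hypothetical dimensions

audio_feats = rng.normal(size=(10, d_audio))   # stand-in for extractor output, 10 frames
text_embeds = rng.normal(size=(6, d_llm))      # stand-in for 6 embedded text tokens
W_proj = rng.normal(scale=0.02, size=(d_audio, d_llm))
b_proj = np.zeros(d_llm)

llm_input = build_llm_input(audio_feats, text_embeds, W_proj, b_proj)
print(llm_input.shape)
```

The resulting sequence has one row per audio frame followed by one row per text token, all in the LLM's embedding dimension, so it can be consumed like any other embedded input sequence.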
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
