A Baseline Multimodal Approach to Emotion Recognition in Conversations
Víctor Yeste, Rodrigo Rivas-Arévalo

TL;DR
This paper provides a simple, accessible multimodal baseline for emotion recognition in conversations, combining text and speech models with late fusion, to serve as a reference for future research.
Contribution
It introduces a lightweight, reproducible multimodal baseline using transformer-based text and self-supervised speech models for emotion recognition in conversations.
Findings
Multimodal fusion can improve emotion recognition over unimodal models.
The baseline setup is effective under limited training conditions.
The report is released for transparency and as a foundation for future, more rigorous comparisons.
Abstract
We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.
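The late-fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact fusion rule, so the version below assumes a weighted average of each model's softmax probabilities. The label list, the function names (`softmax`, `late_fusion`, `predict`), and the `text_weight` parameter are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence

# Assumed label set for illustration only; the abstract does not list the classes.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(
    text_logits: Sequence[float],
    speech_logits: Sequence[float],
    text_weight: float = 0.5,
) -> List[float]:
    """Weighted average of the two models' class probabilities.

    Assumption: fusion happens at the probability level; text_weight is a
    hypothetical mixing coefficient in [0, 1].
    """
    if len(text_logits) != len(speech_logits):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    p_text = softmax(text_logits)
    p_speech = softmax(speech_logits)
    return [
        text_weight * pt + (1.0 - text_weight) * ps
        for pt, ps in zip(p_text, p_speech)
    ]


def predict(
    text_logits: Sequence[float],
    speech_logits: Sequence[float],
    text_weight: float = 0.5,
) -> str:
    """Return the label with the highest fused probability."""
    fused = late_fusion(text_logits, speech_logits, text_weight)
    if len(fused) != len(LABELS):
        raise ValueError("logit vectors must have one score per label")
    best = max(range(len(fused)), key=fused.__getitem__)
    return LABELS[best]
```

Setting `text_weight` to 1.0 or 0.0 recovers the text-only or speech-only prediction, which makes it straightforward to compare the fused output against each unimodal model on the same utterance.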
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech and dialogue systems
