A Baseline Multimodal Approach to Emotion Recognition in Conversations

Víctor Yeste; Rodrigo Rivas-Arévalo

arXiv:2602.00914·cs.CL·February 3, 2026

TL;DR

This paper provides a simple, accessible multimodal baseline for emotion recognition in conversations, combining text and speech models with late fusion, to serve as a reference for future research.

Contribution

The paper introduces a lightweight, reproducible multimodal baseline for emotion recognition in conversations, built from a transformer-based text classifier and a self-supervised speech representation model.

Findings

1. Multimodal fusion can improve emotion recognition over unimodal models.

2. The baseline setup is effective under limited training conditions.

3. The preprint provides transparency and a foundation for future comparisons.

Abstract

We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.
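
The abstract describes a "simple late-fusion ensemble" over a text classifier and a speech model but does not specify the fusion rule. As a rough illustration only, the sketch below assumes one common choice: converting each model's logits to probabilities and taking a weighted average. The label set, the function names, and the `text_weight` parameter are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical label set for illustration; the paper's exact labels are not given here.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(text_logits, speech_logits, text_weight=0.5):
    """Weighted average of class probabilities from two unimodal models.

    text_weight is an assumed tunable in [0, 1]; 1.0 uses text only,
    0.0 uses speech only.
    """
    if len(text_logits) != len(speech_logits):
        raise ValueError("both models must score the same number of classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    p_text = softmax(text_logits)
    p_speech = softmax(speech_logits)
    return [
        text_weight * pt + (1.0 - text_weight) * ps
        for pt, ps in zip(p_text, p_speech)
    ]


def predict(text_logits, speech_logits, text_weight=0.5):
    """Return the label with the highest fused probability, plus the fused scores."""
    fused = late_fusion(text_logits, speech_logits, text_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return LABELS[best], fused


if __name__ == "__main__":
    # Made-up logits for one utterance: text leans "joy", speech leans "neutral".
    text_logits = [0.1, 0.0, 0.0, 2.0, 1.0, 0.0, 0.2]
    speech_logits = [0.0, 0.0, 0.1, 0.5, 1.5, 0.3, 0.0]
    label, fused = predict(text_logits, speech_logits, text_weight=0.6)
    print(label, [round(p, 3) for p in fused])
```

In a setup like this, the fusion weight would normally be chosen on a validation split. Setting it to 1.0 or 0.0 recovers the unimodal predictions, which gives a direct way to check whether fusion helps over either model alone.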

Taxonomy

Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech and dialogue systems