Speech Emotion Diarization: Which Emotion Appears When?
Yingzhi Wang, Mirco Ravanelli, Alya Yacoubi

TL;DR
This paper introduces Speech Emotion Diarization (SED), a new task that identifies when specific emotions occur in speech, supported by a new dataset, ZED, and baseline models for evaluation.
Contribution
The paper proposes SED as a novel fine-grained approach to speech emotion analysis, along with the ZED dataset and baseline solutions.
Findings
Introduction of the SED task and ZED dataset
Baseline models for emotion segmentation provided
Open-source code and models available
Abstract
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?". To facilitate the evaluation of the performance and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually-annotated boundaries of emotion segments within the utterance. We provide competitive baselines and…
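To make the question "Which emotion appears when?" concrete, the sketch below represents an SED output as a list of time-stamped emotion segments and compares it with a reference annotation frame by frame. It is an illustration only: the `EmotionSegment` type, the `label_at` and `mismatch_rate` helpers, the label names, and the 10 ms step are all assumptions made here. They are not taken from the paper or the ZED dataset, and the mismatch rate is a generic frame-level measure, not necessarily the metric the authors use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionSegment:
    """One emotion event with explicit temporal boundaries (hypothetical type)."""
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds (exclusive)
    label: str    # illustrative label, e.g. "neutral" or "angry"


def label_at(segments, t, default="neutral"):
    """Return the label of the segment covering time t, or default if none does."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.label
    return default


def mismatch_rate(reference, hypothesis, duration, step=0.01):
    """Fraction of fixed-size frames whose reference and predicted labels differ.

    Labels are sampled at frame midpoints so that segment boundaries
    never coincide exactly with a sampling instant.
    """
    n_frames = int(round(duration / step))
    if n_frames == 0:
        return 0.0
    wrong = 0
    for i in range(n_frames):
        t = (i + 0.5) * step
        if label_at(reference, t) != label_at(hypothesis, t):
            wrong += 1
    return wrong / n_frames


# Made-up 5-second utterance: the true emotion event spans 2.0 s to 4.0 s.
reference = [
    EmotionSegment(0.0, 2.0, "neutral"),
    EmotionSegment(2.0, 4.0, "angry"),
    EmotionSegment(4.0, 5.0, "neutral"),
]
# A hypothetical prediction that detects the onset half a second late.
hypothesis = [
    EmotionSegment(0.0, 2.5, "neutral"),
    EmotionSegment(2.5, 4.0, "angry"),
    EmotionSegment(4.0, 5.0, "neutral"),
]

rate = mismatch_rate(reference, hypothesis, duration=5.0)
print(f"frame-level mismatch rate: {rate:.3f}")
```

An utterance-level SER system would assign a single label to the whole five seconds. The segment representation instead exposes the boundary error directly: here the late onset leaves half a second of the utterance mislabeled.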
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining
