Exploring Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations

Yujia Sun; Zeyu Zhao; Korin Richmond; Yuanchao Li

arXiv:2409.17899 · eess.AS · May 1, 2025

Open Access

TL;DR

This paper uses self-supervised models to investigate the acoustic features shared by emotional speech and music, analyzing the models' layerwise behavior, cross-domain adaptation, and emotion biases with the aim of improving emotion recognition systems.

Contribution

It provides the first detailed analysis of SSL models' layerwise behavior in cross-domain emotion recognition and demonstrates effective fine-tuning strategies leveraging music and speech data.

Findings

01 SSL models capture shared acoustic features between speech and music
02 Parameter-efficient fine-tuning improves emotion recognition performance
03 Emotion biases exist in SSL models for both speech and music

Abstract

Emotion recognition from speech and from music share similarities because of the acoustic overlap between the two, which has led to interest in transferring knowledge between these domains. However, the acoustic cues shared by speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, given that SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotional speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional…
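As a rough illustration of what a layerwise comparison between speech and music representations could look like, the sketch below computes one similarity score per layer. It is not the authors' method: the abstract does not say how similarity is measured, so mean pooling over time and cosine similarity are assumptions made here for illustration. The function names (`mean_pool`, `layerwise_cosine_similarity`) are hypothetical, and random arrays stand in for hidden states that would normally come from an SSL model.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Average over the time axis: (layers, frames, dim) -> (layers, dim)."""
    return hidden_states.mean(axis=1)


def layerwise_cosine_similarity(speech: np.ndarray, music: np.ndarray) -> np.ndarray:
    """Return one cosine similarity per layer between pooled speech and music features.

    Both inputs have shape (layers, frames, dim). The number of frames may
    differ between the two, because each input is pooled over time first.
    """
    s = mean_pool(speech)
    m = mean_pool(music)
    numerator = (s * m).sum(axis=1)
    denominator = np.linalg.norm(s, axis=1) * np.linalg.norm(m, axis=1)
    # Guard against division by zero for all-zero feature vectors.
    return numerator / np.maximum(denominator, 1e-12)


if __name__ == "__main__":
    # Placeholder data (assumption): 13 layers, 768-dim features,
    # 200 speech frames and 350 music frames.
    rng = np.random.default_rng(0)
    speech_feats = rng.normal(size=(13, 200, 768))
    music_feats = rng.normal(size=(13, 350, 768))
    for layer, sim in enumerate(layerwise_cosine_similarity(speech_feats, music_feats)):
        print(f"layer {layer:2d}: cosine similarity = {sim:+.4f}")
```

With real data, each array would hold the per-layer hidden states of one utterance or music clip, and the scores would typically be averaged over many pairs that share an emotion label before comparing layers.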

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing