Multi-Modality in Music: Predicting Emotion in Music from High-Level Audio Features and Lyrics
Tibor Krols, Yana Nikolova, Ninell Oldenburg

TL;DR
This study investigates whether combining high-level audio features with lyrics improves music emotion recognition. It finds that the multi-modal approach outperforms the audio-only one when predicting valence, and that mainly 5 of the 11 high-level song features contribute to performance.
Contribution
It demonstrates that multi-modal features enhance emotion prediction in music over audio-only methods, identifying key features that contribute to improved performance.
Findings
Multi-modal features outperform audio-only features in valence prediction.
Of the 11 high-level song features, mainly 5 contribute to performance.
Code for the approach is publicly available.
Abstract
This paper tests whether a multi-modal approach to music emotion recognition (MER), based on high-level song features and lyrics, performs better than a uni-modal one. We use 11 song features retrieved from the Spotify API, combined with lyrics features including sentiment, TF-IDF, and ANEW, to predict valence and arousal (Russell, 1980) scores on the Deezer Mood Detection Dataset (DMDD) (Delbouys et al., 2018) with 4 different regression models. We find that, of the 11 high-level song features, mainly 5 contribute to the performance, and that multi-modal features do better than audio alone when predicting valence. We made our code publicly available.
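As an illustration of what combining song features with lyrics features can look like, the following is a minimal sketch, not the authors' implementation. It assumes early fusion (plain concatenation of the per-song vectors) and one common TF-IDF variant; the abstract does not state which fusion strategy or TF-IDF weighting the paper uses. The function names `tfidf_vectors` and `fuse`, the toy lyrics, and the two audio values per song are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for whitespace-tokenized documents.

    Assumed variant: tf = term count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty lyrics
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def fuse(audio_features, lyric_features):
    """Early fusion (an assumption): concatenate audio and lyric vectors."""
    return [a + l for a, l in zip(audio_features, lyric_features)]


# Toy data, invented for illustration only.
lyrics = ["love love sunshine", "rain and sorrow", "sunshine and rain"]
audio = [[0.8, 0.7], [0.2, 0.3], [0.5, 0.5]]  # two hypothetical song features

vocab, lyric_vecs = tfidf_vectors(lyrics)
multimodal = fuse(audio, lyric_vecs)
```

Each row of `multimodal` could then be passed to a regression model that predicts valence or arousal, while the rows of `audio` alone would serve as the uni-modal baseline for comparison.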
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception
Methods: Test
