A Theory-Based Explainable Deep Learning Architecture for Music Emotion
Hortense Fong, Vineet Kumar, K. Sudhir

TL;DR
This paper introduces a theory-based, explainable CNN model for predicting music-induced emotions. It uses principles from acoustic physics to improve interpretability while matching the predictive performance of atheoretical deep learning models and outperforming models built on handcrafted features, with applications in digital advertising.
Contribution
The paper presents a novel, theory-informed CNN architecture that enhances explainability and maintains predictive accuracy in modeling music emotions, outperforming handcrafted feature-based models.
Findings
The model achieves comparable predictive performance to atheoretical deep learning models.
Harmonics-based CNN filters improve explainability of emotional predictions.
Emotionally similar ad placements increase engagement and recall.
Abstract
This paper develops a theory-based, explainable deep learning convolutional neural network (CNN) classifier to predict the time-varying emotional response to music. We design novel CNN filters that leverage the frequency harmonics structure from acoustic physics known to impact the perception of musical features. Our theory-based model is more parsimonious but provides comparable predictive performance to atheoretical deep learning models, while performing better than models using handcrafted features. Our model can be complemented with handcrafted features, but the performance improvement is marginal. Importantly, the harmonics-based structure placed on the CNN filters provides better explainability for how the model predicts emotional response (valence and arousal), because emotion is closely related to consonance, a perceptual feature defined by the alignment of harmonics.…
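To make the idea of harmonic alignment concrete, the following is a minimal NumPy sketch and not the authors' filter design. It assumes a linear-frequency spectrum with a fixed bin width; the helper names (`harmonic_mask`, `shared_harmonics`), the bin width, and the example pitches are all hypothetical choices made for illustration. The mask marks the bins at integer multiples of a fundamental, and the overlap between two masks serves as a crude stand-in for consonance.

```python
import numpy as np

def harmonic_mask(f0_hz, n_harmonics, bin_hz, n_bins):
    """Binary mask over linear-frequency bins at integer multiples of f0_hz.

    Illustrative only: a real harmonics-based CNN filter would be learned
    and would typically span time as well as frequency.
    """
    mask = np.zeros(n_bins)
    for k in range(1, n_harmonics + 1):
        idx = int(round(k * f0_hz / bin_hz))
        if idx < n_bins:
            mask[idx] = 1.0
    return mask

def shared_harmonics(f0_a, f0_b, n_harmonics, bin_hz, n_bins):
    """Count bins where the harmonics of two tones coincide."""
    a = harmonic_mask(f0_a, n_harmonics, bin_hz, n_bins)
    b = harmonic_mask(f0_b, n_harmonics, bin_hz, n_bins)
    return int(a @ b)

# Assumed toy setup: 10 Hz bins, 200 bins, 6 harmonics per tone.
BIN_HZ, N_BINS, N_HARM = 10.0, 200, 6

# A 3:2 frequency ratio (perfect fifth): harmonics coincide at 600 and 1200 Hz.
fifth_overlap = shared_harmonics(200.0, 300.0, N_HARM, BIN_HZ, N_BINS)

# Two tones 10 Hz apart: none of the first six harmonics coincide.
close_overlap = shared_harmonics(200.0, 210.0, N_HARM, BIN_HZ, N_BINS)

print(fifth_overlap, close_overlap)
```

In this toy setup the fifth shares two harmonic bins while the closely spaced pair shares none, which mirrors the intuition that consonant intervals have better-aligned harmonics. How the paper's filters actually encode this structure is not specified in the abstract.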
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
