TL;DR
This paper introduces a new attention-based neural network architecture for extracting emotionally significant highlights from pop songs, demonstrating improved performance over traditional methods and providing open-source code for reproducibility.
Contribution
It proposes a non-recurrent attention-based model, with late-fusion and early-fusion variants, for music highlight extraction. It extends the authors' earlier emotion-based approach and adds an extensive comparative evaluation.
Findings
The new model outperforms heuristic and structural methods.
Early-fusion attention variant yields better results.
Proposed methods are effective in identifying song highlights.
Abstract
The goal of music highlight extraction is to extract a short consecutive segment of a piece of music that provides an effective representation of the whole piece. In previous work, we introduced an attention-based convolutional recurrent neural network that uses music emotion classification as a surrogate task for music highlight extraction in pop songs. The rationale behind that approach is that the highlight of a song is usually the most emotional part. This paper extends our previous work in two respects. First, methodology-wise, we experiment with a new architecture that does not need any recurrent layers, making the training process faster. Moreover, we compare a late-fusion variant and an early-fusion variant to study which one better exploits the attention mechanism. Second, we conduct and report an extensive set of experiments comparing the proposed attention-based…
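The idea of a "short consecutive segment" chosen with the help of attention can be illustrated with a minimal sketch. It assumes the model has already produced one attention score per fixed-length audio chunk and that the highlight is the contiguous run of chunks with the largest total score. The abstract does not confirm this selection rule, and the function name `extract_highlight`, the `window` parameter, and the example scores are all hypothetical.

```python
def extract_highlight(attention, window):
    """Return (start, end) chunk indices of the contiguous window
    with the largest total attention; end is exclusive.

    Assumption: `attention` holds one non-negative score per audio chunk.
    This selection rule is illustrative, not taken from the paper.
    """
    if window <= 0 or window > len(attention):
        raise ValueError("window must be between 1 and len(attention)")

    # Sum of the first window, then slide one chunk at a time.
    current = sum(attention[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(attention) - window + 1):
        # Add the chunk entering the window, drop the one leaving it.
        current += attention[start + window - 1] - attention[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_start + window


# Hypothetical attention scores for eight chunks of a song.
scores = [0.02, 0.05, 0.10, 0.30, 0.25, 0.20, 0.05, 0.03]
start, end = extract_highlight(scores, window=3)
print(f"highlight covers chunks {start} to {end - 1}")
```

With chunks of a known duration, multiplying `start` and `end` by that duration would give the highlight boundaries in seconds. The sliding sum keeps the search linear in the number of chunks.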
