Singing Beat Tracking With Self-supervised Front-end and Linear Transformers
Mojtaba Heydari, Zhiyao Duan

TL;DR
This paper introduces the first singing beat tracking system that uses self-supervised speech representations and self-attention, significantly improving accuracy over existing music beat tracking methods for singing voices.
Contribution
It pioneers singing beat tracking as a new task and demonstrates the effectiveness of pre-trained self-supervised speech models in this domain.
Findings
Outperforms state-of-the-art music beat trackers on singing voices
Pre-trained speech representations outperform generic spectral features
Ablation studies confirm the advantages of self-supervised models
Abstract
Tracking the beats of singing voices without musical accompaniment has many applications in music production, automatic song arrangement, and social media interaction. The main challenge is the lack of the strong rhythmic and harmonic patterns that music rhythmic analysis generally relies on. Even for human listeners, this can be a challenging task. As a result, existing music beat tracking systems fail to deliver satisfactory performance on singing voices. In this paper, we propose singing beat tracking as a novel task and present the first approach to solving it. Our approach leverages semantic information of singing voices by employing pre-trained self-supervised WavLM and DistilHuBERT speech representations as the front-end and uses a self-attention encoder layer to predict beats. To train and test the system, we obtain separated singing voices and…
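The pipeline described in the abstract (frame-level features from a pre-trained front-end, followed by an attention layer that predicts beats) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: random vectors stand in for WavLM or DistilHuBERT frame embeddings, the elu(x) + 1 feature map is one common linear-attention formulation assumed here because the title mentions linear transformers, and every name, dimension, and weight below is hypothetical.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a positive feature map often used for linear attention
    # (assumed here; the paper's exact choice is not stated in this summary).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Non-causal linear attention. q, k: (T, d); v: (T, d_v)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                 # (d, d_v): keys and values summed over time once
    z = phi_q @ phi_k.sum(axis=0)    # (T,): per-frame normaliser, always positive
    return (phi_q @ kv) / z[:, None]

def beat_activations(features, params):
    """Map frame-level features (T, d) to per-frame beat probabilities (T,)."""
    q = features @ params["w_q"]
    k = features @ params["w_k"]
    v = features @ params["w_v"]
    attended = linear_attention(q, k, v)
    logits = attended @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

# Hypothetical stand-in for front-end embeddings: T frames of dimension d.
rng = np.random.default_rng(0)
T, d = 200, 32
features = rng.standard_normal((T, d))

# Randomly initialised (untrained) weights, for shape-checking only.
params = {
    "w_q": rng.standard_normal((d, d)) * 0.1,
    "w_k": rng.standard_normal((d, d)) * 0.1,
    "w_v": rng.standard_normal((d, d)) * 0.1,
    "w_out": rng.standard_normal(d) * 0.1,
    "b_out": 0.0,
}

activations = beat_activations(features, params)
```

Because the key-value summary `kv` is computed once and reused for every frame, the cost grows linearly with the number of frames rather than quadratically, which is the usual motivation for linear attention on long audio sequences. In a real system the activations would come from trained weights and would still need a post-processing step to turn them into beat times.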
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: Test
