Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music
Venkat Suprabath Bitra, Homayoon Beigi

TL;DR
This paper introduces a lightweight, self-supervised method for accurate fundamental frequency and voicing detection in monophonic music, effective with limited data and robust to recording artifacts.
Contribution
It presents a novel self-supervised framework using transposition-equivariant learning and an EM-style reweighting scheme for joint F0 and voicing estimation without manual labels.
Findings
Achieves RPA 95.84 and RCA 96.24 on MDB-stem-synth ground truth after training on MedleyDB
Demonstrates strong cross-instrument generalization
Operates effectively with limited training data
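For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how Raw Pitch Accuracy (RPA) and Raw Chroma Accuracy (RCA) are conventionally computed: the fraction of reference-voiced frames whose estimate lies within 50 cents of the reference, with RCA additionally forgiving octave errors. The function names, the 50-cent default, and the use of 0 Hz to mark unvoiced frames are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def _cents_on_voiced(f_ref, f_est):
    """Signed error in cents on reference-voiced frames (NaN where no estimate)."""
    f_ref = np.asarray(f_ref, dtype=float)
    f_est = np.asarray(f_est, dtype=float)
    voiced = f_ref > 0                      # assumption: 0 Hz marks an unvoiced frame
    ref, est = f_ref[voiced], f_est[voiced]
    err = np.full(ref.shape, np.nan)
    has_est = est > 0
    err[has_est] = 1200.0 * np.log2(est[has_est] / ref[has_est])
    return err

def raw_pitch_accuracy(f_ref, f_est, tol_cents=50.0):
    """Fraction of reference-voiced frames with |error| below tol_cents."""
    err = _cents_on_voiced(f_ref, f_est)
    if err.size == 0:
        return 0.0
    ok = np.zeros(err.shape, dtype=bool)
    valid = ~np.isnan(err)
    ok[valid] = np.abs(err[valid]) < tol_cents
    return float(ok.mean())

def raw_chroma_accuracy(f_ref, f_est, tol_cents=50.0):
    """Like RPA, but errors are folded into one octave, so octave slips count as correct."""
    err = _cents_on_voiced(f_ref, f_est)
    if err.size == 0:
        return 0.0
    ok = np.zeros(err.shape, dtype=bool)
    valid = ~np.isnan(err)
    folded = err[valid] - 1200.0 * np.floor(err[valid] / 1200.0 + 0.5)
    ok[valid] = np.abs(folded) < tol_cents
    return float(ok.mean())

# Toy example: exact hit, octave error, roughly 77-cent error, and one unvoiced frame.
f_ref = [220.0, 220.0, 220.0, 0.0]
f_est = [220.0, 440.0, 230.0, 100.0]
rpa = raw_pitch_accuracy(f_ref, f_est)
rca = raw_chroma_accuracy(f_ref, f_est)
```

In the toy example the octave error is penalized by RPA but not by RCA, which is why RCA is never lower than RPA on the same data.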
Abstract
Reliable fundamental frequency (F0) and voicing estimation is essential for neural synthesis, yet many pitch extractors depend on large labeled corpora and degrade under realistic recording artifacts. We propose a lightweight, fully self-supervised framework for joint F0 estimation and voicing inference, designed for rapid single-instrument training from limited audio. Using transposition-equivariant learning on CQT features, we introduce an EM-style iterative reweighting scheme that uses Shift Cross-Entropy (SCE) consistency as a reliability signal to suppress uninformative noisy/unvoiced frames. The resulting weights provide confidence scores that enable pseudo-labeling for a separate lightweight voicing classifier without manual annotations. Trained on MedleyDB and evaluated on MDB-stem-synth ground truth, our method achieves competitive cross-corpus performance (RPA 95.84, RCA…
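The reweighting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the exact SCE formula, the exponential weighting with temperature `tau`, and the pseudo-label thresholds `lo` and `hi` are all assumptions chosen for illustration. The sketch takes a predicted pitch distribution for a frame and for the same frame transposed by `k` CQT bins, scores how well the second matches the first shifted by `k` bins, and turns that score into a per-frame weight.

```python
import numpy as np

def shift_bins(p, k):
    """Shift a distribution over pitch bins by k bins (zero-fill, then renormalize)."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    if k > 0:
        out[k:] = p[:-k]
    elif k < 0:
        out[:k] = p[-k:]
    else:
        out[:] = p
    total = out.sum()
    return out / total if total > 0 else np.full_like(p, 1.0 / p.size)

def shift_cross_entropy(p, p_transposed, k, eps=1e-12):
    """Assumed SCE: cross-entropy between shift(p, k) and the transposed frame's prediction."""
    target = shift_bins(p, k)
    return float(-np.sum(target * np.log(np.asarray(p_transposed) + eps)))

def e_step_weights(losses, tau=1.0):
    """Assumed E-step: frames with low SCE (consistent under transposition) get weight near 1."""
    losses = np.asarray(losses, dtype=float)
    return np.exp(-(losses - losses.min()) / tau)

def weighted_loss(losses, weights):
    """Assumed M-step objective: weighted mean of the per-frame SCE losses."""
    losses = np.asarray(losses, dtype=float)
    return float(np.sum(weights * losses) / np.sum(weights))

def pseudo_labels(weights, lo=0.2, hi=0.8):
    """Hypothetical thresholds: 1 = voiced, 0 = unvoiced, -1 = left unlabeled."""
    labels = np.full(weights.shape, -1, dtype=int)
    labels[weights >= hi] = 1
    labels[weights <= lo] = 0
    return labels

n_bins, k = 32, 3

# A pitched frame: sharply peaked prediction that moves by k bins under transposition.
peaked = np.full(n_bins, 0.03 / (n_bins - 1)); peaked[10] = 0.97
peaked_t = np.full(n_bins, 0.03 / (n_bins - 1)); peaked_t[13] = 0.97

# A noisy frame: flat prediction that carries no pitch information.
flat = np.full(n_bins, 1.0 / n_bins)

sce_peaked = shift_cross_entropy(peaked, peaked_t, k)
sce_flat = shift_cross_entropy(flat, flat, k)
weights = e_step_weights([sce_peaked, sce_flat])
labels = pseudo_labels(weights)
objective = weighted_loss([sce_peaked, sce_flat], weights)
```

In a full training loop these two steps would alternate: recompute the weights from the current model's per-frame SCE, then update the model on the weighted loss. The resulting weights could then seed a separate voicing classifier, as the abstract describes.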
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
