Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music

Venkat Suprabath Bitra; Homayoon Beigi

arXiv:2601.11768 · eess.AS · January 21, 2026

Open Access

TL;DR

This paper introduces a lightweight, self-supervised method for accurate fundamental frequency and voicing detection in monophonic music, effective with limited data and robust to recording artifacts.

Contribution

It presents a self-supervised framework for joint F0 and voicing estimation without manual labels, combining transposition-equivariant learning on CQT features with an EM-style reweighting scheme driven by Shift Cross-Entropy (SCE) consistency.

Findings

01

Achieves competitive cross-corpus accuracy (RPA 95.84, RCA 96.24) when trained on MedleyDB and evaluated on MDB-stem-synth ground truth

02

Demonstrates strong cross-instrument generalization

03

Operates effectively with limited training data
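
For readers unfamiliar with the two metrics quoted in the findings, the following is a minimal sketch of how raw pitch accuracy (RPA) and raw chroma accuracy (RCA) are conventionally computed. It assumes the common 50-cent tolerance and the convention that 0 Hz marks an unvoiced frame; the function name and arguments are illustrative and are not taken from the paper, whose exact evaluation code is not shown on this page.

```python
import numpy as np

def raw_pitch_and_chroma_accuracy(ref_hz, est_hz, tol_cents=50.0):
    """Return (RPA, RCA) as fractions of reference-voiced frames.

    Assumptions (not from the paper): 0 Hz marks an unvoiced frame,
    and a frame counts as correct when the error is under tol_cents.
    """
    ref_hz = np.asarray(ref_hz, dtype=float)
    est_hz = np.asarray(est_hz, dtype=float)

    voiced = ref_hz > 0                 # frames scored by both metrics
    if not voiced.any():
        return 0.0, 0.0

    comparable = voiced & (est_hz > 0)  # an estimate of 0 Hz is always a miss
    diff = np.zeros_like(ref_hz)
    diff[comparable] = 1200.0 * np.log2(est_hz[comparable] / ref_hz[comparable])

    # RPA: the estimate must land within the tolerance of the reference.
    pitch_hit = comparable & (np.abs(diff) < tol_cents)

    # RCA: fold the error to the nearest octave first, so octave errors are forgiven.
    folded = diff - 1200.0 * np.round(diff / 1200.0)
    chroma_hit = comparable & (np.abs(folded) < tol_cents)

    return float(pitch_hit[voiced].mean()), float(chroma_hit[voiced].mean())

# Four frames: exact match, octave error, small detuning, unvoiced reference.
rpa, rca = raw_pitch_and_chroma_accuracy([440.0, 440.0, 220.0, 0.0],
                                         [440.0, 880.0, 226.0, 100.0])
```

Because RCA forgives octave errors, it can never be lower than RPA on the same frames.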

Abstract

Reliable fundamental frequency (F0) and voicing estimation is essential for neural synthesis, yet many pitch extractors depend on large labeled corpora and degrade under realistic recording artifacts. We propose a lightweight, fully self-supervised framework for joint F0 estimation and voicing inference, designed for rapid single-instrument training from limited audio. Using transposition-equivariant learning on CQT features, we introduce an EM-style iterative reweighting scheme that uses Shift Cross-Entropy (SCE) consistency as a reliability signal to suppress uninformative noisy/unvoiced frames. The resulting weights provide confidence scores that enable pseudo-labeling for a separate lightweight voicing classifier without manual annotations. Trained on MedleyDB and evaluated on MDB-stem-synth ground truth, our method achieves competitive cross-corpus performance (RPA 95.84, RCA…
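
The abstract does not give the exact form of the SCE loss or of the reweighting rule, so the following is only one plausible reading, sketched under stated assumptions. It assumes a model that outputs a per-frame distribution over pitch bins, that transposing the CQT input by k bins should shift that distribution by k bins, and that per-frame weights decay exponentially with the consistency loss. All function names, the temperature, and the thresholds are hypothetical.

```python
import numpy as np

def shift_distribution(p, k):
    """Shift a (frames, bins) distribution by k bins, zero-fill, renormalise."""
    out = np.zeros_like(p)
    if k > 0:
        out[:, k:] = p[:, :-k]
    elif k < 0:
        out[:, :k] = p[:, -k:]
    else:
        out[:] = p
    return out / np.maximum(out.sum(axis=1, keepdims=True), 1e-12)

def shift_cross_entropy(p, q, k, eps=1e-12):
    """Per-frame cross-entropy between p shifted by k bins and q.

    p: prediction on the original frame; q: prediction on the same frame
    transposed by k bins. Low values mean the two predictions agree.
    """
    target = shift_distribution(p, k)
    return -(target * np.log(q + eps)).sum(axis=1)

def reliability_weights(loss, temperature=1.0):
    """E-step-like weights in (0, 1]; the most consistent frame gets 1.
    (Assumed exponential form; the paper's rule may differ.)"""
    return np.exp(-(loss - loss.min()) / temperature)

def pseudo_labels(weights, hi=0.8, lo=0.2):
    """1 = confidently voiced, 0 = confidently unvoiced, -1 = left unlabeled."""
    labels = np.full(weights.shape, -1, dtype=int)
    labels[weights >= hi] = 1
    labels[weights <= lo] = 0
    return labels

# Toy data: frame 0 is sharply peaked and shifts consistently by 2 bins;
# frame 1 is flat (noise-like), so its prediction carries no pitch evidence.
bins = 8
p = np.full((2, bins), 0.01); p[0, 3] = 0.93; p[1, :] = 1.0 / bins
q = np.full((2, bins), 0.01); q[0, 5] = 0.93; q[1, :] = 1.0 / bins

loss = shift_cross_entropy(p, q, k=2)
weights = reliability_weights(loss)
labels = pseudo_labels(weights)
```

In an EM-style loop these weights would multiply the per-frame training loss in the next optimisation step, and the confident pseudo-labels could then supervise a separate voicing classifier, which is the role the abstract assigns to them.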

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis