PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective
Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters

TL;DR
PESTO is a lightweight, self-supervised, real-time pitch estimation model that uses a transposition-equivariant objective and a Siamese architecture to outperform self-supervised baselines and generalize across datasets.
Contribution
Introduces PESTO, a self-supervised, transposition-equivariant pitch estimation method with a novel training objective and streamable implementation for real-time use.
Findings
Outperforms self-supervised baselines in pitch estimation.
Competes with supervised methods across datasets.
Achieves low latency (<10 ms) suitable for real-time applications.
Abstract
In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-Q Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight (k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods,…
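The two mechanisms named in the abstract, building pitch-shifted pairs by cropping a frame at different offsets and using a Toeplitz fully-connected layer for translation equivariance, can be illustrated with a toy sketch. This is not the authors' implementation: the frame is synthetic, the kernel is random and untrained, and the names `toeplitz_layer`, `FRAME_LEN`, and `CROP_LEN` are hypothetical. It only checks that two crops offset by a few bins produce outputs offset by the same number of bins.

```python
import random

def toeplitz_layer(x, kernel, out_dim):
    """Multiply x by a Toeplitz matrix W with W[i][j] = kernel[i - j + len(x) - 1].

    Every diagonal of W holds a single shared weight, so the layer behaves
    like a 1-D convolution: shifting the input shifts the output.
    """
    n = len(x)
    assert len(kernel) == n + out_dim - 1
    return [sum(kernel[i - j + n - 1] * x[j] for j in range(n))
            for i in range(out_dim)]

random.seed(0)
FRAME_LEN, CROP_LEN = 40, 24  # hypothetical sizes, chosen for illustration

# Synthetic stand-in for one log-frequency frame: a single spectral bump.
frame = [0.0] * FRAME_LEN
frame[19:22] = [0.5, 1.0, 0.5]

# Two crops of the same frame at different offsets form a "pitch-shifted" pair.
start_a, start_b = 8, 5
crop_a = frame[start_a:start_a + CROP_LEN]
crop_b = frame[start_b:start_b + CROP_LEN]
shift = start_a - start_b  # crop_b equals crop_a translated by this many bins

# Random, untrained Toeplitz weights (assumption: a real model would learn them).
kernel = [random.gauss(0.0, 1.0) for _ in range(2 * CROP_LEN - 1)]
out_a = toeplitz_layer(crop_a, kernel, CROP_LEN)
out_b = toeplitz_layer(crop_b, kernel, CROP_LEN)

# Equivariance check: out_b should equal out_a translated by `shift` bins.
max_dev = max(abs(out_b[i] - out_a[i - shift]) for i in range(shift, CROP_LEN))
print("bin shift between crops:", shift)
print("max deviation from exact equivariance:", max_dev)
```

The check holds exactly here because the bump lies fully inside both crops; with real spectral content reaching the crop boundaries, the equality would only be approximate near the edges.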
