PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective

Alain Riou; Bernardo Torres; Ben Hayes; Stefan Lattner; Gaëtan Hadjeres; Gaël Richard; Geoffroy Peeters

arXiv:2508.01488·cs.SD·October 28, 2025

TL;DR

PESTO is a lightweight, self-supervised, real-time pitch estimation model that uses a transposition-equivariant objective and a Siamese architecture to outperform self-supervised baselines and generalize across datasets.

Contribution

Introduces PESTO, a self-supervised, transposition-equivariant pitch estimation method with a novel training objective and streamable implementation for real-time use.

Findings

01

Outperforms self-supervised baselines in pitch estimation.

02

Competes with supervised methods across datasets.

03

Achieves low latency (<10 ms) suitable for real-time applications.

Abstract

In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods,…
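The two ingredients named in the abstract, translated crops of a log-frequency frame and a Toeplitz fully-connected layer, can be illustrated with a toy example. The sketch below is not the authors' implementation: the helper names (`toeplitz_matrix`, `crop_pair`, `matvec`, `argmax`), the 16-bin frame, the crop length, and the hand-picked kernel are all assumptions made for illustration, whereas the real model works on VQT frames with learned weights. It only shows that a matrix with constant diagonals maps a translated input to an equally translated output, which is the property a transposition-equivariant objective relies on.

```python
def toeplitz_matrix(kernel, n_in, n_out):
    """Build an n_out x n_in matrix whose entry (i, j) depends only on j - i."""
    # One weight per diagonal, so the kernel needs n_in + n_out - 1 entries.
    assert len(kernel) == n_in + n_out - 1
    return [[kernel[j - i + n_out - 1] for j in range(n_in)] for i in range(n_out)]


def matvec(matrix, vec):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def crop_pair(frame, crop_len, start, shift):
    """Return two crops of the same frame whose windows are `shift` bins apart."""
    first = frame[start:start + crop_len]
    second = frame[start + shift:start + shift + crop_len]
    return first, second


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


# Toy stand-in for one log-frequency frame (assumption: 16 bins, one smooth peak).
frame = [0.0] * 16
frame[7], frame[8], frame[9] = 0.5, 1.0, 0.5

# Moving the crop window up by `shift` bins moves the content down by `shift` bins.
crop_len, start, shift = 12, 1, 2
x_a, x_b = crop_pair(frame, crop_len, start, shift)

# Hand-picked 3-tap kernel on the central diagonals (illustrative, not learned).
kernel = [0.0] * (2 * crop_len - 1)
kernel[crop_len - 2], kernel[crop_len - 1], kernel[crop_len] = 0.25, 1.0, 0.25
weights = toeplitz_matrix(kernel, crop_len, crop_len)

y_a = matvec(weights, x_a)
y_b = matvec(weights, x_b)

# The output peaks of the two crops are separated by exactly `shift` bins.
print(argmax(y_a), argmax(y_b), argmax(y_a) - argmax(y_b) == shift)
```

Because the layer's weights are shared along diagonals, it behaves like a 1D convolution over frequency bins and needs far fewer parameters than a dense matrix of the same shape. The equality is exact here only because the toy peak stays away from the crop boundaries; near the edges a finite crop breaks strict equivariance.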
