SwiftF0: Fast and Accurate Monophonic Pitch Detection
Lars Nieradzik

TL;DR
SwiftF0 is a lightweight neural network that achieves state-of-the-art monophonic pitch estimation in noisy conditions with high speed and accuracy, suitable for real-time applications on resource-limited devices.
Contribution
The paper introduces SwiftF0, a novel efficient neural model for pitch detection, and a synthetic speech dataset for better training and evaluation.
Findings
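Achieves a 91.80% harmonic mean (HM) at 10 dB SNR, outperforming baselines such as CREPE by over 12 percentage points and degrading by only 2.3 points from clean audio.
Runs approximately 42x faster than CREPE on CPU with only 95,842 parameters.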
Provides a new synthetic dataset and comprehensive evaluation metrics.
Abstract
Accurate and real-time monophonic pitch estimation in noisy conditions, particularly on resource-constrained devices, remains an open challenge in audio processing. We present SwiftF0, a novel, lightweight neural model that sets a new state-of-the-art for monophonic pitch estimation. Through training on diverse speech, music, and synthetic datasets with extensive data augmentation, SwiftF0 achieves robust generalization across acoustic domains while maintaining computational efficiency. SwiftF0 achieves a 91.80% harmonic mean (HM) at 10 dB SNR, outperforming baselines like CREPE by over 12 percentage points and degrading by only 2.3 points from clean audio. SwiftF0 requires only 95,842 parameters and runs approximately 42x faster than CREPE on CPU, making it ideal for efficient, real-time deployment. To address the critical lack of perfectly accurate ground truth pitch in speech…
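The headline figure in the abstract is a harmonic mean (HM) over several evaluation metrics. The text shown here does not list which metrics are combined, so the following is only a minimal sketch of how such an aggregate behaves. The function name `harmonic_mean` and the three example scores are hypothetical and are not taken from the paper. The only numbers drawn from the abstract are the 91.80% HM at 10 dB SNR and the 2.3-point drop from clean audio, which together imply a clean-audio HM of roughly 94.1%.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores (e.g. percentages).

    Returns 0.0 if any score is 0, since a single failed metric
    drives the harmonic mean to zero.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical per-metric scores in percent (NOT values from the paper).
example_scores = [95.0, 90.0, 85.0]
hm = harmonic_mean(example_scores)
am = sum(example_scores) / len(example_scores)

# The harmonic mean never exceeds the arithmetic mean, so one weak
# metric pulls the aggregate down more than a plain average would.
print(f"harmonic mean:   {hm:.2f}")
print(f"arithmetic mean: {am:.2f}")

# Arithmetic implied by the abstract: 91.80% HM at 10 dB SNR with a
# 2.3-point degradation from clean audio.
hm_at_10_db = 91.80
drop_from_clean = 2.3
implied_clean_hm = hm_at_10_db + drop_from_clean
print(f"implied clean-audio HM: about {implied_clean_hm:.1f}%")
```

A harmonic mean is a common choice for this kind of summary score because a model cannot hide one poor metric behind several strong ones.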
