PitchFlower: A flow-based neural audio codec with pitch controllability

Diego Torres; Axel Roebel; Nicolas Obin

arXiv:2510.25566·eess.AS·October 30, 2025

TL;DR

PitchFlower is a novel flow-based neural audio codec that enables explicit pitch control and high-quality speech synthesis by disentangling pitch from other speech attributes during training.

Contribution

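PitchFlower introduces a simple training-time perturbation for disentangling pitch: F0 contours are flattened and randomly shifted while the true F0 is supplied as conditioning. It pairs a vector-quantization bottleneck, which prevents pitch recovery, with a flow-based decoder to improve controllability and audio quality.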
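To make the vector-quantization bottleneck concrete, here is a minimal NumPy sketch of a nearest-neighbour codebook lookup. It is the generic VQ operation, not the paper's implementation: the function name `vector_quantize`, the array shapes, and the random codebook are all assumptions chosen for illustration.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Replace each latent frame with its nearest codebook entry.

    latents:  (T, D) array of encoder outputs (hypothetical shape).
    codebook: (K, D) array of code vectors (hypothetical shape).
    Returns the quantized frames (T, D) and the chosen indices (T,).
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices

# Toy data: 5 frames of dimension 4, a codebook with 8 entries.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
latents = rng.normal(size=(5, 4))
quantized, indices = vector_quantize(latents, codebook)
```

Because each frame is reduced to one of a few discrete codes, fine-grained information that the codes do not capture is lost, which is the general mechanism by which such a bottleneck can limit what the latent carries.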

Findings

01 Achieves more accurate pitch control than the WORLD vocoder, at much higher audio quality.

02 Outperforms SiFiGAN in controllability while maintaining comparable audio quality.

03 Provides a simple, extensible framework for disentangling other speech attributes beyond pitch.

Abstract

We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high-quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
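The flatten-and-shift perturbation can be sketched on an F0 contour as follows. The abstract does not say which value the contour is flattened to, how the shift is sampled, or how the perturbed signal is resynthesized, so the geometric mean, the uniform shift in semitones, the `max_shift_semitones` parameter, and the function name `flatten_and_shift_f0` are assumptions made only for illustration.

```python
import numpy as np

def flatten_and_shift_f0(f0, rng, max_shift_semitones=6.0):
    """Return a flat, randomly shifted F0 contour in Hz.

    f0 uses 0 for unvoiced frames, which are left at 0.
    The flat value and shift range are assumptions, not the paper's settings.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    out = np.zeros_like(f0)
    if not voiced.any():
        return out
    # Flatten: geometric mean of voiced frames (mean in the log-frequency domain).
    flat_hz = np.exp(np.log(f0[voiced]).mean())
    # Shift: uniform random offset in semitones; one semitone is a factor 2**(1/12).
    shift = rng.uniform(-max_shift_semitones, max_shift_semitones)
    out[voiced] = flat_hz * 2.0 ** (shift / 12.0)
    return out

rng = np.random.default_rng(0)
true_f0 = np.array([0.0, 110.0, 120.0, 130.0, 0.0, 220.0])
perturbed_f0 = flatten_and_shift_f0(true_f0, rng)
```

Under this reading of the abstract, a signal carrying `perturbed_f0` would be what the encoder sees, while `true_f0` is passed to the decoder as conditioning, so the decoder has to take pitch from the conditioning rather than from the latent.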

Code & Models

Models

Hugging Face: diegotg343/PitchFlower (model, 25 downloads, 3 likes)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing