WavFlow: Audio Generation in Waveform Space

Feiyan Zhou; Luyuan Wang; Shoufa Chen; Zhe Wang; Zhiheng Liu; Yuren Cong; Xiaohui Zhang; Fanny Yang; Belinda Zeng

arXiv:2605.18749 · cs.SD · May 19, 2026

TL;DR

WavFlow generates high-fidelity audio directly in raw waveform space with flow matching, bypassing latent compression, and achieves competitive results on the VGGSound and AudioCaps benchmarks.

Contribution

This work presents WavFlow, a direct waveform generation framework that simplifies audio synthesis and matches state-of-the-art performance without relying on intermediate representations.

Findings

01 WavFlow achieves competitive scores on the VGGSound and AudioCaps benchmarks.

02 The model learns fine-grained acoustic patterns from scratch from 5 million curated video-text-audio triplets.

03 Direct waveform modeling can match or surpass latent-space methods in audio quality.

Abstract

Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive…
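The three ingredients named in the abstract (waveform patchify, amplitude lifting, and x-prediction in flow matching) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the patch size, the RMS-based gain, and the linear interpolation convention x_t = t * x + (1 - t) * noise are all assumptions chosen for illustration, and the abstract does not specify how WavFlow defines any of them.

```python
import numpy as np


def lift_amplitude(wave, target_rms=1.0, eps=1e-8):
    """Hypothetical 'amplitude lifting': rescale a quiet waveform so its RMS
    is comparable to unit-variance noise. The paper's rule may differ."""
    rms = np.sqrt(np.mean(wave ** 2))
    gain = target_rms / max(rms, eps)
    return wave * gain, gain


def patchify(wave, patch_size):
    """Split a 1D waveform into non-overlapping patches.
    Shape (T,) -> (T // patch_size, patch_size); trailing samples are dropped."""
    n = len(wave) // patch_size
    return wave[: n * patch_size].reshape(n, patch_size)


def unpatchify(tokens):
    """Inverse of patchify: flatten the token grid back to a 1D waveform."""
    return tokens.reshape(-1)


def interpolate(x, noise, t):
    """Assumed linear path between noise (t = 0) and data (t = 1)."""
    return t * x + (1.0 - t) * noise


def velocity_from_x_pred(x_pred, x_t, t):
    """With x-prediction the network outputs an estimate of the clean signal;
    under the assumed linear path the implied velocity is (x_pred - x_t) / (1 - t)."""
    return (x_pred - x_t) / (1.0 - t)


def x_prediction_loss(x_pred, x):
    """Mean squared error between the predicted and the true clean signal."""
    return np.mean((x_pred - x) ** 2)
```

Under these assumptions, a one-second clip at 16 kHz with a patch size of 160 becomes a grid of 100 tokens of 160 samples each, and a perfect x-prediction recovers the true velocity x - noise at any t < 1. Dividing the generated waveform by the stored gain undoes the lifting.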

Code & Models

Repositories

facebookresearch/WavFlow (GitHub)
