Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation
Zengwei Yao, Wei Kang, Han Zhu, Liyong Guo, Lingxuan Ye, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Long Lin, Daniel Povey

TL;DR
Flow2GAN introduces a hybrid approach that combines improved Flow Matching with GAN fine-tuning and a multi-resolution architecture, enabling high-quality, few-step audio generation with better efficiency than existing methods.
Contribution
The paper presents a novel two-stage framework that enhances Flow Matching for audio, then fine-tunes with lightweight GANs and a multi-resolution network for efficient high-fidelity audio synthesis.
Findings
Achieves high-quality audio in 1 to 4 sampling steps.
Outperforms existing methods in quality-efficiency trade-offs.
Demonstrates that multi-resolution processing improves modeling capability.
Abstract
Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence during training, while diffusion methods require multi-step inference that introduces considerable computational overhead. In this work, we introduce Flow2GAN, a two-stage framework that combines Flow Matching training for learning generative capabilities with GAN fine-tuning for efficient few-step inference. Specifically, given audio's unique properties, we first improve Flow Matching for audio modeling through: 1) reformulating the objective as endpoint estimation, avoiding velocity estimation difficulties when involving empty regions; 2) applying spectral energy-based loss scaling to emphasize perceptually salient quieter regions. Building on these Flow Matching adaptations, we demonstrate that a further stage…
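The two Flow Matching adaptations named in the abstract can be illustrated with a minimal NumPy sketch. It assumes a linear interpolation path between noise `x0` and data `x1`, which is a common Flow Matching choice but is not confirmed by this excerpt. The helper names, the `oracle` predictor, and the inverse-energy weighting in `energy_scaled_loss` are hypothetical stand-ins, not the paper's formulas.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Assumed linear path: noise x0 at t=0, data x1 at t=1."""
    return (1.0 - t) * x0 + t * x1

def velocity_from_endpoint(x_t, x1_hat, t):
    """Turn an endpoint estimate into the velocity an ODE solver needs."""
    return (x1_hat - x_t) / (1.0 - t)

def euler_sample(predict_endpoint, x0, num_steps):
    """Few-step Euler integration from t=0 to t=1 using endpoint predictions."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so (1 - t) stays positive
        x1_hat = predict_endpoint(x, t)
        x = x + dt * velocity_from_endpoint(x, x1_hat, t)
    return x

def energy_scaled_loss(x1_hat, x1, energy, eps=1e-3):
    """Hypothetical weighting: bins with lower energy get larger weights."""
    weights = 1.0 / (energy + eps)
    weights = weights / weights.mean()  # keep the average weight at 1
    return float(np.mean(weights * (x1_hat - x1) ** 2))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))   # stand-in for target audio features
x1[:, :2] = 0.0                    # two "silent" bins
x0 = rng.standard_normal(x1.shape) # Gaussian noise starting point

# Hypothetical perfect endpoint predictor, used only to check the sampler.
oracle = lambda x_t, t: x1
samples = {n: euler_sample(oracle, x0, n) for n in (1, 2, 4)}
```

With a perfect endpoint predictor, the Euler sampler lands on `x1` for any step count, which is one way to see why endpoint estimation pairs naturally with few-step inference. In silent regions the endpoint target is simply zero, whereas the velocity target `x1 - x0` is pure noise there.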
Peer Reviews
Decision: ICLR 2026 Poster
1. The two-stage design effectively combines the stable training of Flow Matching and the efficient fine-grained generation of GAN, addressing the slow convergence/mode collapse of GANs and high computational cost of diffusion methods. 2. For audio’s unique properties, the authors propose endpoint estimation and spectral energy-adaptive loss scaling to improve Flow Matching, significantly enhancing generation quality in silent regions and perceptual consistency. 3. The multi-resolution network s
1. Compared to BigVGAN-v2 trained on a larger dataset, it still has a slight gap in some metrics, suggesting limitations in generalization to larger-scale data. 2. The one-step model’s performance at low bandwidth (1.5 kbps) is inferior to its two-step version and some competitors, leaving room for improvement in low-bandwidth audio generation.
The paper's core concept of a two-stage training paradigm is well-motivated and presents a clever solution to a known trade-off in generative modeling. The approach logically leverages Flow Matching for robust, global structure learning and then uses a fast GAN fine-tuning stage for refining high-frequency details, which is an effective strategy. The proposed modifications to the Flow Matching objective appear sound; the shift to endpoint prediction is an intuitive way to handle silent regions i
Despite the strong results, the paper has several weaknesses in its positioning and methodological clarity that should be addressed. First, the proposed "spectral energy-adaptive loss scaling" is conceptually very similar to the "energy balanced loss" used in prior work like RFWave, yet the paper fails to discuss or compare against it. This omission makes it difficult to assess the novelty of this specific contribution. Second, the reformulation of the prediction target from velocity to endpoin
Hybrid Paradigm: Effectively merges FM’s stable training with GAN’s detail refinement, resolving FM’s slow inference and GAN’s mode collapse issues. Audio-Specific FM Improvements: Endpoint prediction and energy-adaptive loss significantly boost FM’s performance, validating their utility for audio synthesis. Strong Experimental Results: Outperforms Vocos, RFWave, and WaveFM on most metrics (Tables 1 and 2) for Mel-spectrogram and Encodec token conditioning. Multi-Resolution Network: Enhances frequency
Training Complexity: Two-stage training (FM pre-training + GAN fine-tuning) increases the barrier to reproduction and deployment—users need to manage separate pipelines and hyperparameters for each stage. Incomplete and Unconvincing Comparison Landscape: a. Lack of coverage of GAN-enhanced Flow Matching models: The paper claims novelty in its hybrid FM+GAN framework, but many existing models leverage GANs to accelerate Flow Matching. A detailed comparison to these models—including their design
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing
