MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows

Takuhiro Kaneko; Hirokazu Kameoka; Kou Tanaka; Yuto Kondo

arXiv:2602.18104 · cs.SD · February 23, 2026

TL;DR

MeanVoiceFlow is a one-step nonparallel voice conversion model based on mean flows. It achieves high-quality speech conversion without pretraining or distillation and performs comparably to multi-step models.

Contribution

It proposes a novel mean flow-based approach to one-step voice conversion, together with training techniques that stabilize training and help the model make use of source information.
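As background, the standard mean flow formulation that such an approach builds on can be written as below, with z_t the state at time t, v the instantaneous velocity, and u the average velocity over the interval [r, t]. Whether MeanVoiceFlow uses exactly this parameterization (its time convention or conditioning inputs, for example) is an assumption here; the abstract confirms only that average velocity replaces instantaneous velocity and that the training target is computed from a derivative of the average velocity.

```latex
\begin{aligned}
u(z_t, r, t) &= \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau, \\
z_r &= z_t - (t - r)\, u(z_t, r, t), \\
u(z_t, r, t) &= v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t), \\
\frac{d}{dt} u(z_t, r, t) &= v(z_t, t)\, \partial_z u(z_t, r, t) + \partial_t u(z_t, r, t).
\end{aligned}
```

The third line follows from differentiating (t - r) u(z_t, r, t) = \int_r^t v(z_\tau, \tau) d\tau with respect to t; its derivative term is what makes the training target depend on the model's own output. Under the convention that t = 1 is the noise end and t = 0 the data end, one-step generation corresponds to r = 0 and t = 1.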

Findings

01. Achieves performance comparable to multi-step models
02. Uses mean flows effectively for single-step inference
03. Introduces a structural margin reconstruction loss and diffused-input training

Abstract

In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional performance in speech quality and speaker similarity. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without…
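The abstract's central claim, that an average velocity lets a single step reproduce the time integral along the inference path while an instantaneous velocity does not, can be checked on a toy problem. The sketch below is a hypothetical one-dimensional example, not MeanVoiceFlow itself: the velocity field v(z, t) = z is chosen only because its trajectories are known in closed form, the average velocity is obtained by numerically integrating that field rather than by a trained network, and all function names are illustrative. It also checks numerically the derivative relation that, in the standard mean flow formulation, defines the training target: u = v - (t - r) du/dt.

```python
import math


def velocity(z: float, t: float) -> float:
    """Hypothetical instantaneous velocity v(z, t) = z of a toy 1-D ODE dz/dt = v(z, t).

    Chosen only because its trajectories are known in closed form:
    z(r) = z(t) * exp(-(t - r)).
    """
    return z


def integrate(z_t: float, t: float, r: float, steps: int = 1000) -> float:
    """Reference solution: move from time t to time r with many small RK4 steps."""
    h = (r - t) / steps  # negative when integrating backward (r < t)
    z, tau = z_t, t
    for _ in range(steps):
        k1 = velocity(z, tau)
        k2 = velocity(z + 0.5 * h * k1, tau + 0.5 * h)
        k3 = velocity(z + 0.5 * h * k2, tau + 0.5 * h)
        k4 = velocity(z + h * k3, tau + h)
        z += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        tau += h
    return z


def average_velocity(z_t: float, r: float, t: float) -> float:
    """u(z_t, r, t) = 1/(t - r) * integral_r^t v(z_tau, tau) dtau = (z_t - z_r) / (t - r)."""
    return (z_t - integrate(z_t, t, r)) / (t - r)


def euler_one_step(z_t: float, r: float, t: float) -> float:
    """One step using the instantaneous velocity at the starting point (plain flow matching)."""
    return z_t - (t - r) * velocity(z_t, t)


def mean_flow_one_step(z_t: float, r: float, t: float) -> float:
    """One step using the average velocity over [r, t] (mean flow update)."""
    return z_t - (t - r) * average_velocity(z_t, r, t)


def identity_residual(z_t: float, r: float, t: float, eps: float = 1e-4) -> float:
    """|u - (v - (t - r) * du/dt)|, where du/dt = v * du/dz + partial du/dt.

    Derivatives are central finite differences here; with a neural network they
    would instead be obtained by automatic differentiation (e.g. a Jacobian-vector product).
    """
    u = average_velocity(z_t, r, t)
    du_dz = (average_velocity(z_t + eps, r, t) - average_velocity(z_t - eps, r, t)) / (2 * eps)
    du_dt = (average_velocity(z_t, r, t + eps) - average_velocity(z_t, r, t - eps)) / (2 * eps)
    v = velocity(z_t, t)
    target = v - (t - r) * (v * du_dz + du_dt)
    return abs(u - target)


if __name__ == "__main__":
    z1 = 2.0  # hypothetical state at t = 1
    print("exact     :", z1 * math.exp(-1.0))
    print("Euler     :", euler_one_step(z1, 0.0, 1.0))
    print("mean flow :", mean_flow_one_step(z1, 0.0, 1.0))
    print("identity residual:", identity_residual(z1, 0.0, 1.0))
```

With z1 = 2.0, the mean flow step recovers the exact endpoint 2/e ≈ 0.7358, whereas the single Euler step returns 0.0 because it extrapolates the velocity at t = 1 across the whole interval. The identity residual stays near zero, limited only by finite-difference and integration error. In mean flow training, the right-hand side of this identity serves as the regression target for the network's average velocity, which is why computing the target requires the model's own derivative.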

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing