AC-VC: Non-parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion
Damien Ronssin, Milos Cernak

TL;DR
This paper introduces AC-VC, a low-latency, non-parallel voice conversion system based on phonetic posteriorgrams. It needs only 57.5 ms of future context, which makes a real-time implementation feasible, and it matches the non-causal baseline in naturalness at some cost in speaker similarity.
Contribution
The paper proposes an almost-causal voice conversion system with only 57.5 ms of look-ahead, which makes real-time use feasible while keeping naturalness on par with a non-causal baseline.
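To make the look-ahead figure concrete, the sketch below converts 57.5 ms into a number of future audio samples. Only the 57.5 ms value comes from the paper; the function name and the sample rates are illustrative assumptions, not the authors' configuration.

```python
def lookahead_samples(lookahead_ms: float, sample_rate_hz: int) -> int:
    """Number of future samples covered by a look-ahead given in milliseconds."""
    return round(lookahead_ms * sample_rate_hz / 1000)


LOOKAHEAD_MS = 57.5  # look-ahead reported for AC-VC

# The sample rates below are assumptions chosen for illustration only.
for sample_rate_hz in (16_000, 24_000):
    n = lookahead_samples(LOOKAHEAD_MS, sample_rate_hz)
    print(f"{sample_rate_hz} Hz: {n} future samples")
```

A system with this look-ahead cannot emit output for a given instant until that many future samples have arrived, so the value is a lower bound on end-to-end delay before any compute time is added.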
Findings
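Achieves naturalness on par with the non-causal ASR-TTS baseline of the Voice Conversion Challenge 2020 (MOS 3.5).
Requires only 57.5 ms of future context, which leaves room for a future real-time implementation.
Reaches a speaker similarity of 65%, lower than state-of-the-art systems; the results indicate that the missing future context is the cause.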
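The two metrics quoted above are simple aggregates of listener responses. The sketch below shows one common way such figures are computed: a mean opinion score (MOS) with a normal-approximation confidence interval, and a similarity percentage as the share of "same speaker" judgements. The function names and the toy ratings are hypothetical, and the paper's actual test protocol may differ.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return mean(ratings), half_width


def similarity_percentage(same_speaker_votes):
    """Percentage of trials judged as 'same speaker as the target'."""
    return 100 * sum(same_speaker_votes) / len(same_speaker_votes)


# Toy data for illustration only; these are NOT the paper's listening-test results.
toy_ratings = [4, 3, 4, 2, 5]           # naturalness ratings on a 1-5 scale
toy_votes = [True, True, False, True]   # per-trial same-speaker judgements

mos, ci = mos_with_ci(toy_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
print(f"Similarity = {similarity_percentage(toy_votes):.1f}%")
```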
Abstract
This paper presents AC-VC (Almost Causal Voice Conversion), a phonetic posteriorgrams based voice conversion system that can perform any-to-many voice conversion while having only 57.5 ms future look-ahead. The complete system is composed of three neural networks trained separately with non-parallel data. While most of the current voice conversion systems focus primarily on quality irrespective of algorithmic latency, this work elaborates on designing a method using a minimal amount of future context thus allowing a future real-time implementation. According to a subjective listening test organized in this work, the proposed AC-VC system achieves parity with the non-causal ASR-TTS baseline of the Voice Conversion Challenge 2020 in naturalness with a MOS of 3.5. In contrast, the results indicate that missing future context impacts speaker similarity. Obtained similarity percentage of 65%…
