Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement

Tuan-Nam Nguyen; Ngoc-Quan Pham; Seymanur Akti; Alexander Waibel

arXiv:2506.16580·cs.CL·June 23, 2025

Open Access

TL;DR

This paper introduces the first streaming accent conversion model, which transforms non-native speech into a native-like accent in real time while preserving speaker identity and prosody and improving pronunciation.

Contribution

It presents a novel streaming AC architecture that uses an Emformer encoder and an optimized inference mechanism, enabling real-time accent conversion with stable latency.

Findings

01 Achieves performance comparable to top non-streaming AC models

02 Maintains stable latency during streaming processing

03 Effectively improves pronunciation and preserves speaker identity

Abstract

We propose the first streaming accent conversion (AC) model, which transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves performance comparable to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
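The abstract does not describe the inference mechanism in detail, so the following is only a minimal sketch of the general idea behind chunked streaming with an Emformer-style encoder: audio frames are consumed in fixed-size segments, each seeing a bounded amount of past context and a small lookahead. The function names (`stream_chunks`, `algorithmic_latency_ms`) and all parameter values are hypothetical illustrations, not the authors' implementation, and the sketch omits the Emformer memory bank entirely.

```python
from typing import Iterator, List, Sequence, Tuple


def stream_chunks(
    frames: Sequence[int],
    segment_length: int,
    left_context: int,
    right_context: int,
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (left, segment, right) frame windows, one per streaming step.

    Hypothetical helper: each segment may attend to at most `left_context`
    past frames and `right_context` future (lookahead) frames.
    """
    for start in range(0, len(frames), segment_length):
        end = min(start + segment_length, len(frames))
        left = list(frames[max(0, start - left_context):start])
        segment = list(frames[start:end])
        right = list(frames[end:end + right_context])
        yield left, segment, right


def algorithmic_latency_ms(
    segment_length: int, right_context: int, frame_ms: float
) -> float:
    """Time spanned by one full segment plus its lookahead.

    This is how long the first frame of a segment waits before the segment
    can be processed. It depends only on the window sizes, not on how long
    the utterance is, which is why the latency stays constant.
    """
    return (segment_length + right_context) * frame_ms


# Ten dummy frames, labelled by index, stand in for acoustic features.
frames = list(range(10))
steps = list(stream_chunks(frames, segment_length=4, left_context=2, right_context=1))
latency = algorithmic_latency_ms(segment_length=4, right_context=1, frame_ms=10.0)
```

With these assumed settings the second step sees frames `[2, 3]` as left context, `[4, 5, 6, 7]` as the current segment, and `[8]` as lookahead, and the algorithmic latency is 50 ms for every full segment. Because the window sizes are fixed, the wait does not grow with utterance length, which is the property a stable-latency streaming system relies on.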

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Natural Language Processing Techniques