Accent Normalization Using Self-Supervised Discrete Tokens with Non-Parallel Data

Qibing Bai; Sho Inoue; Shuai Wang; Zhongjie Jiang; Yannan Wang; Haizhou Li

arXiv:2507.17735 · eess.AS · July 24, 2025

TL;DR

This paper introduces an accent normalization method, built on self-supervised discrete tokens and trained on non-parallel data, that converts accented speech into native-like speech while preserving speaker identity. It outperforms a frame-to-frame baseline in naturalness and accentedness reduction.

Contribution

The paper presents a pipeline that extracts self-supervised discrete tokens from the source speech, converts them with a dedicated model, and synthesizes the output with flow matching, all without requiring parallel training data.

Findings

01 Outperforms a frame-to-frame baseline in naturalness

02 Reduces accentedness more than the baseline across multiple English accents

03 Preserves speaker timbre better than the baseline during normalization

Abstract

Accent normalization converts foreign-accented speech into native-like speech while preserving speaker identity. We propose a novel pipeline using self-supervised discrete tokens and non-parallel training data. The system extracts tokens from source speech, converts them through a dedicated model, and synthesizes the output using flow matching. Our method demonstrates superior performance over a frame-to-frame baseline in naturalness, accentedness reduction, and timbre preservation across multiple English accents. Through token-level phonetic analysis, we validate the effectiveness of our token-based approach. We also develop two duration preservation methods, suitable for applications such as dubbing.
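
The three stages named in the abstract (token extraction, token conversion, flow-matching synthesis) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names `extract_tokens`, `convert_tokens`, and `synthesize` are hypothetical, nearest-centroid quantization stands in for the self-supervised tokenizer, a fixed lookup table stands in for the learned conversion model, and the velocity field is an oracle computed from a known target, where a real system would use a trained network.

```python
import numpy as np


def extract_tokens(features, codebook):
    """Toy tokenizer: map each frame to the index of its nearest codebook vector.

    features: (T, D) array of frame features; codebook: (K, D) array.
    """
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)


def convert_tokens(tokens, mapping):
    """Toy stand-in for the conversion model: a fixed token-to-token lookup.

    Tokens absent from the mapping pass through unchanged.
    """
    return np.array([mapping.get(int(t), int(t)) for t in tokens])


def synthesize(tokens, codebook, steps=10, seed=0):
    """Toy flow-matching-style sampler using Euler integration.

    Starts from Gaussian noise and follows the straight-line path
    x_t = (1 - t) * x_0 + t * x_1 toward a target built from the tokens.
    The velocity here is an oracle, (x_1 - x_t) / (1 - t); a trained
    model would predict it instead.
    """
    rng = np.random.default_rng(seed)
    target = codebook[tokens]                  # (T, D) conditioning target
    x = rng.standard_normal(target.shape)      # x_0 ~ N(0, I)
    for i in range(steps):
        t = i / steps
        velocity = (target - x) / (1.0 - t)
        x = x + velocity / steps               # Euler step of size 1/steps
    return x
```

Chaining the stages as `synthesize(convert_tokens(extract_tokens(features, codebook), mapping), codebook)` mirrors the order described above. Because the oracle velocity is exact for a straight-line path, the sampler lands on the target features; with a learned velocity field the output would only approximate them.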

Taxonomy

Topics: Neural Networks and Applications