AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun

TL;DR
This paper introduces AccentBox, a two-stage zero-shot accent generation system that improves accent fidelity and control in TTS by combining accent identification and speaker-agnostic accent embeddings.
Contribution
It unifies Foreign Accent Conversion (FAC), accented TTS, and Zero-Shot TTS (ZS-TTS) in a two-stage pipeline that pairs a state-of-the-art accent identification model with improved accent fidelity in zero-shot scenarios.
Findings
Achieves an F1 score of 0.56 on unseen speakers for accent identification, reported as state-of-the-art.
Outperforms previous methods in accent fidelity for zero-shot accent generation.
Enables generation of unseen accents with high fidelity.
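The "unseen speakers" figure above implies an evaluation in which test speakers never appear in training. A minimal sketch of such a speaker-disjoint split and an F1 computation follows. The function names, the dictionary layout, and the choice of macro averaging are assumptions made for illustration; the summary does not state which F1 variant the authors report.

```python
def speaker_disjoint_split(utterances, held_out_speakers):
    """Split utterances so that no held-out speaker appears in training.

    Each utterance is a dict with at least a "speaker" key (hypothetical layout).
    """
    train = [u for u in utterances if u["speaker"] not in held_out_speakers]
    test = [u for u in utterances if u["speaker"] in held_out_speakers]
    return train, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: accent labels are illustrative, not taken from the paper.
utterances = [
    {"speaker": "s1", "accent": "us"},
    {"speaker": "s2", "accent": "uk"},
    {"speaker": "s3", "accent": "us"},
]
train, test = speaker_disjoint_split(utterances, held_out_speakers={"s3"})
score = macro_f1(["us", "us", "uk", "uk"], ["us", "uk", "uk", "uk"])
```

Splitting by speaker rather than by utterance keeps the classifier from scoring well by recognising a voice instead of an accent.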
Abstract
While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) performance on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross accent generation, and enables unseen accent generation.
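The second stage can be pictured as supplying the TTS model with two separate conditioning signals: one for speaker identity and one for accent. The sketch below is a toy illustration only. `Conditioning`, `build_conditioning`, the toy encoders, and the use of concatenation are all hypothetical; the abstract does not specify how the accent embedding is injected into the ZS-TTS system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]
Reference = Dict[str, Vector]  # stand-in for a reference utterance


@dataclass
class Conditioning:
    speaker: Vector  # speaker identity embedding
    accent: Vector   # speaker-agnostic accent embedding from the AID model

    def as_vector(self) -> Vector:
        # Concatenation is an assumption; other fusion schemes are possible.
        return self.speaker + self.accent


def build_conditioning(
    speaker_ref: Reference,
    accent_ref: Reference,
    speaker_encoder: Callable[[Reference], Vector],
    accent_encoder: Callable[[Reference], Vector],
) -> Conditioning:
    """Take speaker identity from one reference and accent from another."""
    return Conditioning(
        speaker=speaker_encoder(speaker_ref),
        accent=accent_encoder(accent_ref),
    )


# Toy encoders that read precomputed vectors; real encoders would be neural models.
def toy_speaker_encoder(ref: Reference) -> Vector:
    return list(ref["speaker_vec"])


def toy_accent_encoder(ref: Reference) -> Vector:
    return list(ref["accent_vec"])


ref_a = {"speaker_vec": [0.1, 0.9], "accent_vec": [1.0, 0.0]}
ref_b = {"speaker_vec": [0.7, 0.3], "accent_vec": [0.0, 1.0]}

# Inherent accent generation: speaker and accent come from the same reference.
inherent = build_conditioning(ref_a, ref_a, toy_speaker_encoder, toy_accent_encoder)
# Cross accent generation: speaker A's voice with the accent of reference B.
cross = build_conditioning(ref_a, ref_b, toy_speaker_encoder, toy_accent_encoder)
```

Keeping the two embeddings separate is what makes the cross case possible: the accent reference can change while the speaker reference stays fixed.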
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Optical Sensing Technologies · Neural Networks and Reservoir Computing
