MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

Tao Feng; Yuxiang Wang; Yuancheng Wang; Xueyao Zhang; Dekun Chen; Chaoren Wang; Xun Guan; Zhizheng Wu

arXiv:2604.11552·cs.SD·April 21, 2026

TL;DR

MimicLM is a zero-shot voice imitation method that trains on synthetic speech as sources and real recordings as targets, combining interleaved text-audio modeling with preference alignment to outperform existing approaches in naturalness and speaker similarity.

Contribution

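The paper presents MimicLM, which uses synthetic speech as training sources and real recordings as training targets. This sidesteps both the scarcity of true (source, reference, target) triplets and the quality ceiling that arises when synthetic speech is used as the target.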

Findings

01. MimicLM achieves higher naturalness than existing methods.

02. It maintains competitive speaker similarity scores.

03. The approach effectively leverages synthetic data for training.

Abstract

Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference's voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to…
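The data design described in the abstract (synthetic speech as the source, a real recording as the target) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `Utterance` and `Triplet` containers, the `build_triplets` helper, and the `synthesize` callable standing in for whatever external system produces the synthetic source. The choice of reference (another real utterance from the same speaker) is also an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Utterance:
    speaker: str
    text: str
    audio: str  # stand-in for a waveform or a file path


@dataclass(frozen=True)
class Triplet:
    source: str     # synthetic speech carrying the same content as the target
    reference: str  # real speech from the target speaker (assumed selection rule)
    target: str     # real recording used as the training target


def build_triplets(
    corpus: List[Utterance],
    synthesize: Callable[[str], str],  # hypothetical external synthesizer
) -> List[Triplet]:
    """Pair each real recording with a synthetic source of the same text."""
    by_speaker: Dict[str, List[Utterance]] = {}
    for utt in corpus:
        by_speaker.setdefault(utt.speaker, []).append(utt)

    triplets: List[Triplet] = []
    for utts in by_speaker.values():
        if len(utts) < 2:
            continue  # a second utterance is needed to serve as the reference
        for i, target in enumerate(utts):
            # Assumption: use the speaker's next utterance as the reference.
            reference = utts[(i + 1) % len(utts)]
            triplets.append(
                Triplet(
                    source=synthesize(target.text),
                    reference=reference.audio,
                    target=target.audio,
                )
            )
    return triplets
```

Because the target is always a real recording, the quality of the synthesizer only affects the model's inputs, not what it is trained to produce, which is the point the abstract makes about avoiding a quality ceiling.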
