A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

Jia-Hong Huang; Seulgi Kim; Yi Chieh Liu; Yixian Shen; Hongyi Zhu; Prayag Tiwari; Stevan Rudinac; Evangelos Kanoulas

arXiv:2604.06327 · cs.SD · April 9, 2026

TL;DR

This paper introduces an automatic framework for detecting speaker drift in synthesized speech, combining cosine similarity analysis with large language models to improve coherence in long-form TTS outputs.

Contribution

The paper presents the first automated method for speaker drift detection, combining geometric analysis of speaker embeddings (cosine similarity) with LLM reasoning, and supports it with a new synthetic benchmark with human annotations.

Findings

01 Cosine similarity effectively captures speaker consistency within utterances.

02 LLMs can assess speaker drift based on structured speech representations.

03 The framework outperforms baseline methods in detecting speaker drift.

Abstract

Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality…
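
The segment-level comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window and hop sizes, and the 0.7 threshold are all hypothetical, and the speaker embeddings are assumed to come from some external speaker encoder that is not shown here. The final thresholding step is only a simple stand-in for the LLM-based judgment the paper describes.

```python
import numpy as np


def segment_bounds(n_samples, win, hop):
    """Return (start, end) sample indices of overlapping windows.

    Choosing hop < win makes consecutive windows overlap.
    """
    if n_samples < win:
        return [(0, n_samples)]
    return [(s, s + win) for s in range(0, n_samples - win + 1, hop)]


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between per-segment speaker embeddings."""
    E = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    # Project onto the unit sphere; guard against zero-length vectors.
    E = E / np.clip(norms, 1e-12, None)
    return E @ E.T


def flag_drift(embeddings, threshold=0.7):
    """Hypothetical decision rule: compare every segment to the first one.

    The threshold is an assumed placeholder, not a value from the paper,
    which prompts an LLM instead of applying a fixed cutoff.
    """
    sims = cosine_similarity_matrix(embeddings)
    min_sim = float(sims[0].min())
    return min_sim < threshold, min_sim
```

In use, one would slice the waveform with `segment_bounds`, embed each slice with a speaker encoder of choice, and pass the stacked embeddings to `cosine_similarity_matrix`. The resulting matrix is one plausible form of the "structured representation" that could be serialized into an LLM prompt.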
