Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper

Hoan My Tran; Xin Wang; Wanying Ge; Xuechen Liu; Junichi Yamagishi

arXiv:2602.22658 · eess.AS · March 3, 2026

Open Access

TL;DR

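This paper introduces a cost-effective method for detecting deepfake words in speech: a pre-trained Whisper model is fine-tuned to flag synthetic words while transcribing the utterance via next-token prediction. It reports low error rates on in-domain data and performance on par with a dedicated ResNet-based detector on out-of-domain data.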

Contribution

The study fine-tunes Whisper to detect synthetic words while transcribing, and uses partially vocoded utterances as fine-tuning data to reduce data collection costs, while remaining on par with a dedicated ResNet-based detector on out-of-domain data.

Findings

01. Low detection error rates on in-domain data

02. Comparable performance to dedicated models on out-of-domain data

03. Performance degradation on unseen speech-generative models

Abstract

Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized with speech-generative models. While a dedicated synthetic word detector could be developed, we developed a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thus reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced with unseen speech-generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall…
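
One way to picture detection "while transcribing" is to fold the word-level labels into the transcript itself, so that a single next-token target carries both the words and their bona fide or synthetic status. The sketch below is only an illustration under that assumption: the marker tokens `<fake>` and `</fake>` and the helpers `build_target` and `parse_prediction` are hypothetical names, not the labeling scheme or code used in the paper, and no Whisper model is involved.

```python
from typing import List, Tuple

# Hypothetical marker tokens (an assumption for illustration only);
# the paper's actual labeling scheme is not described on this page.
FAKE_OPEN = "<fake>"
FAKE_CLOSE = "</fake>"


def build_target(words: List[str], is_fake: List[bool]) -> str:
    """Interleave marker tokens with the transcript so that one token
    sequence carries both the words and their synthetic/bona fide labels."""
    if len(words) != len(is_fake):
        raise ValueError("words and is_fake must have the same length")
    pieces = []
    for word, fake in zip(words, is_fake):
        pieces.append(f"{FAKE_OPEN} {word} {FAKE_CLOSE}" if fake else word)
    return " ".join(pieces)


def parse_prediction(text: str) -> List[Tuple[str, bool]]:
    """Recover (word, is_fake) pairs from a decoded token sequence."""
    labeled = []
    inside_fake = False
    for token in text.split():
        if token == FAKE_OPEN:
            inside_fake = True
        elif token == FAKE_CLOSE:
            inside_fake = False
        else:
            labeled.append((token, inside_fake))
    return labeled


# Example: the middle word of a bona fide utterance has been replaced.
target = build_target(["send", "ten", "dollars"], [False, True, False])
print(target)
print(parse_prediction(target))
```

In a setup like this, `build_target` would produce the fine-tuning labels and `parse_prediction` would turn the decoded output back into per-word decisions, from which both transcription and detection error rates could be computed.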

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling