What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection

Binh Nguyen; Shuji Shi; Ryan Ofman; Thai Le

arXiv:2505.17513·cs.LG·May 26, 2025

TL;DR

This paper reveals that linguistic variations can significantly undermine deepfake speech detectors, exposing vulnerabilities that necessitate incorporating linguistic robustness into anti-spoofing systems.

Contribution

It introduces transcript-level adversarial attacks to evaluate linguistic sensitivity in speech deepfake detectors, highlighting vulnerabilities overlooked by acoustic-focused defenses.

Findings

01

Even minor linguistic perturbations significantly reduce detection accuracy, with attack success rates surpassing 60% on several open-source detector-voice pairs.

02

One commercial detector's accuracy on synthetic audio drops from 100% to 32% under attack.

03

Both linguistic complexity and audio embedding similarity influence detector vulnerability.

Abstract

Recent advances in text-to-speech technologies have enabled realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior work has predominantly focused on acoustic-level perturbations, leaving the impact of linguistic variation largely unexplored. In this paper, we investigate the linguistic sensitivity of both open-source and commercial anti-spoofing detectors by introducing transcript-level adversarial attacks. Our extensive evaluation reveals that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates surpass 60% on several open-source detector-voice pairs, and notably, one commercial detector's accuracy drops from 100% on synthetic audio to just 32%. Through a comprehensive feature attribution analysis, we identify that…
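The abstract reports attack success rates without defining the metric on this page. As a minimal sketch, assume the common adversarial-evaluation convention: among synthetic clips the detector flags before the attack, the attack success rate is the fraction it no longer flags once the transcript is perturbed and re-synthesized. The function name `attack_success_rate` and the boolean-list inputs are hypothetical choices for illustration, not the authors' code, and the paper's exact definition may differ.

```python
from typing import Sequence


def attack_success_rate(
    flagged_before: Sequence[bool],
    flagged_after: Sequence[bool],
) -> float:
    """Fraction of initially detected synthetic clips that evade detection
    after the transcript-level perturbation (assumed definition)."""
    if len(flagged_before) != len(flagged_after):
        raise ValueError("inputs must have the same length")

    # Only clips the detector caught before the attack count as attack targets.
    eligible = [after for before, after in zip(flagged_before, flagged_after) if before]
    if not eligible:
        return 0.0

    # A clip evades detection when it is no longer flagged as synthetic.
    evaded = sum(1 for after in eligible if not after)
    return evaded / len(eligible)


# Toy data: five synthetic clips, four detected before the attack.
before = [True, True, True, True, False]
after = [False, True, False, False, False]
print(attack_success_rate(before, after))  # 0.75
```

Under this assumed definition, a detector that flags every synthetic clip before the attack and only 32% of them afterwards would score 0.68.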
