Human perception of audio deepfakes: the role of language and speaking style

Eugenia San Segundo; Aurora López-Jareño; Xin Wang; Junichi Yamagishi

arXiv:2512.09221·eess.AS·December 11, 2025

Open Access

TL;DR

This study investigates how native Spanish and Japanese listeners perceive audio deepfakes. Listeners identified them with an average accuracy of about 59% and relied mainly on prosodic cues such as intonation and rhythm.

Contribution

It provides new cross-linguistic insights into perceptual cues used to identify audio deepfakes, highlighting the role of suprasegmental features and cultural differences.

Findings

01. Average detection accuracy of 59.11%

02. Listeners rely on prosodic cues over segmental features

03. Cross-linguistic differences in perceptual strategies

Abstract

Audio deepfakes have reached a level of realism that makes it increasingly difficult to distinguish between human and artificial voices, which poses risks such as identity theft or spread of disinformation. Despite these concerns, research on humans' ability to identify deepfakes is limited, with most studies focusing on English and very few exploring the reasons behind listeners' perceptual decisions. This study addresses this gap through a perceptual experiment in which 54 listeners (28 native Spanish speakers and 26 native Japanese speakers) classified voices as natural or synthetic, and justified their choices. The experiment included 80 stimuli (50% artificial), organized according to three variables: language (Spanish/Japanese), speech style (audiobooks/interviews), and familiarity with the voice (familiar/unfamiliar). The goal was to examine how these variables influence…

Taxonomy

Topics: Emotion and Mood Recognition · Phonetics and Phonology Research · Multisensory perception and integration