Human perception of audio deepfakes: the role of language and speaking style
Eugenia San Segundo, Aurora López-Jareño, Xin Wang, Junichi Yamagishi

TL;DR
This study investigates how native speakers of Spanish and Japanese perceive audio deepfakes. Listeners rely mainly on prosodic cues such as intonation and rhythm, and their average detection accuracy is about 59%.
Contribution
It provides new cross-linguistic insights into perceptual cues used to identify audio deepfakes, highlighting the role of suprasegmental features and cultural differences.
Findings
Average detection accuracy of 59.11%
Listeners rely on prosodic cues over segmental features
Cross-linguistic differences in perceptual strategies
Abstract
Audio deepfakes have reached a level of realism that makes it increasingly difficult to distinguish between human and artificial voices, which poses risks such as identity theft or the spread of disinformation. Despite these concerns, research on humans' ability to identify deepfakes is limited, with most studies focusing on English and very few exploring the reasons behind listeners' perceptual decisions. This study addresses this gap through a perceptual experiment in which 54 listeners (28 native Spanish speakers and 26 native Japanese speakers) classified voices as natural or synthetic and justified their choices. The experiment included 80 stimuli (50% artificial), organized according to three variables: language (Spanish/Japanese), speech style (audiobooks/interviews), and familiarity with the voice (familiar/unfamiliar). The goal was to examine how these variables influence…
Taxonomy
Topics: Emotion and Mood Recognition · Phonetics and Phonology Research · Multisensory perception and integration
