ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

Aurosweta Mahapatra; Ismail Rasim Ulgen; Kong Aik Lee; Nicholas Andrews; Berrak Sisman

arXiv:2604.13229·eess.AS·April 16, 2026

TL;DR

ProSDD is a two-stage prosodic learning framework that improves speech deepfake detection, especially against expressive and emotional attacks, by modeling the natural variability of real speech.

Contribution

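The paper proposes supervised masked prediction of speaker-conditioned prosodic variation (pitch, voice activity, and energy) to enrich model embeddings, with the aim of generalizing better than detectors that learn dataset-specific artifacts from spoof-heavy training data.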

Findings

01. Reduces ASVspoof 2024 EER from 25.43% to 16.14% when trained on ASVspoof 2019.

02. Achieves 50% relative reduction on EmoFake and EmoSpoof-TTS.

03. Outperforms baselines under both ASVspoof 2019 and ASVspoof 2024 training.
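The findings mix absolute EER values with a relative reduction, so it can help to put them on the same scale. The short sketch below converts the reported ASVspoof 2024 figures (25.43% to 16.14%) into a relative reduction; the function name `relative_reduction` is hypothetical and is not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent between two error rates given in percent."""
    return 100.0 * (baseline - improved) / baseline


# EER values reported for ASVspoof 2024 evaluation.
asvspoof_2024 = relative_reduction(25.43, 16.14)
print(f"{asvspoof_2024:.2f}%")  # prints 36.53%
```

By the same definition, a 50% relative reduction means the error rate is halved with respect to the baseline.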

Abstract

Speech deepfake detection (SDD) systems perform well on standard benchmark datasets but often fail to generalize to expressive and emotional spoofing attacks. Many methods rely on spoof-heavy training data, learning dataset-specific artifacts rather than transferable cues of natural speech. In contrast, humans internalize variability in real speech and detect fakes as deviations from it. We introduce ProSDD, a two-stage framework that enriches model embeddings through supervised masked prediction of speaker-conditioned prosodic variation based on pitch, voice activity, and energy. Stage I learns prosodic variability from real speech, and Stage II jointly optimizes this objective with spoof classification. ProSDD consistently outperforms baselines under both ASVspoof 2019 and 2024 training, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38%…
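To make the two-stage idea concrete, here is a minimal sketch of how a masked prosody prediction loss could be combined with a spoof classification loss. It is not the authors' implementation: the abstract does not specify the loss functions, the masking scheme, the loss weighting, or how speaker conditioning enters, so the mean squared error, the random frame mask, the binary cross-entropy, the `weight` parameter, and all function names below are assumptions made for illustration. Speaker conditioning is omitted entirely.

```python
import numpy as np


def masked_prediction_loss(pred, target, mask):
    """Mean squared error over masked frames only (loss choice is an assumption)."""
    sq_err = ((pred - target) ** 2).mean(axis=-1)  # per-frame error, shape (T,)
    return float((sq_err * mask).sum() / max(mask.sum(), 1))


def spoof_classification_loss(p_spoof, is_spoof):
    """Binary cross-entropy for one utterance (loss choice is an assumption)."""
    p = np.clip(p_spoof, 1e-7, 1 - 1e-7)
    return float(-(is_spoof * np.log(p) + (1 - is_spoof) * np.log(1 - p)))


def stage_two_loss(pred, target, mask, p_spoof, is_spoof, weight=1.0):
    """Joint objective: classification plus weighted prosody prediction.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    return (spoof_classification_loss(p_spoof, is_spoof)
            + weight * masked_prediction_loss(pred, target, mask))


rng = np.random.default_rng(0)
T = 100                                         # number of frames
target = rng.normal(size=(T, 3))                # stand-in for pitch, voice activity, energy
mask = rng.random(T) < 0.3                      # frames hidden from the model
pred = target + 0.1 * rng.normal(size=(T, 3))   # stand-in for model predictions

# Stage I analogue: prosody prediction only, on real speech.
stage_one = masked_prediction_loss(pred, target, mask)
# Stage II analogue: the same objective optimized jointly with spoof classification.
stage_two = stage_two_loss(pred, target, mask, p_spoof=0.2, is_spoof=0)
```

In this sketch only the masked frames contribute to the prosody loss, so the model is scored on reconstructing prosodic targets it could not see directly. A real system would compute `pred` from learned embeddings rather than from noisy copies of the targets.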
