The Affective Bridge: Preserving Speech Representations while Enhancing Deepfake Detection via Emotional Constraints
Yupei Li, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang, Björn W. Schuller

TL;DR
This paper introduces an emotion-guided training framework for speech deepfake detection that enhances discriminative cues while preserving original speech semantics, leading to improved detection accuracy.
Contribution
It proposes a novel, feature-agnostic, and non-destructive training method using emotion as a bridging constraint to improve speech deepfake detection.
Findings
Up to 6% accuracy improvement on FakeOrReal
Up to 2% accuracy improvement on IntheWild
Reductions in equal error rate
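For readers unfamiliar with the metric, equal error rate (EER) is the operating point at which the false positive rate and the false negative rate coincide. The sketch below is a generic NumPy illustration, not the paper's evaluation code. The function name `equal_error_rate`, the label convention (1 = fake, 0 = real), and the assumption that a higher score means "more likely fake" are all choices made here for illustration; both classes are assumed to be present.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumed convention (not from the paper): labels are 1 for fake and
    0 for real, and a higher score means "more likely fake".
    Returns (eer, threshold).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fake_scores = scores[labels == 1]
    real_scores = scores[labels == 0]

    thresholds = np.unique(scores)  # sorted, de-duplicated candidates
    # False positive rate: real clips flagged as fake (score >= threshold).
    fpr = np.array([(real_scores >= t).mean() for t in thresholds])
    # False negative rate: fake clips that slip through (score < threshold).
    fnr = np.array([(fake_scores < t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(fpr - fnr)))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return float(eer), float(thresholds[idx])
```

A lower EER means the detector separates real and fake speech more cleanly; a detector that separates the two classes perfectly reaches an EER of 0.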
Abstract
Speech deepfake detection (DFD) has benefited from diverse acoustic and semantic speech representations, many of which encode valuable speech information and are costly to train. Existing approaches typically enhance DFD by tuning the representations or applying post-hoc classification on frozen features, limiting control over improving discriminative DF cues without distorting original semantics. We find that emotion is encoded across diverse speech features and correlates with DFD. Therefore, we introduce a unified, feature-agnostic, and non-destructive training framework that uses emotion as a bridging constraint to guide speech features toward DFD, treating emotion recognition as a representation alignment objective rather than an auxiliary task, while preserving the original semantic information. Experiments on FakeOrReal and IntheWild show accuracy improvements of up to 6% and…
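The abstract does not give the training objective, so the following is only a hypothetical sketch of how a detection loss, an emotion alignment term, and a feature-preservation penalty could be combined. Every name and design choice here is an assumption made for illustration: the cosine-distance alignment against embeddings from some emotion model, the mean-squared preservation penalty, and the weights `lam_emo` and `lam_keep` are not taken from the paper.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean softmax cross-entropy for integer class targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def cosine_alignment(feats, emo_embeds, eps=1e-8):
    """Mean cosine distance between features and emotion embeddings.

    Hypothetical alignment term: 0 when each pair points the same way.
    """
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    e = emo_embeds / (np.linalg.norm(emo_embeds, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(f * e, axis=1)))

def combined_loss(orig_feats, adapted_feats, emo_embeds,
                  dfd_logits, dfd_labels, lam_emo=0.5, lam_keep=1.0):
    """Illustrative objective, NOT the paper's loss.

    detect: real/fake classification loss on the adapted features' logits
    align:  pulls adapted features toward the emotion embeddings
    keep:   penalizes drift from the original (frozen) speech features
    """
    detect = cross_entropy(dfd_logits, dfd_labels)
    align = cosine_alignment(adapted_feats, emo_embeds)
    keep = float(np.mean((adapted_feats - orig_feats) ** 2))
    total = detect + lam_emo * align + lam_keep * keep
    return total, {"detect": detect, "align": align, "keep": keep}
```

In this sketch the `keep` term stands in for the "non-destructive" requirement and the `align` term for the emotion "bridge"; how the authors actually realize either constraint is not stated in the abstract.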
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining
