Emotion and Acoustics Should Agree: Cross-Level Inconsistency Analysis for Audio Deepfake Detection
Jinhua Zhang, Zhenqi Jia, Rui Liu

TL;DR
This paper introduces EAI-ADD, a novel audio deepfake detection method that leverages cross-level inconsistencies between emotional and acoustic features to improve spoof speech detection accuracy.
Contribution
The paper proposes a new approach that models cross-level emotion-acoustic inconsistencies, addressing limitations of previous correlation-based methods in audio deepfake detection.
Findings
EAI-ADD outperforms baseline methods on ASVspoof datasets.
Cross-level emotion-acoustic inconsistency is an effective detection signal.
The method captures subtle desynchronizations missed by prior approaches.
Abstract
Audio Deepfake Detection (ADD) aims to distinguish spoof speech from bonafide speech. Most prior studies assume that stronger correlations within or across acoustic and emotional features imply authenticity, and thus focus on enhancing or measuring such correlations. However, existing methods often treat acoustic and emotional features in isolation or rely on correlation metrics, which overlook subtle desynchronization between them and smooth out abrupt discontinuities. To address these issues, we propose EAI-ADD, which treats cross-level emotion-acoustic inconsistency as the primary detection signal. We first project emotional and acoustic representations into a comparable space. Then we progressively integrate frame-level and utterance-level emotion features with acoustic features to capture cross-level emotion-acoustic inconsistencies across different temporal granularities. Experimental…
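The abstract describes projecting emotional and acoustic representations into a comparable space and then comparing them at frame and utterance granularity. The following is a minimal illustrative sketch of that general idea only, not the EAI-ADD architecture: the linear projections, the cosine-distance measure, mean pooling, and every function name (`project`, `cosine_distance`, `mean_pool`, `inconsistency_scores`) are assumptions introduced here for illustration.

```python
import math


def project(frames, weights):
    """Apply a linear map to every frame (each row of `weights` is one output dim).

    Assumption: a plain linear projection stands in for whatever mapping
    the paper uses to bring both feature types into a comparable space.
    """
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]


def cosine_distance(u, v):
    """Return 1 - cosine similarity (0 = same direction, 2 = opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def mean_pool(frames):
    """Average over time to obtain one utterance-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def inconsistency_scores(emotion_frames, acoustic_frames, w_emo, w_aco):
    """Compute hypothetical frame-level and utterance-level inconsistency.

    frame_level[t]     : distance between emotion and acoustic vectors at frame t
    utterance_level[t] : distance between the pooled (utterance-level) emotion
                         vector and the acoustic vector at frame t
    """
    emo = project(emotion_frames, w_emo)
    aco = project(acoustic_frames, w_aco)
    frame_level = [cosine_distance(e, a) for e, a in zip(emo, aco)]
    utt_emo = mean_pool(emo)
    utterance_level = [cosine_distance(utt_emo, a) for a in aco]
    return frame_level, utterance_level


if __name__ == "__main__":
    # Toy 2-dimensional features for three frames; values are made up.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    emotion = [[1.0, 0.5], [0.9, 0.6], [1.0, 0.4]]
    acoustic = [[1.0, 0.5], [-0.9, -0.6], [1.0, 0.4]]  # frame 1 is flipped
    frame_level, utterance_level = inconsistency_scores(
        emotion, acoustic, identity, identity
    )
    print(frame_level)
    print(utterance_level)
```

In this toy setup, a frame whose acoustic vector disagrees with its emotion vector receives a large distance at both granularities, which is the kind of abrupt local mismatch that averaging a correlation over the whole utterance would tend to smooth out.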
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Digital Media Forensic Detection
