TL;DR
This paper introduces Listening Deepfake Detection (LDD), a new task focusing on detecting forgeries in listening scenarios, with a novel dataset and a motion-aware, audio-guided model that outperforms existing methods.
Contribution
It presents ListenForge, the first dataset for listening deepfake detection, and proposes MANet, a model that captures motion inconsistencies guided by audio semantics.
Findings
Existing speaking deepfake detectors perform poorly on listening scenarios.
MANet significantly outperforms existing models on the ListenForge dataset.
The study emphasizes the importance of multimodal analysis beyond speaking-centric deepfake detection.
Abstract
Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening…
