Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study
Alabi Ahmed, Vandana Janeja, Sanjay Purushotham

TL;DR
This paper introduces a taxonomy, a new dataset, and baseline benchmarks for multi-speaker conversational audio deepfakes, addressing a critical gap in research on detecting synthetic multi-speaker dialogues.
Contribution
It presents the first comprehensive taxonomy, a publicly available dataset (MsCADD), and baseline model evaluations for multi-speaker conversational deepfake detection.
Findings
Baseline models show a significant detection gap in multi-speaker deepfakes.
The MsCADD dataset contains 2,830 audio clips of real and fully synthetic two-speaker conversations.
Benchmark results highlight the need for improved detection methods.
Abstract
The rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, real-world malicious applications in multi-speaker conversational settings are also emerging as a major underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or multiple speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue…
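The partial/full distinction in the abstract can be illustrated with a small sketch. This is not the authors' code: the `Manipulation` enum, the `classify` function, and the per-speaker boolean input format are all hypothetical. It also makes a simplifying assumption, namely that each flag means a speaker's entire track was synthesized, so a conversation in which every speaker is flagged counts as a full manipulation. The paper's taxonomy may draw that line differently.

```python
from enum import Enum


class Manipulation(Enum):
    """Hypothetical labels mirroring the categories named in the abstract."""
    REAL = "real"        # no speaker altered
    PARTIAL = "partial"  # some, but not all, speakers altered
    FULL = "full"        # entire conversation synthesized


def classify(speaker_is_synthetic):
    """Map per-speaker synthetic flags to a category.

    speaker_is_synthetic: dict of speaker id -> bool (assumed input format).
    Assumption: a True flag means that speaker's whole track is synthetic.
    """
    if not speaker_is_synthetic:
        raise ValueError("conversation has no speakers")
    flags = list(speaker_is_synthetic.values())
    if all(flags):
        return Manipulation.FULL
    if any(flags):
        return Manipulation.PARTIAL
    return Manipulation.REAL
```

Under these assumptions, a two-speaker clip with both speakers flagged would fall under `Manipulation.FULL`, which corresponds to the fully synthetic conversations the abstract describes for MsCADD.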
Taxonomy
Topics: Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning · Emotion and Mood Recognition
