Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study

Alabi Ahmed; Vandana Janeja; Sanjay Purushotham

arXiv:2602.00295·cs.SD·February 3, 2026

Open Access

TL;DR

This paper introduces a taxonomy, a new dataset, and baseline benchmarks for multi-speaker conversational audio deepfakes, addressing a critical gap in research on detecting synthetic multi-speaker dialogues.

Contribution

It presents the first comprehensive taxonomy, a publicly available dataset (MsCADD), and baseline model evaluations for multi-speaker conversational deepfake detection.

Findings

1. Baseline models show a significant detection gap in multi-speaker deepfakes.

2. The MsCADD dataset contains 2,830 real and synthetic two-speaker conversation clips.

3. Benchmark results highlight the need for improved detection methods.

Abstract

Rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, real-world malicious applications in multi-speaker conversational settings are also emerging as a major, underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or multiple speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue…
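The partial/full distinction in the abstract can be illustrated with a small sketch. This is not the authors' code: the names `ManipulationType` and `classify_conversation` are hypothetical, and the rule that a conversation counts as "full" only when every speaker is synthetic is an assumption read from the abstract's wording, not a definition taken from the paper.

```python
from enum import Enum
from typing import Mapping


class ManipulationType(Enum):
    """Hypothetical labels mirroring the abstract's taxonomy."""
    REAL = "real"        # no speaker is synthetic
    PARTIAL = "partial"  # one or more speakers altered, but not all (assumed reading)
    FULL = "full"        # every speaker synthetic, i.e. the whole conversation


def classify_conversation(speaker_is_synthetic: Mapping[str, bool]) -> ManipulationType:
    """Map per-speaker synthetic flags to a taxonomy label.

    Assumption: 'full' means all speakers are synthetic; any mix of real and
    synthetic speakers is treated as 'partial'.
    """
    if not speaker_is_synthetic:
        raise ValueError("at least one speaker is required")
    flags = list(speaker_is_synthetic.values())
    if all(flags):
        return ManipulationType.FULL
    if any(flags):
        return ManipulationType.PARTIAL
    return ManipulationType.REAL
```

Under this reading, the fully synthetic two-speaker clips described for MsCADD would map to `ManipulationType.FULL`, and the real clips to `ManipulationType.REAL`.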

Taxonomy

Topics: Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning · Emotion and Mood Recognition