Context and Transcripts Improve Detection of Deepfake Audios of Public Figures

Chongyang Gao; Marco Postiglione; Julian Baldwin; Natalia Denisenko; Isabel Gortner; Luke Fosdick; Chiara Pulice; Sarit Kraus; V.S. Subrahmanian

arXiv:2601.13464·cs.AI·January 21, 2026

Open Access

TL;DR

This paper demonstrates that incorporating context and transcripts significantly enhances deepfake audio detection accuracy, introduces a novel architecture called CADD, and provides a new dataset for public figure deepfakes.

Contribution

The paper introduces CADD, a context-based audio deepfake detector, and a new dataset of public figure deepfakes, showing improved robustness and performance over existing methods.

Findings

01

Context and transcripts improve the F1-score of baseline detectors by 5% to 37.58%.

02

CADD is more robust to adversarial evasion strategies, with less than 1% performance degradation.

03

Performance gains are consistent across multiple datasets and baseline detectors.

Abstract

Humans use context to assess the veracity of information. However, current audio deepfake detectors only analyze the audio file without considering either context or transcripts. We create and analyze a Journalist-provided Deepfake Dataset (JDD) of 255 public deepfakes which were primarily contributed by over 70 journalists since early 2024. We also generate a synthetic audio dataset (SYN) of dead public figures and propose a novel Context-based Audio Deepfake Detector (CADD) architecture. In addition, we evaluate performance on two large-scale datasets: ITW and P$^{2}$V. We show that sufficient context and/or the transcript can significantly improve the efficacy of audio deepfake detectors. Performance (measured via F1 score, AUC, and EER) of multiple baseline audio deepfake detectors and traditional classifiers can be improved by 5%-37.58% in F1-score, 3.77%-42.79% in AUC, and…
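For readers unfamiliar with the three metrics named in the abstract (F1 score, AUC, and EER), the following is a minimal sketch of how they are commonly computed from a detector's scores. It is not the authors' evaluation code. The label convention (1 = fake, 0 = real), the 0.5 decision threshold, the function names, and the toy scores are all assumptions made for illustration.

```python
# Illustrative only: toy metric computations, not the paper's evaluation pipeline.
# Assumed convention: label 1 = deepfake, label 0 = real; higher score = more likely fake.

def f1_score(labels, preds):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive (fake) class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc(labels, scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def equal_error_rate(labels, scores):
    """EER: error rate at the threshold where false-accept and false-reject rates are closest."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best_gap, eer = None, None
    for t in sorted(set(scores)):
        far = sum(1 for s in neg if s >= t) / len(neg)  # real audio flagged as fake
        frr = sum(1 for s in pos if s < t) / len(pos)   # fake audio missed
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Hypothetical scores for six clips (three fake, three real).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f1_score(labels, preds), roc_auc(labels, scores), equal_error_rate(labels, scores))
```

Higher F1 and AUC indicate better detection, while a lower EER is better, so an improvement in EER shows up as a decrease.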

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Music and Audio Processing