Easy, Interpretable, Effective: openSMILE for voice deepfake detection

Octavian Pascu; Dan Oneata; Horia Cucu; Nicolas M. Müller

arXiv:2408.15775 · eess.AS · August 30, 2024

TL;DR

This paper shows that simple, interpretable features from the openSMILE library can effectively detect voice deepfakes in standard datasets, achieving low error rates and revealing insights into attack signatures.

Contribution

It introduces a lightweight, human-interpretable feature set for voice deepfake detection that outperforms complex models and provides understanding of attack signatures.

Findings

01

Achieves an EER as low as 0.8% on individual attacks, with an overall EER of 15.7 ± 6.0%

02

Simple features transfer across similar TTS attack types

03

Provides insights into TTS fingerprinting for anti-spoofing

Abstract

In this paper, we demonstrate that attacks in the latest ASVspoof5 dataset, a de facto standard in the field of voice authenticity and deepfake detection, can be identified with surprising accuracy using a small subset of very simplistic features. These are derived from the openSMILE library, and are scalar-valued, easy to compute, and human interpretable. For example, attack A10's unvoiced segments have a mean length of 0.09 ± 0.02, while bona fide instances have a mean length of 0.18 ± 0.07. Using this feature alone, a threshold classifier achieves an Equal Error Rate (EER) of 10.3% for attack A10. Similarly, across all attacks, we achieve up to 0.8% EER, with an overall EER of 15.7 ± 6.0%. We explore the generalization capabilities of these features and find that some of them transfer effectively between attacks, primarily when the attacks originate from similar Text-to-Speech…
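
To make the single-feature threshold classifier concrete, here is a minimal sketch of how an EER could be computed from one scalar feature. It is not the authors' code. The helper name `equal_error_rate` is hypothetical, and the samples are synthetic: they are drawn from Gaussians with the means and deviations quoted in the abstract, which is an assumption about the shape of the real distributions. The number it prints is therefore not expected to match the reported 10.3%.

```python
import random


def equal_error_rate(bona_fide, spoof):
    """Approximate EER for the rule 'value <= threshold means spoof'.

    Sweeps every observed value as a candidate threshold and returns the
    error rate at the point where false acceptance and false rejection
    are closest to each other.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(bona_fide) | set(spoof)):
        # False acceptance: spoofed samples above the threshold pass as bona fide.
        far = sum(s > t for s in spoof) / len(spoof)
        # False rejection: bona fide samples at or below the threshold are flagged.
        frr = sum(b <= t for b in bona_fide) / len(bona_fide)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


rng = random.Random(0)
# Synthetic stand-ins for the mean unvoiced segment length (assumed Gaussian).
bona_fide = [rng.gauss(0.18, 0.07) for _ in range(500)]
spoof_a10 = [rng.gauss(0.09, 0.02) for _ in range(500)]

eer = equal_error_rate(bona_fide, spoof_a10)
print(f"EER on synthetic data: {eer:.1%}")
```

The rule flags low values as spoofed because, per the abstract, A10's unvoiced segments are shorter on average than bona fide ones. With real data, the two lists would be replaced by the per-utterance feature values extracted with openSMILE.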

Taxonomy

Topics: Speech Recognition and Synthesis