Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features

Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, and Xun Chen

arXiv:2305.13700·cs.SD·May 24, 2023·1 cites

TL;DR

This paper introduces a multi-view feature approach combining prosodic, pronunciation, and wav2vec features with attention mechanisms to improve the cross-dataset generalization of fake audio detection systems.

Contribution

It proposes a novel multi-view feature extraction method and fusion strategy that enhances fake audio detection performance across different datasets.

Findings

01 Significant performance improvements in cross-dataset experiments
02 Effective fusion of prosodic, pronunciation, and wav2vec features
03 Enhanced generalization of fake audio detection models

Abstract

Existing fake audio detection systems perform well in in-domain testing, but still face many challenges in out-of-domain testing. This is due to the mismatch between the training and test data, as well as the poor generalizability of features extracted from limited views. To address this, we propose multi-view features for fake audio detection, which aim to capture more generalized features from prosodic, pronunciation, and wav2vec dimensions. Specifically, the phoneme duration features are extracted from a pre-trained model based on a large amount of speech data. For the pronunciation features, a Conformer-based phoneme recognition model is first trained, keeping the acoustic encoder part as a deeply embedded feature extractor. Furthermore, the prosodic and pronunciation features are fused with wav2vec features based on an attention mechanism to improve the generalization of fake audio…
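The attention-based fusion of the three views can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention formulation, so the code assumes each view has already been projected to a shared dimension `D` and uses a simple per-frame softmax over views. The names `fuse_views` and `score_vec`, the random inputs, and the shapes are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(views, score_vec):
    """Fuse per-frame features from several views with attention weights.

    views:     list of V arrays, each of shape (T, D) (assumed pre-projected).
    score_vec: array of shape (D,), a stand-in for a learned scoring vector.
    Returns the fused features (T, D) and the per-frame view weights (T, V).
    """
    stacked = np.stack(views, axis=1)            # (T, V, D)
    scores = np.tanh(stacked) @ score_vec        # (T, V): one score per view per frame
    weights = softmax(scores, axis=1)            # weights over views sum to 1 per frame
    fused = (weights[..., None] * stacked).sum(axis=1)  # weighted sum over views
    return fused, weights

# Random placeholders standing in for the three feature streams.
rng = np.random.default_rng(0)
T, D = 50, 16
prosodic = rng.standard_normal((T, D))
pronunciation = rng.standard_normal((T, D))
wav2vec = rng.standard_normal((T, D))
score_vec = rng.standard_normal(D)

fused, weights = fuse_views([prosodic, pronunciation, wav2vec], score_vec)
```

In a trained system the scoring vector and the projections would be learned jointly with the classifier; here they are random, so the sketch only shows the data flow and shapes.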

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection