Deep Spectro-temporal Artifacts for Detecting Synthesized Speech
Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, Jianwu Dang

TL;DR
This paper presents a system for detecting synthesized speech artifacts using spectro-temporal features, deep embeddings, and self-supervised learning, evaluated in the Audio Deep Synthesis Detection Challenge.
Contribution
The paper introduces a multi-faceted approach combining feature fusion, domain adaptation, and self-supervised learning for improved fake audio detection.
Findings
Ranked 4th in track 1 and 5th in track 2 of the ADD Challenge.
Effective use of feature fusion and domain adaptation techniques.
Demonstrated the importance of spectro-temporal artifacts in detection.
Abstract
The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). Spectro-temporal artifacts were detected using raw temporal signals, spectral features, and deep embedding features. To address track 1, our system combined low-quality data augmentation, domain adaptation via finetuning, and fusion of complementary feature information. Furthermore, we analyzed the clustering characteristics of subsystems with different features through visualization and explained the effectiveness of our proposed greedy fusion strategy. As for track 2, frame transition and smoothing were detected using a self-supervised learning structure to capture the manipulation of PF…
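The abstract names a greedy fusion strategy but does not describe how it works. As a rough illustration only, the sketch below shows one common reading of the idea: greedy forward selection over subsystem scores on a development set, keeping a subsystem only if adding it lowers the equal error rate (EER) of the averaged score. The function names `compute_eer` and `greedy_fusion`, the plain score averaging, the label convention, and the stopping rule are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate EER. Assumed convention: label 1 = genuine, 0 = fake,
    and a higher score means "more likely genuine"."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # fake accepted as genuine
        frr = np.mean(scores[labels == 1] < t)   # genuine rejected as fake
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

def greedy_fusion(subsystem_scores, labels):
    """Hypothetical greedy forward selection.

    subsystem_scores: dict mapping subsystem name -> dev-set score array.
    Returns the selected names (in order) and the EER of their averaged score.
    """
    selected, best_eer = [], np.inf
    remaining = sorted(subsystem_scores)  # sorted for deterministic tie-breaking
    while remaining:
        # EER of the fused score if each remaining candidate were added
        trial = {
            name: compute_eer(
                np.mean([subsystem_scores[n] for n in selected + [name]], axis=0),
                labels,
            )
            for name in remaining
        }
        name = min(remaining, key=trial.get)
        if trial[name] >= best_eer:
            break  # no candidate strictly improves the fused EER
        selected.append(name)
        best_eer = trial[name]
        remaining.remove(name)
    return selected, best_eer

# Toy dev set: three genuine and three fake utterances (made-up scores)
labels = [1, 1, 1, 0, 0, 0]
dev_scores = {
    "a": np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1]),
    "b": np.array([0.6, 0.4, 0.7, 0.5, 0.3, 0.2]),
}
chosen, fused_eer = greedy_fusion(dev_scores, labels)
```

In this toy case subsystem `a` already separates the two classes, so the selection stops after one step. In practice the selection would be run on held-out development scores, and weighted or logistic-regression fusion could replace the simple average.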
