Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs
Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, Shaikh Anowarul Fattah, Mohammad Saquib

TL;DR
This paper introduces Syn-Att, a semi-supervised ensemble CNN approach for attributing synthetic speech to its generator, significantly improving robustness and accuracy in distinguishing among multiple synthetic speech algorithms.
Contribution
It presents a novel semi-supervised ensemble CNN method for synthetic speech attribution, enhancing robustness and generalization across different datasets.
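As a rough illustration of the ensemble part of this contribution, the sketch below averages class probabilities from several models over six classes (five known generators plus one "unknown" class) and picks the most likely one. This summary does not say how the paper combines its CNNs, so probability averaging, the class names in `CLASSES`, and the function `ensemble_predict` are all assumptions made for illustration.

```python
import math

# Hypothetical label set: five known generators plus one "unknown" class.
CLASSES = ["algo_0", "algo_1", "algo_2", "algo_3", "algo_4", "unknown"]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits):
    """Average the softmax outputs of several models and return (label, probs).

    Assumption: simple unweighted averaging; the paper may combine models
    differently.
    """
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    avg = [sum(p[k] for p in probs) / n_models for k in range(len(CLASSES))]
    best = max(range(len(CLASSES)), key=avg.__getitem__)
    return CLASSES[best], avg
```

Each inner list passed to `ensemble_predict` stands for one model's raw output for a single audio clip. Averaging probabilities rather than taking a majority vote lets a confident model outweigh two uncertain ones.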
Findings
Outperforms top methods by 12-13% on strongly perturbed data
Achieves 1-2% accuracy improvement on less perturbed data
Validated on datasets of 18,000 and 10,000 synthetic speech tracks
Abstract
With the major advances that deep learning has brought to audio and speech processing, many novel synthetic speech techniques achieve incredibly realistic results. Because these methods generate realistic fake human voices, they can be used for malicious purposes such as impersonation, spreading fake news, spoofing, and media manipulation. Hence, the ability to distinguish synthetic from natural speech has become an urgent necessity. Moreover, being able to tell which algorithm generated a synthetic speech track can be of paramount importance for tracking down the culprit. In this paper, a novel strategy is proposed to attribute a synthetic speech track to the generator used to synthesize it. The proposed detector transforms the audio into a log-mel spectrogram, extracts features using a CNN, and classifies it as one of five known algorithms or as an unknown algorithm, utilizing…
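The first stage of the detector, converting audio into a log-mel spectrogram, can be sketched with NumPy alone. The sample rate, FFT size, hop length, and number of mel bands below are assumptions chosen for illustration; the abstract does not state the paper's settings, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft//2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Return a (n_mels, n_frames) log-mel spectrogram of a 1-D waveform.

    All default parameters are illustrative assumptions, not the paper's.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T     # (n_frames, n_mels)
    return np.log(mel + 1e-10).T                          # small offset avoids log(0)
```

The resulting 2-D array can be treated as a single-channel image, which is what makes a CNN a natural feature extractor for the next stage.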
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
