Investigation of Synthetic Speech Detection Using Frame- and Segment-Specific Importance Weighting
Ali Khodabakhsh, Cenk Demiroglu

TL;DR
This paper proposes three algorithms that weight the likelihood-ratio scores of individual frames, phonemes, and sound classes by their importance for synthetic speech detection, showing significant gains for known attacks but limited improvement for unknown attack types.
Contribution
The paper introduces novel weighting algorithms for speech segments, phonemes, and sound classes to enhance synthetic speech detection accuracy.
Findings
Significant improvement over the baseline for known attack methods, i.e. those used in training the detectors.
Limited improvement for unknown attack types.
Weighted scoring outperforms equal-weight baseline scoring mainly in the known-attack scenario.
Abstract
Speaker verification systems are vulnerable to spoofing attacks, which presents a major problem for their real-life deployment. To date, most of the proposed synthetic speech detectors (SSDs) have weighted the importance of different segments of speech equally. However, different attack methods have different strengths and weaknesses, and the traces that they leave may be short- or long-term acoustic artifacts. Moreover, those artifacts may occur for only particular phonemes or sounds. Here, we propose three algorithms that weigh likelihood-ratio scores of individual frames, phonemes, and sound-classes depending on their importance for the SSD. Significant improvement over the baseline system has been obtained for known attack methods that were used in training the SSDs. However, improvement with unknown attack types was not substantial. Thus, the type of distortions that were caused by the unknown…
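The idea of weighting per-frame likelihood-ratio scores can be illustrated with a minimal sketch. The abstract does not specify how the three algorithms compute or apply their weights, so the normalized weighted average below, the function names `weighted_llr_score` and `phoneme_weighted_score`, and the toy weight values are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def weighted_llr_score(frame_llrs, weights=None):
    """Combine per-frame log-likelihood ratios into one utterance score.

    With weights=None every frame counts equally (the equal-weight baseline).
    The normalized weighted average is an assumed combination rule.
    """
    frame_llrs = np.asarray(frame_llrs, dtype=float)
    if weights is None:
        weights = np.ones_like(frame_llrs)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * frame_llrs) / np.sum(weights))

def phoneme_weighted_score(frame_llrs, phoneme_labels, phoneme_weights, default=1.0):
    """Weight each frame by a (hypothetical) importance value for its phoneme."""
    w = np.array([phoneme_weights.get(p, default) for p in phoneme_labels])
    return weighted_llr_score(frame_llrs, w)

# Toy data: four frames, two phoneme labels, made-up importance weights.
llrs = [0.2, -1.0, 0.4, -3.0]
labels = ["a", "s", "a", "s"]
importance = {"a": 1.0, "s": 3.0}

baseline = weighted_llr_score(llrs)
weighted = phoneme_weighted_score(llrs, labels, importance)
```

In this toy case the frames labelled "s" carry the more negative scores, so giving them a larger weight moves the utterance score further from zero than the equal-weight baseline does. In practice the weights would have to be learned from training data, which is consistent with the reported gains being concentrated on attack types seen in training.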
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD
