Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice
Shubham Gupta, Mirco Ravanelli, Pascal Germain, Cem Subakan

TL;DR
This paper introduces Phoneme Discretized Saliency Maps (PDSM), a method that uses phoneme boundaries to discretize saliency maps, yielding more faithful and more understandable explanations for detectors of AI-generated voice. It is evaluated on two TTS systems, Tacotron2 and FastSpeech2.
Contribution
The paper presents a new discretization algorithm for saliency maps that enhances explanation faithfulness and interpretability by integrating phoneme boundary information.
Findings
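PDSM produces more faithful explanations than standard post-hoc explanation methods.
Saliency maps tied to phoneme representations tend to be more understandable than standard saliency maps on magnitude spectrograms.
The results hold for both Text-to-Speech systems tested (Tacotron2 and FastSpeech2).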
Abstract
In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of phoneme boundaries for explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and FastSpeech2) that the proposed algorithm produces saliency maps that result in more faithful explanations compared to standard post-hoc explanation methods. Moreover, by associating the saliency maps with the phoneme representations, this methodology generates explanations that tend to be more understandable than standard saliency maps on magnitude spectrograms.
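The idea of discretizing a saliency map along phoneme boundaries can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `discretize_saliency`, the frame-index boundary format, and the choice of averaging saliency over time within each phoneme segment (separately for each frequency bin) are all assumptions made for illustration. The abstract does not specify the pooling rule.

```python
from typing import List, Sequence, Tuple


def discretize_saliency(
    saliency: Sequence[Sequence[float]],
    boundaries: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Hypothetical sketch of phoneme-wise saliency discretization.

    saliency:   2-D map indexed as [frequency_bin][time_frame].
    boundaries: phoneme segments as half-open frame ranges (start, end),
                e.g. from a forced aligner (assumed to be given).

    Assumption: every frame inside a phoneme segment is replaced by the
    mean saliency of that segment, computed per frequency bin.
    Frames not covered by any segment are left unchanged.
    """
    n_frames = len(saliency[0])
    out = [list(row) for row in saliency]  # copy so the input is untouched
    for start, end in boundaries:
        if not (0 <= start < end <= n_frames):
            raise ValueError(f"invalid segment ({start}, {end})")
        for row_in, row_out in zip(saliency, out):
            segment_mean = sum(row_in[start:end]) / (end - start)
            for t in range(start, end):
                row_out[t] = segment_mean
    return out


# Toy example: 2 frequency bins, 4 frames, two phonemes of 2 frames each.
toy_map = [
    [1.0, 3.0, 5.0, 7.0],
    [0.0, 2.0, 4.0, 4.0],
]
toy_phonemes = [(0, 2), (2, 4)]
discretized = discretize_saliency(toy_map, toy_phonemes)
```

After discretization, each phoneme carries one saliency value per frequency bin, so an explanation can be read as "which phonemes mattered" rather than as a dense pattern over individual spectrogram frames.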
Taxonomy
Topics: Speech Recognition and Synthesis
