Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling
Viola Negroni, Luca Cuccovillo, Paolo Bestagini, Patrick Aichroth, Stefano Tubaro

TL;DR
This paper presents a multi-task transformer model for speech deepfake detection that predicts formant and voicing patterns, offering improved interpretability and efficiency over previous models while maintaining high accuracy.
Contribution
The authors introduce a streamlined multi-task transformer with enhanced explainability and faster training, advancing deepfake detection by integrating formant and voicing analysis.
Findings
Fewer parameters and faster training compared to baseline.
Improved interpretability without loss of accuracy.
Effective classification of real vs. fake speech.
Abstract
In this work, we introduce a multi-task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.
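The abstract does not say how the model decides whether a prediction leans on voiced or unvoiced regions. As a rough illustration only, the sketch below assumes the model exposes one non-negative relevance weight per frame (for example, pooled attention) alongside a per-frame voicing probability, and reports how much of the total weight falls on each kind of frame. The function name, its parameters, and the 0.5 threshold are hypothetical and not taken from the paper.

```python
from typing import Sequence, Tuple


def voiced_weight_share(
    frame_weights: Sequence[float],
    voicing_probs: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off for calling a frame "voiced"
) -> Tuple[float, float]:
    """Split the total per-frame relevance between voiced and unvoiced frames.

    frame_weights: hypothetical non-negative relevance score per frame.
    voicing_probs: hypothetical predicted probability that each frame is voiced.
    Returns (voiced_share, unvoiced_share), which sum to 1.
    """
    if len(frame_weights) != len(voicing_probs):
        raise ValueError("frame_weights and voicing_probs must have equal length")
    if any(w < 0 for w in frame_weights):
        raise ValueError("frame weights must be non-negative")

    total = sum(frame_weights)
    if total <= 0:
        raise ValueError("frame weights must have a positive sum")

    # Accumulate the weight of every frame whose voicing probability
    # reaches the threshold; the remainder belongs to unvoiced frames.
    voiced = sum(w for w, p in zip(frame_weights, voicing_probs) if p >= threshold)
    return voiced / total, (total - voiced) / total


if __name__ == "__main__":
    weights = [0.1, 0.4, 0.3, 0.2]   # toy relevance scores, not real data
    voicing = [0.9, 0.8, 0.2, 0.1]   # toy voicing probabilities, not real data
    voiced_share, unvoiced_share = voiced_weight_share(weights, voicing)
    print(f"voiced: {voiced_share:.2f}, unvoiced: {unvoiced_share:.2f}")
```

A share well above one half would suggest the decision relies mostly on voiced frames. Whether the authors' built-in explainability works this way is not stated in the abstract.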
Taxonomy
Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Emotion and Mood Recognition
