Multi-Task Transformer for Explainable Speech Deepfake Detection via Formant Modeling

Viola Negroni; Luca Cuccovillo; Paolo Bestagini; Patrick Aichroth; Stefano Tubaro

arXiv:2601.14850·cs.SD·January 23, 2026

Open Access

TL;DR

This paper presents a multi-task transformer for speech deepfake detection that predicts formant trajectories and voicing patterns over time before classifying speech as real or fake. Compared to the baseline it builds on, the model has fewer parameters, trains faster, and is more interpretable, without sacrificing prediction performance.

Contribution

The authors streamline a prior speaker-formant transformer architecture with an improved input segmentation strategy, a redesigned decoding process, and built-in explainability that indicates whether a decision relies more on voiced or unvoiced regions.

Findings

01. Fewer parameters and faster training than the baseline.

02. Better interpretability without loss of prediction performance.

03. Effective classification of real vs. fake speech.

Abstract

In this work, we introduce a multi-task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.
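
To make the multi-task idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the loss, the task weights, or how the voiced/unvoiced explanation is computed. The function names (`multitask_loss`, `voiced_share`), the equal default weights, the use of a single normalized formant track, and the per-frame relevance scores are all assumptions made for illustration.

```python
import math


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one probability/label pair (clipped for stability)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(formant_pred, formant_true, voicing_prob, voicing_true,
                   fake_prob, is_fake, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three task losses (hypothetical formulation).

    formant_pred / formant_true: per-frame values of one formant track,
        assumed to be normalized (e.g. scaled to [0, 1]).
    voicing_prob / voicing_true: per-frame voicing probability and 0/1 label.
    fake_prob / is_fake: utterance-level probability and 0/1 label.
    """
    n = len(formant_true)
    # Regression term: mean squared error on the formant trajectory.
    l_formant = sum((p - t) ** 2 for p, t in zip(formant_pred, formant_true)) / n
    # Per-frame voicing term: mean binary cross-entropy.
    l_voicing = sum(bce(p, t) for p, t in zip(voicing_prob, voicing_true)) / n
    # Utterance-level real/fake term.
    l_cls = bce(fake_prob, is_fake)
    w_f, w_v, w_c = weights
    return w_f * l_formant + w_v * l_voicing + w_c * l_cls


def voiced_share(relevance, voiced):
    """Fraction of total per-frame relevance that falls on voiced frames.

    relevance: non-negative per-frame scores (assumed to come from some
        attribution or attention mechanism; the source does not say which).
    voiced: per-frame 0/1 voicing flags.
    """
    total = sum(relevance)
    if total == 0:
        raise ValueError("relevance scores sum to zero")
    return sum(r for r, v in zip(relevance, voiced) if v) / total


# Example usage with made-up numbers for a four-frame utterance.
loss = multitask_loss(
    formant_pred=[0.30, 0.32, 0.35, 0.33],
    formant_true=[0.31, 0.32, 0.34, 0.33],
    voicing_prob=[0.1, 0.9, 0.8, 0.2],
    voicing_true=[0, 1, 1, 0],
    fake_prob=0.7,
    is_fake=1,
)
share = voiced_share([0.1, 0.3, 0.4, 0.2], [0, 1, 1, 0])
```

A `voiced_share` above 0.5 would suggest, under these assumptions, that the decision leans more on voiced regions; a value below 0.5 would point to unvoiced regions.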

Taxonomy

Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Emotion and Mood Recognition