Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations
Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Drishti Singh, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

TL;DR
This paper introduces TRIO, a novel framework that fuses paralinguistic and speaker recognition pre-trained speech representations to improve source tracing of synthetic speech systems, achieving state-of-the-art results.
Contribution
It proposes a new fusion method combining paralinguistic and speaker recognition models with a gated mechanism and CCA loss for better source attribution in synthetic speech.
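The two ingredients named above, a gate that blends two representations and a CCA-based loss that encourages the two views to agree, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give TRIO's layer layout or loss formulation, so the function names (gated_fusion, cca_loss), the single-layer sigmoid gate, the shared dimension of 8, and the linear CCA objective (negative sum of canonical correlations) are all hypothetical choices. The inputs stand in for paralinguistic and speaker embeddings that have already been projected to a common size.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(h_a, h_b, w_gate, b_gate):
    """Blend two (batch, dim) views with an element-wise gate.

    Assumed form: gate = sigmoid([h_a ; h_b] W + b), output = gate*h_a + (1-gate)*h_b.
    w_gate has shape (2*dim, dim); b_gate has shape (dim,).
    """
    gate = sigmoid(np.concatenate([h_a, h_b], axis=1) @ w_gate + b_gate)
    return gate * h_a + (1.0 - gate) * h_b


def _inv_sqrt(mat):
    # Inverse square root of a symmetric positive-definite matrix.
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def cca_loss(h_a, h_b, reg=1e-4):
    """Negative sum of canonical correlations between two (batch, dim) views.

    Lower is better: minimizing it pushes the views toward maximal correlation.
    reg is a small ridge term (an assumption) that keeps the covariances invertible.
    """
    n = h_a.shape[0]
    a = h_a - h_a.mean(axis=0)
    b = h_b - h_b.mean(axis=0)
    s_aa = a.T @ a / (n - 1) + reg * np.eye(a.shape[1])
    s_bb = b.T @ b / (n - 1) + reg * np.eye(b.shape[1])
    s_ab = a.T @ b / (n - 1)
    t = _inv_sqrt(s_aa) @ s_ab @ _inv_sqrt(s_bb)
    # Singular values of t are the canonical correlations.
    return -np.linalg.svd(t, compute_uv=False).sum()


# Toy data standing in for projected embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
batch, dim = 64, 8
h_para = rng.standard_normal((batch, dim))  # stand-in for the paralinguistic view
h_spk = rng.standard_normal((batch, dim))   # stand-in for the speaker view
w_gate = 0.1 * rng.standard_normal((2 * dim, dim))
b_gate = np.zeros(dim)

fused = gated_fusion(h_para, h_spk, w_gate, b_gate)
print("fused shape:", fused.shape)
print("CCA loss:", cca_loss(h_para, h_spk))
```

In a trainable version the gate weights and projections would be learned jointly, with the CCA term added to the classification loss; the weighting between the two terms is not stated in this summary.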
Findings
TRIO outperforms individual speech pre-trained models (SPTMs) and baseline fusion methods.
Fusing TRILLsson and x-vector improves source tracing accuracy.
The approach sets a new state of the art in synthetic speech source tracing.
Abstract
In this work, we focus on source tracing of synthetic speech generation systems (STSGS). Each source embeds distinctive paralinguistic features--such as pitch, tone, rhythm, and intonation--into its synthesized speech, reflecting the underlying design of the generation model. While previous research has explored representations from speech pre-trained models (SPTMs), the use of representations from SPTMs pre-trained for paralinguistic speech processing, which excel in paralinguistic tasks such as synthetic speech detection and speech emotion recognition, has not been investigated for STSGS. We hypothesize that representations from a paralinguistic SPTM will be more effective because its paralinguistic pre-training enables it to capture source-specific paralinguistic cues. Our comparative study of representations from various SOTA SPTMs, including paralinguistic, monolingual,…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
