Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection
Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi

TL;DR
This paper introduces a multi-task learning framework built on SELCNN, a light CNN (LCNN) augmented with squeeze-and-excitation blocks, to detect spoofing at the segmental and utterance levels simultaneously, and reports better performance than single-task models.
Contribution
The paper proposes SELCNN, which inserts squeeze-and-excitation (SE) blocks into an LCNN, and benchmarks multi-task learning on the PartialSpoof database across architectures (uni-branch vs. multi-branch) and training strategies (from-scratch vs. warm-up).
Findings
The multi-task model performs relatively better than single-task models.
A binary-branch architecture uses information from the two levels (segment and utterance) more adequately than a uni-branch model.
Fine-tuning with warm-up models yields superior results.
Abstract
In this paper, we provide a series of multi-tasking benchmarks for simultaneously detecting spoofing at the segmental and utterance levels in the PartialSpoof database. First, we propose the SELCNN network, which inserts squeeze-and-excitation (SE) blocks into a light convolutional neural network (LCNN) to enhance the capacity of hidden feature selection. Then, we implement multi-task learning (MTL) frameworks with SELCNN followed by bidirectional long short-term memory (Bi-LSTM) as the basic model. We discuss MTL in PartialSpoof in terms of architecture (uni-branch/multi-branch) and training strategies (from-scratch/warm-up) step-by-step. Experiments show that the multi-task model performs relatively better than single-task models. Also, in MTL, a binary-branch architecture more adequately utilizes information from two levels than a uni-branch model. For the binary-branch architecture,…
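The squeeze-and-excitation step mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. This follows the generic SE recipe (global average pooling, a bottleneck of two fully connected layers, then channel-wise rescaling); it is not the authors' SELCNN code. The function name `se_block`, the weight matrices `w1` and `w2`, and the toy input are all assumptions made for illustration, and biases are omitted for brevity.

```python
import math


def se_block(feature_map, w1, w2):
    """Generic squeeze-and-excitation over a list of channels.

    feature_map: C channels, each a 2-D list (time x frequency).
    w1: (C/r) x C reduction weights (hypothetical values).
    w2: C x (C/r) expansion weights (hypothetical values).
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling, one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer with ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w1
    ]
    # Excitation, step 2: expansion layer with sigmoid, giving gates in (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input: two channels of size 2 x 2, reduction from 2 channels to 1.
feature_map = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
w1 = [[1.0, -1.0]]
w2 = [[1.0], [-1.0]]
scaled, gates = se_block(feature_map, w1, w2)
```

With these toy weights the first channel receives a larger gate than the second, which is the "feature selection" effect an SE block is meant to provide: channels are re-weighted according to a summary of the whole feature map.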
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
