A Parameter-Efficient Multi-Scale Convolutional Adapter for Synthetic Speech Detection
Yassine El Kheir, Fabian Ritter-Guttierez, Arnab Das, Tim Polzehl, Sebastian Möller

TL;DR
This paper presents MultiConvAdapter, a parameter-efficient architecture that enhances synthetic speech detection by capturing multi-scale temporal artifacts with minimal additional parameters, outperforming existing fine-tuning methods.
Contribution
The paper introduces MultiConvAdapter, a novel multi-scale convolutional adapter that efficiently models temporal artifacts in speech, reducing computational costs while improving detection accuracy.
Findings
Achieves superior performance on five datasets.
Adds only 3.17 million trainable parameters, a small fraction of the SSL backbone.
Outperforms full fine-tuning and existing PEFT methods.
Abstract
Recent synthetic speech detection models typically adapt a pre-trained SSL model via fine-tuning, which is computationally demanding. Parameter-Efficient Fine-Tuning (PEFT) offers an alternative. However, existing methods lack the specific inductive biases required to model the multi-scale temporal artifacts characteristic of spoofed audio. This paper introduces the Multi-Scale Convolutional Adapter (MultiConvAdapter), a parameter-efficient architecture designed to address this limitation. MultiConvAdapter integrates parallel convolutional modules within the SSL encoder, facilitating the simultaneous learning of discriminative features across multiple temporal resolutions, capturing both short-term artifacts and long-term distortions. With only 3.17M trainable parameters (a small fraction of the SSL backbone), MultiConvAdapter substantially reduces the computational burden of adaptation.…
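The idea of parallel convolutions at several temporal resolutions can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`depthwise_conv1d`, `box_kernels`, `multi_scale_adapter`), the depthwise convolutions, the averaging of branches, and the residual connection are all assumptions chosen for illustration. It is written in plain Python for readability; a practical version would presumably use a deep learning framework and learned kernels.

```python
from typing import List

# A feature sequence shaped [time][channels]; hypothetical layout for this sketch.
Matrix = List[List[float]]


def depthwise_conv1d(x: Matrix, weights: List[List[float]]) -> Matrix:
    """Per-channel 1-D convolution along time with zero 'same' padding.

    weights[c][k] is tap k of the kernel for channel c; the kernel width
    is assumed to be odd so the output keeps the input length.
    """
    n_frames, n_channels = len(x), len(x[0])
    width = len(weights[0])
    half = width // 2
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for t in range(n_frames):
        for c in range(n_channels):
            acc = 0.0
            for k in range(width):
                src = t + k - half
                if 0 <= src < n_frames:  # frames outside the sequence count as zero
                    acc += weights[c][k] * x[src][c]
            out[t][c] = acc
    return out


def box_kernels(n_channels: int, width: int) -> List[List[float]]:
    """Moving-average kernels, standing in for learned weights."""
    return [[1.0 / width] * width for _ in range(n_channels)]


def multi_scale_adapter(x: Matrix, branches: List[List[List[float]]]) -> Matrix:
    """Average several parallel convolution branches and add the input back.

    Each entry of `branches` is one set of depthwise kernels; narrow kernels
    respond to short-term patterns, wide kernels to longer-term ones.
    """
    out = [row[:] for row in x]  # residual path: start from a copy of the input
    for weights in branches:
        y = depthwise_conv1d(x, weights)
        for t in range(len(x)):
            for c in range(len(x[0])):
                out[t][c] += y[t][c] / len(branches)
    return out


if __name__ == "__main__":
    # One channel, five frames, a single impulse in the middle.
    frames = [[0.0], [0.0], [1.0], [0.0], [0.0]]
    mixed = multi_scale_adapter(frames, [box_kernels(1, 3), box_kernels(1, 5)])
    for t, row in enumerate(mixed):
        print(t, round(row[0], 4))
```

In this sketch, each branch adds one kernel of its width per channel, so the added parameter count grows with the sum of the kernel widths rather than with the size of the backbone. With all-zero kernels the adapter returns its input unchanged, which is a common way to initialise adapters so that training starts from the pre-trained model's behaviour.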
