RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing
Yang Xiao, Ting Dang, Rohan Kumar Das

TL;DR
RawTFNet is a lightweight CNN model for speech anti-spoofing that captures fine-grained features of synthetic speech, achieving performance comparable to state-of-the-art models at lower computational cost.
Contribution
Introduces RawTFNet, a novel lightweight CNN architecture that processes audio features separately along the time and frequency dimensions for effective anti-spoofing.
Findings
Achieves performance comparable to state-of-the-art models on the ASVspoof 2021 LA and DF evaluation datasets.
Uses fewer computational resources than transformer-based models.
Demonstrates effective feature extraction for synthetic speech detection.
Abstract
Automatic speaker verification (ASV) systems are often affected by spoofing attacks. Recent transformer-based models have improved anti-spoofing performance by learning strong feature representations. However, these models usually need high computing power. To address this, we introduce RawTFNet, a lightweight CNN model designed for audio signals. RawTFNet separates feature processing along the time and frequency dimensions, which helps to capture the fine-grained details of synthetic speech. We tested RawTFNet on the ASVspoof 2021 LA and DF evaluation datasets. The results show that RawTFNet performs comparably to state-of-the-art models while using fewer computing resources. The code and models will be made publicly available.
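Illustrative sketch
The abstract does not give layer-level details, so the following is only a rough illustration of why separating time and frequency processing can reduce model size. It assumes a hypothetical block made of a depthwise convolution along frequency, a depthwise convolution along time, and a 1x1 pointwise convolution, and compares its parameter count with a standard 2D convolution. The function names, channel count, and kernel size are assumptions chosen for illustration and are not taken from the paper.

```python
def conv2d_params(in_ch, out_ch, k_freq, k_time, groups=1, bias=True):
    """Learnable parameters in a 2D convolution with the given kernel shape."""
    if in_ch % groups or out_ch % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = out_ch * (in_ch // groups) * k_freq * k_time
    return weights + (out_ch if bias else 0)


def full_block_params(ch, k):
    """Baseline: one standard k x k convolution over the time-frequency map."""
    return conv2d_params(ch, ch, k, k)


def tf_separable_block_params(ch, k):
    """Hypothetical block (an assumption, not the paper's exact design):
    depthwise k x 1 along frequency, depthwise 1 x k along time,
    then a 1 x 1 pointwise convolution to mix channels."""
    freq = conv2d_params(ch, ch, k, 1, groups=ch)
    time = conv2d_params(ch, ch, 1, k, groups=ch)
    mix = conv2d_params(ch, ch, 1, 1)
    return freq + time + mix


if __name__ == "__main__":
    ch, k = 64, 3  # illustrative values only
    full = full_block_params(ch, k)
    separable = tf_separable_block_params(ch, k)
    print(f"standard {k}x{k} conv: {full} parameters")
    print(f"time/frequency separable block: {separable} parameters")
    print(f"reduction factor: {full / separable:.1f}x")
```

With these illustrative settings, the separable block needs several times fewer parameters than the standard convolution. The actual savings in RawTFNet depend on its real layer configuration, which the abstract does not specify.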
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
