Joint Fullband-Subband Modeling for High-Resolution SingFake Detection
Xuanjun Chen, Chia-Yu Hu, Sung-Feng Huang, Haibin Wu, Hung-yi Lee, and Jyh-Shing Roger Jang

TL;DR
This paper introduces a joint fullband-subband modeling approach that uses high-resolution 44.1 kHz audio to improve the detection of singing voice deepfakes, outperforming models that operate on 16 kHz-sampled audio.
Contribution
The paper presents the first systematic analysis of high-resolution (44.1 kHz) audio for SingFake detection and combines global fullband context with fine-grained subband features to improve detection accuracy.
Findings
High-frequency subbands provide essential cues for detection.
High-resolution audio significantly improves detection performance.
The proposed framework outperforms 16 kHz-sampled models on the WildSVDD dataset.
Abstract
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that…
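
To make the fullband-subband idea concrete, the following is a minimal sketch, not the authors' implementation. It computes a magnitude spectrogram of a 44.1 kHz signal (the fullband view) and cuts it into contiguous frequency subbands that separate models could consume. The function names, the FFT size, the hop length, and the choice of four equal-width subbands are all illustrative assumptions; this summary does not state the paper's actual front-end or band layout. The sketch also counts how many frequency bins lie above 8 kHz, the Nyquist limit of 16 kHz-sampled audio, to show how much of the spectrum a 16 kHz detector never sees.

```python
import numpy as np

SR = 44100      # high-resolution sampling rate discussed in the abstract
N_FFT = 1024    # assumed FFT size (hypothetical, not from the paper)
HOP = 256       # assumed hop length (hypothetical, not from the paper)


def stft_magnitude(wave, n_fft=N_FFT, hop=HOP):
    """Return a magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


def split_subbands(spec, n_bands=4):
    """Split a spectrogram into contiguous subbands along the frequency axis.

    Equal-width bands are an assumption made for illustration only.
    """
    return np.array_split(spec, n_bands, axis=0)


# Synthetic one-second test signal: a 440 Hz tone plus a 15 kHz component
# that a 16 kHz-sampled system (Nyquist limit 8 kHz) could not represent.
t = np.arange(SR) / SR
wave = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 15000 * t)

fullband = stft_magnitude(wave)          # global view over 0 to 22.05 kHz
subbands = split_subbands(fullband, 4)   # fine-grained views per band

# Frequency of each bin, and the number of bins above the 8 kHz limit.
freqs = np.fft.rfftfreq(N_FFT, d=1 / SR)
bins_above_8k = int(np.sum(freqs > 8000))

print("fullband shape:", fullband.shape)
print("subband shapes:", [b.shape for b in subbands])
print("bins above 8 kHz:", bins_above_8k, "of", len(freqs))
```

In a joint setup of the kind the abstract describes, the fullband array would feed a model of global context while each subband array would feed its own expert. How the outputs are fused is not specified in this summary, so it is left out of the sketch.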
