Quantizer-Aware Hierarchical Neural Codec Modeling for Speech Deepfake Detection

Jinyang Wu; Zihan Pan; Qiquan Zhang; Sailor Hardik Bhupendra; Soumik Mondal

arXiv:2603.16914 · cs.SD · March 19, 2026

TL;DR

This paper introduces a hierarchy-aware framework that uses quantizer-level information from a neural speech codec to improve speech deepfake detection, reducing the equal error rate (EER) by 46.2% relative on ASVspoof 2019 and by 13.9% relative on ASVspoof5.

Contribution

It presents a quantizer-level, hierarchy-aware framework that improves speech deepfake detection by weighting discrete codec representations across quantizers, keeping the speech encoder backbone frozen and updating only 4.4% additional parameters.

Findings

01. Achieves a 46.2% relative reduction in equal error rate (EER) on ASVspoof 2019

02. Achieves a 13.9% relative EER reduction on ASVspoof5

03. Updates only 4.4% additional parameters, with the speech encoder backbone kept frozen
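
A relative EER reduction is the drop in equal error rate divided by the baseline EER, which is different from an absolute drop in percentage points. A minimal sketch of that arithmetic follows; the function name and the EER values are hypothetical and are not taken from the paper.

```python
def relative_eer_reduction(baseline_eer: float, new_eer: float) -> float:
    """Return the relative EER reduction in percent.

    Both inputs are EERs in the same unit (for example, percent).
    """
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical values for illustration only (not results from the paper):
# a baseline EER of 2.0% falling to 1.0% is a 1.0-point absolute drop
# and a 50% relative reduction.
example = relative_eer_reduction(2.0, 1.0)
print(f"{example:.1f}% relative reduction")
```

Reading the findings this way, a 46.2% relative reduction means the reported EER is a little over half of the baseline EER, whatever the baseline value is.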

Abstract

Neural audio codecs discretize speech via residual vector quantization (RVQ), forming a coarse-to-fine hierarchy across quantizers. While codec models have been explored for representation learning, their discrete structure remains underutilized in speech deepfake detection. In particular, different quantization levels capture complementary acoustic cues, where early quantizers encode coarse structure and later quantizers refine residual details that reveal synthesis artifacts. Existing systems either rely on continuous encoder features or ignore this quantizer-level hierarchy. We propose a hierarchy-aware representation learning framework that models quantizer-level contributions through learnable global weighting, enabling structured codec representations aligned with forensic cues. Keeping the speech encoder backbone frozen and updating only 4.4% additional parameters, our method…
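
The coarse-to-fine hierarchy and the idea of weighting quantizer levels can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the codebooks are random, the function names are hypothetical, and the "learnable global weighting" is assumed here to be one scalar logit per quantizer, normalized with a softmax. The abstract does not specify the actual formulation.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Residual vector quantization of one frame.

    x: array of shape [D]. codebooks: list of Q arrays, each [K, D].
    Each quantizer encodes the residual left by the previous ones.
    Returns the chosen indices and the per-level code vectors [Q, D].
    """
    residual = np.asarray(x, dtype=float).copy()
    indices, levels = [], []
    for codebook in codebooks:
        # Nearest codeword to the current residual (squared L2 distance).
        dists = np.sum((codebook - residual) ** 2, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        levels.append(codebook[k])
        residual = residual - codebook[k]
    return indices, np.stack(levels)


def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def fuse_levels(levels, logits):
    """Weighted sum of per-quantizer vectors.

    levels: [Q, D]; logits: [Q], one global scalar per quantizer
    (assumed parameterization). Returns the fused vector [D] and weights [Q].
    """
    weights = softmax(np.asarray(logits, dtype=float))
    return weights @ levels, weights


# Toy setup: 3 quantizers, 16 codewords each, 4-dimensional frames.
rng = np.random.default_rng(0)
codebooks = [rng.normal(scale=1.0 / (2 ** q), size=(16, 4)) for q in range(3)]
frame = rng.normal(size=4)

indices, levels = rvq_encode(frame, codebooks)

# Zero logits give uniform weights; in training these logits would be
# the learned parameters while the codec itself stays frozen.
fused, weights = fuse_levels(levels, np.zeros(len(codebooks)))
print(indices, weights)
```

With uniform weights the fused vector is the mean of the per-level code vectors, which is the standard RVQ reconstruction (their sum) scaled by 1/Q. Non-uniform weights let a detector emphasize particular levels, for example the later quantizers that the abstract associates with residual details.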

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech Recognition and Synthesis