VoxAnchor: Grounding Speech Authenticity in Throat Vibration via mmWave Radar
Mingda Han, Huanqi Yang, Chaoqun Li, Wenhao Li, Guoming Zhang, Yanni Yang, Yetong Cao, Weitao Xu, Pengfei Hu

TL;DR
VoxAnchor leverages millimeter-wave radar to detect speech forgeries by analyzing throat vibrations, providing a physiologically grounded, fine-grained authentication method that outperforms existing techniques.
Contribution
The paper introduces VoxAnchor, a novel system that physically grounds speech authentication in throat vibrations using radar, enabling robust, word-level forgery detection.
Findings
Achieves an overall equal error rate (EER) of 0.017 in forgery detection.
Effectively detects diverse forgeries including editing, splicing, replay, and deepfake.
Operates with low latency and modest computational cost.
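For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (forged speech accepted as genuine) equals the false rejection rate (genuine speech rejected). The sketch below shows one common way to estimate it from two sets of scores. It is illustrative only: the function name `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the toy scores are assumptions, not the paper's evaluation code or data.

```python
import numpy as np

def equal_error_rate(genuine_scores, forged_scores):
    """Estimate (eer, threshold), assuming higher score = more likely genuine."""
    genuine = np.asarray(genuine_scores, dtype=float)
    forged = np.asarray(forged_scores, dtype=float)
    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = np.unique(np.concatenate([genuine, forged]))
    # False acceptance rate: fraction of forged samples scoring at or above t.
    far = np.array([(forged >= t).mean() for t in thresholds])
    # False rejection rate: fraction of genuine samples scoring below t.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest to each other.
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0), float(thresholds[i])

# Toy scores (hypothetical, for illustration only).
eer, thr = equal_error_rate([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.75, 0.4])
```

With finite score sets the two rates rarely cross exactly, so this sketch averages them at the closest threshold; interpolating along the ROC curve is another common choice.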
Abstract
Rapid advances in speech synthesis and audio editing have made realistic forgeries increasingly accessible, yet existing detection methods remain vulnerable to tampering or depend on visual/wearable sensors. In this paper, we present VoxAnchor, a system that physically grounds audio authentication in vocal dynamics by leveraging the inherent coherence between speech acoustics and radar-sensed throat vibrations. VoxAnchor uses contactless millimeter-wave radar to capture fine-grained throat vibrations that are tightly coupled with human speech production, establishing a hard-to-forge anchor rooted in human physiology. The design comprises three main components: (1) a cross-modal framework that uses modality-specific encoders and contrastive learning to detect subtle mismatches at word granularity; (2) a phase-aware pipeline that extracts physically consistent, temporally faithful throat…
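The abstract does not specify which contrastive objective the cross-modal framework uses. As a rough illustration of the general idea, the sketch below computes a symmetric InfoNCE-style loss over paired audio and radar embeddings, where row i of each matrix is assumed to come from the same spoken word. The choice of InfoNCE, the function name `symmetric_info_nce`, the temperature value, and the embedding shapes are all assumptions made for illustration, not details from the paper.

```python
import numpy as np

def symmetric_info_nce(audio_emb, radar_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    # L2-normalise so the dot product below is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    r = radar_emb / np.linalg.norm(radar_emb, axis=1, keepdims=True)
    logits = a @ r.T / temperature  # shape: (n_pairs, n_pairs)

    def diagonal_cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (its matched pair).
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-radar and radar-to-audio directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Hypothetical embeddings: four word-level pairs with 4-dimensional features.
audio = np.eye(4)
matched_loss = symmetric_info_nce(audio, np.eye(4))
mismatched_loss = symmetric_info_nce(audio, np.roll(np.eye(4), 1, axis=0))
```

In this toy setup the matched pairs yield a much lower loss than the shuffled ones, which is the property a detector could exploit: audio whose embedding disagrees with the radar-sensed throat vibration for the same word is a candidate forgery.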
