Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection
Khalid Zaman, Masashi Unoki

TL;DR
This paper introduces an auditory perception-inspired spectro-temporal modulation (STM) framework for detecting human-imitated speech. It reports high detection accuracy and, in some cases, surpasses human perceptual performance.
Contribution
The study proposes a novel STM representation framework based on cochlear filterbank models for improved detection of human-imitated speech.
Findings
STM representations effectively detect human-imitated speech.
Segmental-STM surpasses human perceptual performance.
The approach enhances voice authentication robustness.
Abstract
Human-imitated speech is harder to detect than AI-generated speech, for both human listeners and automatic detection systems. AI-generated speech often contains artifacts, over-smoothed spectra, or robotic cues. Imitated speech, by contrast, is produced naturally by humans and so preserves a higher degree of naturalness, which makes imitation-based forgery significantly harder to detect with conventional acoustic or cepstral features. To address this, this study proposes an auditory perception-based Spectro-Temporal Modulation (STM) representation framework for human-imitated speech detection. The STM representations are derived from two cochlear filterbank models: the Gammatone Filterbank (GTFB), which simulates frequency selectivity and can be regarded as a first approximation of cochlear filtering, and the Gammachirp Filterbank (GCFB), which further…
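To make the idea of a gammatone-based STM representation more concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a bank of 4th-order complex gammatone filters with bandwidths set by the Glasberg and Moore ERB formula, takes the magnitude of each channel output as its temporal envelope, and applies a 2D FFT over channels and time frames to obtain spectral and temporal modulation components. The function names, the 200 Hz frame rate, the 16-channel layout, and the test signal are all hypothetical choices for illustration; the gammachirp variant and any segmental processing described in the paper are not covered.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore formula)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_envelopes(x, fs, centre_freqs, order=4, b=1.019, dur=0.05):
    """Filter x with complex gammatone filters and return the magnitude
    (temporal envelope) of each channel, shape (n_channels, len(x))."""
    t = np.arange(int(dur * fs)) / fs
    env = np.empty((len(centre_freqs), len(x)))
    for i, fc in enumerate(centre_freqs):
        # Complex impulse response: gamma-shaped envelope times a complex tone.
        g = t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) \
            * np.exp(2j * np.pi * fc * t)
        g = g / np.abs(g).sum()  # illustrative gain normalisation (assumption)
        env[i] = np.abs(np.convolve(x, g)[: len(x)])
    return env

def stm_representation(env, fs, frame_rate=200):
    """Average envelopes into frames, remove the mean, and take a 2D FFT.
    Returns the magnitude plus the temporal modulation axis in Hz and the
    spectral modulation axis in cycles per channel."""
    hop = int(fs // frame_rate)
    n_ch = env.shape[0]
    n_frames = env.shape[1] // hop
    frames = env[:, : n_frames * hop].reshape(n_ch, n_frames, hop).mean(axis=2)
    frames = frames - frames.mean()
    stm = np.abs(np.fft.fft2(frames))
    rates_hz = np.fft.fftfreq(n_frames, d=hop / fs)
    scales = np.fft.fftfreq(n_ch)
    return stm, rates_hz, scales

# Hypothetical test signal: a 1 kHz tone amplitude-modulated at 8 Hz.
fs = 8000
time = np.arange(fs) / fs
x = (1.0 + 0.8 * np.cos(2 * np.pi * 8 * time)) * np.sin(2 * np.pi * 1000 * time)

centre_freqs = np.geomspace(200.0, 3000.0, 16)
env = gammatone_envelopes(x, fs, centre_freqs)
stm, rates_hz, scales = stm_representation(env, fs)

# Strongest positive temporal modulation rate, summed over spectral modulation.
positive = rates_hz > 0
peak_rate = rates_hz[positive][np.argmax(stm.sum(axis=0)[positive])]
print(peak_rate)
```

In this sketch the temporal modulation axis captures how fast each channel's envelope fluctuates, while the spectral modulation axis captures how the envelope pattern varies across filterbank channels. A detector would typically feed such a representation, or features derived from it, to a classifier, a stage that is outside the scope of this example.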
