Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

Khalid Zaman; Masashi Unoki

arXiv:2604.23241 · cs.SD · April 28, 2026

TL;DR

This paper introduces an auditory perception-inspired spectro-temporal modulation framework for detecting human-imitated speech, achieving high accuracy and surpassing human perception in some cases.

Contribution

The study proposes a Spectro-Temporal Modulation (STM) representation framework, derived from two cochlear filterbank models (Gammatone and Gammachirp), for detecting human-imitated speech.

Findings

01. STM representations effectively detect human-imitated speech.

02. Segmental-STM surpasses human perceptual performance.

03. The approach enhances voice authentication robustness.

Abstract

Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans, thereby preserving a higher degree of naturalness that makes imitation-based speech forgery significantly more challenging to detect using conventional acoustic or cepstral features. To overcome this challenge, this study proposes an auditory perception-based Spectro-Temporal Modulation (STM) representation framework for human-imitated speech detection. The STM representations are derived from two cochlear filterbank models: the Gammatone Filterbank (GTFB), which simulates frequency selectivity and can be regarded as a first approximation of cochlear filtering, and the Gammachirp Filterbank (GCFB), which further…
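
To make the idea of a filterbank-derived modulation representation concrete, here is a minimal sketch of the temporal-modulation half only: a small gammatone filterbank, a per-channel envelope, and the envelope's magnitude at a few modulation rates. It is not the authors' pipeline. The ERB formula, filter order 4, bandwidth factor 1.019, channel count, frequency range, and every function name are assumptions chosen for illustration, and the Gammachirp stage and the spectral-modulation axis are not covered.

```python
import cmath
import math

def erb_bandwidth(fc_hz):
    # Equivalent rectangular bandwidth in Hz (Glasberg & Moore form; assumed here).
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def erb_spaced_centres(f_lo, f_hi, n):
    # n centre frequencies equally spaced on the ERB-number scale.
    def to_erb(f):
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)
    def from_erb(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    lo, hi = to_erb(f_lo), to_erb(f_hi)
    return [from_erb(lo + (hi - lo) * i / (n - 1)) for i in range(n)]

def gammatone_ir(fc_hz, fs, dur_s=0.025, order=4, b=1.019):
    # Complex gammatone impulse response: t^(order-1) * exp(-decay*t) * exp(j*2*pi*fc*t).
    # The complex carrier lets abs() of the filter output serve as an envelope.
    decay = 2.0 * math.pi * b * erb_bandwidth(fc_hz)
    taps = round(dur_s * fs)
    return [(k / fs) ** (order - 1) * math.exp(-decay * k / fs)
            * cmath.exp(2j * math.pi * fc_hz * k / fs) for k in range(taps)]

def channel_envelope(x, ir, hop, n_frames):
    # Evaluate the FIR output only every `hop` samples to keep pure Python fast.
    # The envelope is thus sampled at fs/hop with no anti-alias filter (a shortcut).
    taps = len(ir)
    env = []
    for m in range(n_frames):
        n = taps - 1 + m * hop
        y = sum(ir[k] * x[n - k] for k in range(taps))
        env.append(abs(y))
    return env

def modulation_magnitudes(env, env_fs, rates_hz):
    # Magnitude of the mean-removed envelope at each modulation rate (direct DFT).
    mean = sum(env) / len(env)
    return [abs(sum((e - mean) * cmath.exp(-2j * math.pi * r * k / env_fs)
                    for k, e in enumerate(env))) / len(env)
            for r in rates_hz]

def modulation_map(x, fs, rates_hz, n_channels=4, hop=40, n_frames=100):
    # Rows: filterbank channels; columns: modulation rates.
    centres = erb_spaced_centres(500.0, 3000.0, n_channels)
    rows = []
    for fc in centres:
        env = channel_envelope(x, gammatone_ir(fc, fs), hop, n_frames)
        rows.append(modulation_magnitudes(env, fs / hop, rates_hz))
    return centres, rows

# Toy input: a 1 kHz tone amplitude-modulated at 8 Hz (synthetic, not speech).
fs = 8000
x = [(1.0 + 0.8 * math.cos(2 * math.pi * 8 * n / fs))
     * math.cos(2 * math.pi * 1000 * n / fs) for n in range(4800)]
rates = [2, 4, 8, 16, 32]
centres, mod_map = modulation_map(x, fs, rates)
strongest = max(range(len(centres)), key=lambda c: sum(mod_map[c]))
peak_rate = rates[mod_map[strongest].index(max(mod_map[strongest]))]
print(round(centres[strongest]), "Hz channel, peak modulation rate:", peak_rate, "Hz")
```

On the toy signal, the channel nearest the carrier should show its largest modulation magnitude at the 8 Hz rate. A practical implementation would more likely use NumPy or SciPy for the filtering and transforms; the standard library is used here only to keep the sketch dependency-free.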
