Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Duc-Tuan Truong; Ruijie Tao; Tuan Nguyen; Hieu-Thi Luong; Kong Aik Lee; Eng Siong Chng

arXiv:2406.17376·cs.SD·September 10, 2024

TL;DR

This paper introduces a Temporal-Channel Modeling (TCM) module to enhance multi-head self-attention in Transformer-based synthetic speech detectors, significantly improving detection accuracy with minimal additional parameters.

Contribution

The paper proposes a novel TCM module that captures temporal-channel dependencies in MHSA, leading to improved synthetic speech detection performance.

Findings

01. The TCM module outperforms the state-of-the-art system by 9.25% in EER on ASVspoof 2021.

02. Only 0.03M additional parameters are needed for the TCM module.

03. Utilizing both temporal and channel information yields the best detection results.
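
The findings above are stated in terms of EER (equal error rate), the operating point at which the false acceptance rate on spoofed speech equals the false rejection rate on bona fide speech. The following is a minimal sketch of how EER is commonly estimated from detection scores, assuming that a higher score means "more likely bona fide". The function name `compute_eer` and the sample scores are illustrative assumptions and are not taken from the paper or its repository.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate by sweeping a decision threshold.

    Assumption: a higher score means the detector believes the
    utterance is bona fide. Returns (eer, threshold).
    """
    # Every observed score is a candidate threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer, best_t = None, None, None
    for t in thresholds:
        # False rejection: bona fide speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer, best_t = gap, (far + frr) / 2, t
    return eer, best_t


# Made-up scores for illustration only.
bonafide = [0.9, 0.8, 0.3, 0.7]
spoof = [0.1, 0.2, 0.75, 0.4]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER is better. With only a handful of scores the two error curves may not cross exactly, which is why the sketch averages the two rates at the closest point.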

Abstract

Recent synthetic speech detectors leveraging the Transformer model outperform their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both…
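
To make the abstract's point about MHSA concrete, the following is a minimal NumPy sketch of standard multi-head self-attention over a sequence of T temporal tokens with D channels. It is not the TCM module, whose design is not described in this summary; it only illustrates that each head's attention map has shape T × T, so the weights relate time steps to one another. The function names, shapes, and random weights are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Standard MHSA. x has shape (T, D); D must be divisible by num_heads.

    Returns the output of shape (T, D) and the attention maps of
    shape (num_heads, T, T).
    """
    T, D = x.shape
    d_head = D // num_heads

    def split_heads(h):
        # (T, D) -> (num_heads, T, d_head)
        return h.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head compares every time step with every other time step.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, T, T)
    attn = softmax(scores, axis=-1)

    context = attn @ v  # (num_heads, T, d_head)
    merged = context.transpose(1, 0, 2).reshape(T, D)
    return merged @ w_o, attn


# Illustrative sizes and random weights (assumptions, not paper values).
rng = np.random.default_rng(0)
T, D, H = 6, 8, 2
x = rng.standard_normal((T, D))
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, H)
print(out.shape, attn.shape)
```

Each row of an attention map sums to 1 over the time axis, which is the sense in which the attention weights model temporal relationships between tokens.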

Code & Models

Repositories

ductuantruong/tcm_add (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques

Methods: Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer · Absolute Position Encodings