Convolution-Based Channel-Frequency Attention for Text-Independent Speaker Verification

Jingyu Li; Yusheng Tian; Tan Lee

arXiv:2210.17310·eess.AS·November 1, 2022·1 cites

Open Access

TL;DR

This paper introduces a lightweight 2D convolution-based attention module, C2D-Att, that enhances speaker verification by producing fine-grained channel-frequency attention maps, leading to improved performance on VoxCeleb datasets.

Contribution

The paper proposes a novel convolution-based attention module, C2D-Att, integrated into ResNet, which efficiently captures channel and frequency information for better speaker embedding.

Findings

1. C2D-Att outperforms other attention methods in speaker verification.

2. The model achieves state-of-the-art results on VoxCeleb datasets.

3. The approach is robust across different model sizes.

Abstract

Deep convolutional neural networks (CNNs) have been applied to extracting speaker embeddings with significant success in speaker verification. Incorporating the attention mechanism has been shown to be effective in improving the model performance. This paper presents an efficient two-dimensional convolution-based attention module, namely C2D-Att. The interaction between the convolution channel and frequency is involved in the attention calculation by lightweight convolution layers. This requires only a small number of parameters. Fine-grained attention weights are produced to represent channel and frequency-specific information. The weights are imposed on the input features to improve the representation ability for speaker modeling. The C2D-Att is integrated into a modified version of ResNet for speaker embedding extraction. Experiments are conducted on VoxCeleb datasets. The results show…
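
To make the idea in the abstract concrete, here is a minimal NumPy sketch of a channel-frequency attention step. It is not the authors' implementation. The abstract does not give the layer layout, so several choices here are assumptions: features are averaged over time to form a channel-by-frequency map, a single 3x3 kernel stands in for the "lightweight convolution layers", and a sigmoid maps the result to weights in (0, 1). The names `conv2d_same` and `channel_frequency_attention` are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding.

    Assumes an odd-sized kernel so the output has the same shape as x.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))  # zero padding by default
    out = np.zeros(x.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def channel_frequency_attention(features, kernel):
    """Reweight a (channels, frequency, time) feature map.

    Assumed steps (not confirmed by the source):
      1. average over time to get a (channels, frequency) map,
      2. run a small 2D convolution across the channel and frequency axes,
      3. squash with a sigmoid to get one weight per (channel, frequency) pair,
      4. multiply the weights onto the input, broadcasting over time.
    """
    pooled = features.mean(axis=-1)            # (C, F)
    logits = conv2d_same(pooled, kernel)       # (C, F)
    weights = 1.0 / (1.0 + np.exp(-logits))    # sigmoid, values in (0, 1)
    return features * weights[:, :, None], weights

# Toy input: 8 channels, 10 frequency bins, 20 time frames.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 10, 20))
kernel = rng.standard_normal((3, 3)) * 0.1     # 9 parameters in this sketch

out, weights = channel_frequency_attention(features, kernel)
```

In this sketch the attention map has one weight per channel-frequency pair, which is what makes it finer-grained than a channel-only weighting, and the only learnable parameters are the nine kernel entries. A trained module would learn the kernel rather than draw it at random.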

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Batch Normalization · 1x1 Convolution · Max Pooling · Average Pooling · Residual Connection · Bottleneck Residual Block · Residual Block · Convolution · Global Average Pooling