DT-SV: A Transformer-based Time-domain Approach for Speaker Verification
Nan Zhang, Jianzong Wang, Zhenhou Hong, Chendong Zhao, Xiaoyang Qu, Jing Xiao

TL;DR
This paper introduces DT-SV, a Transformer-based speaker verification model that employs a novel diffluence loss and a learnable time-domain feature extractor, achieving faster training and improved accuracy.
Contribution
The paper proposes a new Transformer-based SV approach with diffluence loss and a learnable feature extractor, enhancing embedding quality and training efficiency.
Findings
Achieves higher accuracy than existing models.
Faster training speed due to novel architecture.
Improved speaker embedding quality.
Abstract
Speaker verification (SV) aims to determine whether the speaker of a test utterance is the same as the speaker of the reference speech. In the past few years, extracting speaker embeddings with deep neural networks has become the mainstream approach for SV systems. Recently, various attention mechanisms and Transformer networks have been widely explored in the SV field. However, directly applying the original Transformer to SV may waste frame-level information in the output features, which could limit the capacity and discriminative power of the speaker embeddings. Therefore, we propose an approach that derives utterance-level speaker embeddings via a Transformer architecture, using a novel loss function named diffluence loss to integrate the feature information of different Transformer layers. The diffluence loss aims to aggregate frame-level features into an utterance-level…
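To make the verification task in the abstract concrete, here is a minimal sketch of the generic embedding-based decision step: frame-level features are aggregated into one utterance-level vector, and two utterances are compared by cosine similarity. This is not the paper's method. Mean pooling stands in for whatever aggregation DT-SV learns, the diffluence loss is not reproduced, and the function names and the 0.5 threshold are hypothetical choices made only for illustration.

```python
import math


def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector.

    Assumption: plain mean pooling, used here only as a stand-in for a
    learned frame-to-utterance aggregation.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def same_speaker(ref_frames, test_frames, threshold=0.5):
    """Return (decision, score) for a reference/test utterance pair.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out trials.
    """
    score = cosine_similarity(mean_pool(ref_frames), mean_pool(test_frames))
    return score >= threshold, score


# Toy 2-dimensional "features" for illustration only.
reference = [[1.0, 0.0], [1.0, 0.2]]
similar_test = [[0.9, 0.1]]
different_test = [[0.0, 1.0]]

accepted, _ = same_speaker(reference, similar_test)
rejected, _ = same_speaker(reference, different_test)
```

In a real system the frame-level features would come from a trained network, and the pooled vector would be the speaker embedding that the training loss shapes.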
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Layer Normalization · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout
