T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model

Yanpei Shi; Mingjie Chen; Qiang Huang; Thomas Hain

arXiv:2010.16071·cs.SD·November 2, 2020·6 cites

Open Access

TL;DR

This paper introduces T-vectors, a hierarchical transformer-based model with memory for weakly supervised multi-speaker identification, demonstrating improved accuracy over existing methods on synthetic datasets.

Contribution

The paper proposes a novel hierarchical transformer model with a memory mechanism for speaker identification without explicit speaker localization.

Findings

01

Achieved 13.3% and 10.5% relative improvements over baseline methods.

02

Adding the memory mechanism yielded relative improvements of 10.6% and 7.7% over the same model without memory.

03

Effective on artificial datasets derived from SWBC and VoxCeleb1.

Abstract

Identifying multiple speakers without knowing where each speaker's voice occurs in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and a memory mechanism to address this problem. The proposed model contains a frame-level encoder and a segment-level encoder, both of which use the transformer encoder block. The multi-head attention mechanism in the transformer structure can better capture different speaker properties when the input utterance contains multiple speakers. The memory mechanism used in the frame-level encoders builds a recurrent connection that better captures long-term speaker features. The experiments are conducted on artificial datasets based on the Switchboard Cellular Part 1 (SWBC) and VoxCeleb1 datasets. In different data construction scenarios (Concat and Overlap), the proposed model shows better performance…
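
The two-level structure described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes single-head attention without learned projections (the paper uses multi-head transformer encoder blocks), a memory that simply reuses the previous segment's hidden states as extra keys and values, mean pooling at both levels, and a sigmoid output for multi-label speaker scores. All function names, sizes, and the segment length are hypothetical.

```python
import numpy as np

def attention(queries, keys_values):
    """Single-head scaled dot-product attention (simplification: no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def frame_level_encode(frames, segment_len):
    """Encode frames segment by segment; returns one pooled vector per segment."""
    memory = np.zeros((0, frames.shape[1]))  # empty memory before the first segment
    segment_vectors = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        # Assumed memory mechanism: previous hidden states join the keys/values.
        context = np.concatenate([memory, segment], axis=0)
        hidden = segment + attention(segment, context)  # residual connection
        segment_vectors.append(hidden.mean(axis=0))     # assumed mean pooling
        memory = hidden  # recurrent connection to the next segment
    return np.stack(segment_vectors)

def segment_level_encode(segment_vectors):
    """Self-attention over segment vectors, pooled to one utterance embedding."""
    hidden = segment_vectors + attention(segment_vectors, segment_vectors)
    return hidden.mean(axis=0)

def speaker_scores(embedding, weight, bias):
    """Assumed multi-label head: one independent sigmoid score per speaker."""
    logits = embedding @ weight + bias
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))        # 50 frames of 16-dim features (made-up sizes)
weight = rng.normal(size=(16, 4)) * 0.1   # 4 hypothetical enrolled speakers
bias = np.zeros(4)

segments = frame_level_encode(frames, segment_len=20)
embedding = segment_level_encode(segments)
scores = speaker_scores(embedding, weight, bias)
```

Only an utterance-level set of speaker labels would be needed to train such a head, which is what makes the setting weakly supervised: no frame-level annotation of who speaks when is required.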

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention