T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model
Yanpei Shi, Mingjie Chen, Qiang Huang, Thomas Hain

TL;DR
This paper introduces T-vectors, a hierarchical transformer-based model with memory for weakly supervised multi-speaker identification, demonstrating improved accuracy over existing methods on synthetic datasets.
Contribution
The paper proposes a novel hierarchical transformer model with a memory mechanism for speaker identification without explicit speaker localization.
Findings
Achieved 13.3% and 10.5% relative improvements over baseline methods.
Memory mechanism contributed to 10.6% and 7.7% performance gains.
Effective on artificial datasets derived from SWBC and Voxceleb1.
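
The figures above are relative, not absolute, improvements. The short sketch below shows the difference, assuming a higher-is-better metric such as accuracy; this summary does not state which metric the percentages refer to. The function name and the example accuracies are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline, proposed):
    """Relative gain of `proposed` over `baseline` for a higher-is-better metric."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (proposed - baseline) / baseline


# Hypothetical accuracies, chosen only for illustration (not results from the paper).
baseline_acc = 0.80
proposed_acc = 0.88

gain = relative_improvement(baseline_acc, proposed_acc)
# An 8-point absolute gain on a baseline of 80% is a 10% relative gain.
print(f"absolute: {proposed_acc - baseline_acc:.1%}, relative: {gain:.1%}")
# prints "absolute: 8.0%, relative: 10.0%"
```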
Abstract
Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and a memory mechanism to address this problem. The proposed model contains a frame-level encoder and a segment-level encoder, both of which use the transformer encoder block. The multi-head attention mechanism in the transformer structure can better capture different speaker properties when the input utterance contains multiple speakers. The memory mechanism used in the frame-level encoders can build a recurrent connection that better captures long-term speaker features. The experiments are conducted on artificial datasets based on the Switchboard Cellular Part 1 (SWBC) and Voxceleb1 datasets. In different data construction scenarios (Concat and Overlap), the proposed model shows better performance…
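
The two-level structure described in the abstract can be illustrated with a heavily simplified sketch. It is not the authors' implementation: it uses a single attention head with identity projections (no learned weights), mean pooling, and a memory that simply caches the last few encoded frames of the previous segment. All function names, `segment_len`, and `memory_len` are assumptions made for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention; keys and values are both `context`.

    Simplification: one head, identity projections, no learned parameters.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)])
    return outputs


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def encode_utterance(frames, segment_len=4, memory_len=2):
    """Frame-level encoding per segment, then segment-level encoding."""
    segment_vectors = []
    memory = []  # assumed design: encoded frames carried over from the previous segment
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        # Frames of the current segment attend to the memory plus the segment itself.
        encoded = attend(segment, memory + segment)
        segment_vectors.append(mean_pool(encoded))
        memory = encoded[-memory_len:] if memory_len > 0 else []
    # Segment-level encoder: segments attend to each other, then pool to one vector.
    utterance_vector = mean_pool(attend(segment_vectors, segment_vectors))
    return segment_vectors, utterance_vector


if __name__ == "__main__":
    random.seed(0)
    # Ten random 4-dimensional "frames" standing in for acoustic features.
    frames = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]
    segment_vectors, utterance_vector = encode_utterance(frames)
    print(len(segment_vectors), len(utterance_vector))  # prints "3 4"
```

In this sketch the memory is what lets a later segment see information from earlier ones. With `memory_len=0`, each segment is encoded in isolation and only the segment-level step combines them.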
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention
