Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification
Yanpei Shi, Qiang Huang, Thomas Hain

TL;DR
This paper introduces a hierarchical attention network trained with weak labels to identify multiple speakers in recordings without explicit voice location annotations, improving performance over baselines.
Contribution
It proposes a novel hierarchical attention network architecture for weakly supervised speaker identification, combining frame-level and segment-level encoders with attention mechanisms.
Findings
Outperforms baseline methods on artificial datasets.
Effective in both overlapped and non-overlapped speech conditions.
Segmentation improves identification accuracy slightly.
Abstract
Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. In this paper, a hierarchical attention network is proposed to solve a weakly labelled speaker identification problem. The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker-related information locally and globally. Speech streams are segmented into fragments. The frame-level encoder with attention learns features, highlights the target-related frames locally, and outputs a fragment-based embedding. The segment-level encoder works with a second attention layer to emphasize the fragments probably related to target speakers. The global information is finally collected from the segment-level module to predict speakers via a classifier. To evaluate the effectiveness of the proposed approach, artificial datasets…
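The two-level structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the learned frame-level and segment-level encoders are omitted, the attention is a plain dot-product against a query vector, the weights are random rather than trained, and the per-speaker sigmoid output is an assumption made here so that several speakers can be flagged at once. All function and variable names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(vectors, query):
    """Weighted average of vectors; weights are softmax(dot(vector, query))."""
    weights = softmax([dot(v, query) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


def split_into_fragments(frames, fragment_len):
    """Cut the frame sequence into consecutive fragments (last one may be shorter)."""
    return [frames[i:i + fragment_len] for i in range(0, len(frames), fragment_len)]


def hierarchical_attention(frames, fragment_len, frame_query, segment_query, class_weights):
    # Level 1: attention over the frames inside each fragment -> one embedding per fragment.
    fragments = split_into_fragments(frames, fragment_len)
    fragment_embeddings = [attention_pool(f, frame_query)[0] for f in fragments]
    # Level 2: attention over fragment embeddings -> one recording-level embedding.
    recording, segment_weights = attention_pool(fragment_embeddings, segment_query)
    # Assumed classifier: one independent sigmoid score per candidate speaker.
    probs = [1.0 / (1.0 + math.exp(-dot(w, recording))) for w in class_weights]
    return probs, segment_weights


# Toy input: 10 random 4-dimensional "frames", 3 candidate speakers (illustrative values only).
rng = random.Random(0)
dim, n_frames, n_speakers = 4, 10, 3
frames = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_frames)]
frame_query = [rng.gauss(0, 1) for _ in range(dim)]
segment_query = [rng.gauss(0, 1) for _ in range(dim)]
class_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_speakers)]

probs, segment_weights = hierarchical_attention(
    frames, 4, frame_query, segment_query, class_weights
)
```

In this sketch `segment_weights` plays the role of the second attention layer: it indicates which fragments contribute most to the recording-level prediction, even though only recording-level labels would be available during weakly supervised training.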
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
