Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

Yanpei Shi; Qiang Huang; Thomas Hain

arXiv:2005.07817·eess.AS·August 28, 2020·1 cites

TL;DR

This paper introduces a hierarchical attention network trained with weak labels to identify multiple speakers in recordings without explicit voice location annotations, improving performance over baselines.

Contribution

It proposes a novel hierarchical attention network architecture for weakly supervised speaker identification, combining frame-level and segment-level encoders with attention mechanisms.

Findings

01. Outperforms baseline methods on artificial datasets.

02. Effective in both overlapped and non-overlapped speech conditions.

03. Segmentation improves identification accuracy slightly.

Abstract

Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. In this paper, a hierarchical attention network is proposed to solve a weakly labelled speaker identification problem. The hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker-related information both locally and globally. Speech streams are segmented into fragments. The frame-level encoder with attention learns features, highlights the target-related frames locally, and outputs a fragment-based embedding. The segment-level encoder works with a second attention layer to emphasize the fragments most likely related to the target speakers. The global information is finally collected from the segment-level module to predict speakers via a classifier. To evaluate the effectiveness of the proposed approach, artificial datasets…
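
The two-level attention pooling described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the single tanh projections standing in for the frame-level and segment-level encoders, the non-overlapping fragmentation, the sigmoid output, the random weights, and every name (`attention_pool`, `hierarchical_attention_forward`, `W_frame`, `v_frame`, `W_seg`, `v_seg`, `W_cls`) are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(h, v):
    # h: (..., T, H) hidden states, v: (H,) scoring vector (hypothetical).
    # Returns the attention-weighted sum over T and the weights themselves.
    alpha = softmax(h @ v, axis=-1)                 # (..., T)
    pooled = (alpha[..., None] * h).sum(axis=-2)    # (..., H)
    return pooled, alpha


def hierarchical_attention_forward(frames, frag_len, params):
    # frames: (T, D) acoustic features for one recording.
    T, D = frames.shape
    n_frag = T // frag_len
    # Assumption: non-overlapping fragments; trailing frames are dropped.
    frags = frames[: n_frag * frag_len].reshape(n_frag, frag_len, D)

    # Frame level: a stand-in encoder, then attention within each fragment.
    h_frame = np.tanh(frags @ params["W_frame"])    # (n_frag, frag_len, H)
    frag_emb, frame_alpha = attention_pool(h_frame, params["v_frame"])

    # Segment level: a stand-in encoder, then attention across fragments.
    h_seg = np.tanh(frag_emb @ params["W_seg"])     # (n_frag, H)
    utt_emb, seg_alpha = attention_pool(h_seg, params["v_seg"])

    # Classifier: one score per speaker (sigmoid is an assumption here).
    logits = utt_emb @ params["W_cls"]              # (n_speakers,)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, frame_alpha, seg_alpha


rng = np.random.default_rng(0)
D, H, n_speakers, frag_len = 20, 16, 5, 25
params = {
    "W_frame": rng.normal(scale=0.1, size=(D, H)),
    "v_frame": rng.normal(scale=0.1, size=H),
    "W_seg": rng.normal(scale=0.1, size=(H, H)),
    "v_seg": rng.normal(scale=0.1, size=H),
    "W_cls": rng.normal(scale=0.1, size=(H, n_speakers)),
}
frames = rng.normal(size=(210, D))                  # 210 frames -> 8 fragments
probs, frame_alpha, seg_alpha = hierarchical_attention_forward(
    frames, frag_len, params
)
print(probs.shape, frame_alpha.shape, seg_alpha.shape)
```

In this sketch only a recording-level label would be needed to train the weights, which is the sense in which the setup is weakly supervised; the attention weights `frame_alpha` and `seg_alpha` indicate which frames and fragments the pooled embedding relies on.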

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing