Speaker Recognition Using Isomorphic Graph Attention Network Based Pooling on Self-Supervised Representation

Zirui Ge; Xinzhou Xu; Haiyan Guo; Tingting Wang; Zhen Yang

arXiv:2308.04666 · cs.SD · February 27, 2024 · 1 citation

TL;DR

This paper introduces IsoGAT, a novel graph attention pooling method for speaker recognition that leverages self-supervised speech representations, improving aggregation and recognition accuracy over existing methods.

Contribution

It proposes IsoGAT, an isomorphic graph attention network, for more effective pooling of self-supervised speech representations in speaker recognition.

Findings

01. IsoGAT outperforms existing pooling methods on VoxCeleb datasets.

02. The approach enhances speaker recognition accuracy using self-supervised features.

03. Experimental results validate the effectiveness of IsoGAT in real-world scenarios.

Abstract

The emergence of self-supervised representation (e.g., wav2vec 2.0) allows speaker-recognition approaches to process spoken signals through foundation models built on speech data. Nevertheless, effective fusion of the representation requires further investigation, because existing approaches rely on fixed or sub-optimal temporal pooling strategies. Despite improved strategies that consider graph learning and graph attention, non-injective aggregation still exists in these approaches, which may affect speaker-recognition performance. In this regard, we propose a speaker recognition approach using Isomorphic Graph ATtention network (IsoGAT) on self-supervised representation. The proposed approach contains three modules (representation learning, graph attention, and aggregation), jointly considering learning on the self-supervised representation and the IsoGAT. Then, we perform…
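
The "non-injective aggregation" mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's IsoGAT: the function names and the two-dimensional frame features are hypothetical, chosen only to show how an aggregator can map two different sets of frame-level features to the same pooled vector. Here, mean pooling cannot tell the two inputs apart, while sum pooling can.

```python
# Illustrative sketch only: toy frame-level features, not the paper's IsoGAT.
# All names and values below are hypothetical.

def mean_pool(frames):
    """Average each feature dimension over the frames (temporal mean pooling)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def sum_pool(frames):
    """Sum each feature dimension over the frames."""
    return [sum(col) for col in zip(*frames)]

# Two toy "utterances" whose frame features form different multisets:
# utt_b repeats every frame of utt_a twice.
utt_a = [[1.0, 0.0], [0.0, 1.0]]
utt_b = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Mean pooling is non-injective here: both inputs pool to [0.5, 0.5].
print(mean_pool(utt_a) == mean_pool(utt_b))

# Sum pooling separates them in this case: [1.0, 1.0] versus [2.0, 2.0].
print(sum_pool(utt_a) == sum_pool(utt_b))
```

Two distinct inputs collapsing to one pooled vector is the kind of information loss that motivates aggregation schemes designed to be injective; how IsoGAT achieves this is described in the paper itself, not in this sketch.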

Taxonomy

Topics: Speech Recognition and Synthesis · Text and Document Classification Technologies · Music and Audio Processing