On the use of Performer and Agent Attention for Spoken Language Identification

Jitendra Kumar Dhiman; Jainag Ambati

arXiv:2502.05841 · eess.AS · February 11, 2025

TL;DR

This paper investigates Performer and Agent-Attention mechanisms combined with statistical pooling for spoken language identification (LID), and finds that Performer-Attention outperforms standard self-attention at lower computational cost.

Contribution

The paper applies Performer and Agent-Attention mechanisms to LID and evaluates them against standard self-attention in terms of accuracy and efficiency.

Findings

01

Performer-Attention outperforms self-attention in LID tasks.

02

Agent-Attention performs comparably to, and occasionally better than, self-attention.

03

Performer-Attention is more computationally efficient than self-attention.
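
The efficiency finding follows from how Performer-style attention is computed: the softmax kernel is approximated with positive random features, so keys and values can be summarized once and reused for every query, giving cost linear in sequence length. The block below is a minimal NumPy sketch of that idea, not the authors' implementation. The function names, the feature count, and the seed are hypothetical, and it assumes plain Gaussian projections rather than the orthogonal random features used in the original Performer work.

```python
import numpy as np

def positive_random_features(x, proj):
    """Map rows of x (T, d) to positive features (T, m) using proj (m, d)."""
    m = proj.shape[0]
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    # exp(w.x - |x|^2 / 2) is strictly positive and, in expectation over
    # w ~ N(0, I), its inner product for two inputs equals exp(x.y).
    return np.exp(x @ proj.T - half_sq_norm) / np.sqrt(m)

def performer_attention(q, k, v, num_features=64, seed=0):
    """Linear-time approximation of softmax(q k^T / sqrt(d)) v.

    Assumption: plain (non-orthogonal) Gaussian projections, no extra
    numerical stabilization; intended for small illustrative inputs.
    """
    d = q.shape[-1]
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((num_features, d))
    scale = d ** -0.25  # splits the usual 1/sqrt(d) between q and k
    q_feat = positive_random_features(q * scale, proj)  # (T_q, m)
    k_feat = positive_random_features(k * scale, proj)  # (T_k, m)
    kv = k_feat.T @ v                          # (m, d_v), computed once
    normalizer = q_feat @ k_feat.sum(axis=0)   # (T_q,), row-wise weight sums
    return (q_feat @ kv) / normalizer[:, None]
```

Because the T_q x T_k attention matrix is never formed, memory and compute grow linearly with the number of frames. The approximation quality depends on `num_features`, which is a tuning choice in this sketch.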

Abstract

One approach to language identification (LID) derives speech representations from models pre-trained with self-supervised learning and then fine-tunes the model for the LID task. State-of-the-art approaches for LID use an attention-based statistical pooling layer to aggregate contextual information across time frames of the embedding vectors extracted from the pre-trained model. In this paper, we explore recently proposed attention mechanisms, namely performer and agent-attention, in conjunction with the statistical pooling layer. The LID experiments are performed on three datasets: VoxPopuli, FLEURS, and VoxLingua. We compare their performance against vanilla self-attention. Our findings suggest that performer-attention outperforms self-attention and agent-attention exhibits comparable or occasionally superior performance to…
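
To make the pooling step concrete, the sketch below shows one common form of attention-weighted statistical pooling: frame-level embeddings are reduced to a single utterance-level vector by concatenating an attention-weighted mean and standard deviation. It is an illustration under stated assumptions, not the layer used in the paper. The function name is hypothetical, and the per-frame attention scores are taken as given rather than produced by a specific attention mechanism.

```python
import numpy as np

def attentive_stats_pooling(frames, scores, eps=1e-8):
    """Pool frame embeddings (T, d) into one vector of size 2 * d.

    scores: (T,) unnormalized per-frame attention scores (assumed given).
    """
    # Softmax over time, shifted by the max for numerical stability.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    mean = weights @ frames                        # weighted mean, (d,)
    var = weights @ (frames ** 2) - mean ** 2      # weighted variance, (d,)
    std = np.sqrt(np.clip(var, eps, None))         # clip avoids sqrt of <= 0
    return np.concatenate([mean, std])
```

With uniform scores this reduces to ordinary mean and standard deviation pooling over time; non-uniform scores let the pooled vector emphasize frames the attention mechanism considers informative.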

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need · Fast Attention Via Positive Orthogonal Random Features · Performer