Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech

Quan Wang; Yang Yu; Jason Pelecanos; Yiling Huang; Ignacio Lopez Moreno

arXiv:2202.12163·eess.AS·May 3, 2022

TL;DR

This paper presents a conformer-based streaming language identification system with attentive temporal pooling. It reports higher accuracy than LSTM- and transformer-based models on long-form speech, and shows that an existing model can be adapted to a new domain without retraining its parameters.

Contribution

The paper introduces an attentive temporal pooling mechanism for conformer models and explores domain adaptation methods for streaming language identification.

Findings

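1. Conformer-based models significantly outperform LSTM- and transformer-based models under the model-size constraints studied.
2. Attentive temporal pooling improves model accuracy.
3. Domain adaptation improves accuracy on a new domain without retraining the model parameters.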

Abstract

In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism to allow the model to carry information in long-form audio via a recurrent form, such that the inference can be performed in a streaming fashion. Additionally, we investigate two domain adaptation approaches to allow adapting an existing language identification model without retraining the model parameters for a new domain. We perform a comparative study of different model topologies under different constraints of model size, and find that conformer-based models significantly outperform LSTM and transformer based models. Our experiments also show that attentive temporal pooling and domain adaptation improve model accuracy.
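The abstract does not give the pooling equations, so the following is only a minimal sketch of how attention-weighted pooling can be written in a recurrent form that supports streaming. It assumes each frame embedding gets a positive scalar weight from a small scorer, and that the pooled vector is the weighted mean of all frames seen so far, kept as a running numerator and denominator. The scorer `sigmoid(v . tanh(W x + b))`, the class name, and all parameter names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def frame_weight(x: Vector, w: Sequence[Vector], b: Vector, v: Vector) -> float:
    """Hypothetical scorer (an assumption): sigmoid(v . tanh(W x + b)), in (0, 1)."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
              for row, bi in zip(w, b)]
    score = sum(vi * hi for vi, hi in zip(v, hidden))
    return 1.0 / (1.0 + math.exp(-score))


class StreamingAttentivePooling:
    """Keeps a running attention-weighted mean of frame embeddings."""

    def __init__(self, w: Sequence[Vector], b: Vector, v: Vector, dim: int) -> None:
        self.w, self.b, self.v = w, b, v
        self.weighted_sum: List[float] = [0.0] * dim  # running numerator
        self.weight_total = 0.0                       # running denominator

    def update(self, x: Vector) -> List[float]:
        """Consume one frame and return the pooled vector over all frames so far."""
        a = frame_weight(x, self.w, self.b, self.v)
        self.weighted_sum = [s + a * xi for s, xi in zip(self.weighted_sum, x)]
        self.weight_total += a
        return [s / self.weight_total for s in self.weighted_sum]


def offline_attentive_pooling(frames: Sequence[Vector], w: Sequence[Vector],
                              b: Vector, v: Vector) -> List[float]:
    """Non-streaming reference: weighted mean computed over the whole utterance."""
    weights = [frame_weight(x, w, b, v) for x in frames]
    total = sum(weights)
    dim = len(frames[0])
    return [sum(a * x[d] for a, x in zip(weights, frames)) / total
            for d in range(dim)]
```

Under these assumptions, calling `update` once per frame yields the same pooled vector as `offline_attentive_pooling` on the full sequence, while storing only one vector and one scalar between frames. That constant-size state is what makes this form suitable for long-form audio.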

Code & Models

Repositories

google/speaker-id/tree/master/lingvo (official)

Models

tflite-hub/conformer-lang-id (model, 53 downloads)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory