Towards A Unified Conformer Structure: from ASR to ASV Task

Dexin Liao; Tao Jiang; Feng Wang; Lin Li; Qingyang Hong

arXiv:2211.07201 · eess.AS · January 18, 2023

TL;DR

This paper adapts the Conformer architecture from speech recognition to speaker verification, demonstrating competitive performance and potential for unifying ASR and ASV tasks through transfer learning and improved generalization techniques.

Contribution

The paper introduces a minimal modification of the Conformer for ASV, applies transfer learning from ASR, and evaluates its effectiveness and inference speed, highlighting its potential for unified speech tasks.

Findings

01. Conformer-based ASV achieves competitive results compared to ECAPA-TDNN.
02. Transfer learning from ASR improves ASV performance by 11% in EER.
03. The approach demonstrates potential for unifying ASR and ASV architectures.
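The equal error rate (EER) behind the second finding is the operating point where the false acceptance rate equals the false rejection rate. The sketch below is a minimal illustration, not the paper's scoring code: the function names, the threshold sweep, and the toy scores are all assumptions made for the example, and it reads the 11% figure as a relative reduction, which the summary above does not state explicitly.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, improved_eer):
    """Relative EER reduction, e.g. 0.11 for an 11% relative improvement."""
    return (baseline_eer - improved_eer) / baseline_eer


# Hypothetical trial scores, chosen only to exercise the functions.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
print(relative_reduction(1.00, 0.89))
```

Production toolkits usually interpolate along the ROC curve rather than picking the closest observed threshold, so values from this sketch can differ slightly from reported EERs.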

Abstract

The Transformer has achieved extraordinary performance in Natural Language Processing and Computer Vision tasks thanks to its powerful self-attention mechanism, and its variant, the Conformer, has become a state-of-the-art architecture in Automatic Speech Recognition (ASR). However, the mainstream architecture for Automatic Speaker Verification (ASV) is the Convolutional Neural Network, and there is still much room for research on Conformer-based ASV. In this paper, firstly, we adapt the Conformer architecture from ASR to ASV with very minor changes. We adopt the Length-Scaled Attention (LSA) method and Sharpness-Aware Minimization (SAM) to improve model generalization. Experiments conducted on VoxCeleb and CN-Celeb show that our Conformer-based ASV achieves competitive performance compared with the popular ECAPA-TDNN. Secondly, inspired by the transfer learning strategy, ASV…
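Sharpness-Aware Minimization, named in the abstract, updates the weights using the gradient taken at a nearby worst-case point rather than at the current weights. The sketch below shows that two-step update on a toy quadratic, assuming plain gradient descent as the base optimizer. It is not the paper's training code: the objective, `lr`, `rho`, and all function names are hypothetical choices for illustration.

```python
import math


def loss(w):
    # Hypothetical toy objective with one flat and one sharp direction.
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def grad(w):
    # Analytic gradient of the toy objective above.
    return [w[0], 10.0 * w[1]]


def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update: perturb uphill, then descend with that gradient."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12  # avoid division by zero
    # Step 1: move to the approximate worst-case point inside an
    # L2 ball of radius rho around the current weights.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: evaluate the gradient at the perturbed point and apply it
    # to the original (unperturbed) weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w)
print(loss(w))
```

Each update costs two gradient evaluations instead of one, which is the usual price of SAM in training time.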

Code & Models

Repositories

Snowdar/asv-subtools
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing