Towards A Unified Conformer Structure: from ASR to ASV Task
Dexin Liao, Tao Jiang, Feng Wang, Lin Li, Qingyang Hong

TL;DR
This paper adapts the Conformer architecture from speech recognition to speaker verification, demonstrating competitive performance and potential for unifying ASR and ASV tasks through transfer learning and improved generalization techniques.
Contribution
The paper introduces a minimal modification of the Conformer for ASV, applies transfer learning from ASR, and evaluates its effectiveness and inference speed, highlighting its potential for unified speech tasks.
Findings
Conformer-based ASV achieves competitive results compared to ECAPA-TDNN.
Transfer learning from ASR improves ASV performance by 11% in EER.
The approach demonstrates potential for unifying ASR and ASV architectures.
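For readers unfamiliar with the metric, here is a minimal sketch of how an equal error rate (EER) can be computed from verification scores, and how a percentage improvement reads if it is relative. It assumes the 11% figure above is a relative reduction, which the summary does not state. The scores and EER values below are made-up illustrations, not results from the paper, and the function names are hypothetical.

```python
def compute_eer(target_scores, nontarget_scores):
    """Return (eer, threshold) by sweeping thresholds over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.11 means an 11% relative improvement."""
    return (baseline_eer - new_eer) / baseline_eer


# Illustrative scores only (not from the paper).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(targets, nontargets)

# Hypothetical EERs: a drop from 1.00% to 0.89% is an 11% relative reduction.
improvement = relative_reduction(1.00, 0.89)
```

Under the relative reading, the absolute change in EER is small even though the percentage sounds large.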
Abstract
The Transformer has achieved extraordinary performance in Natural Language Processing and Computer Vision tasks thanks to its powerful self-attention mechanism, and its variant, the Conformer, has become a state-of-the-art architecture in Automatic Speech Recognition (ASR). However, the mainstream architecture for Automatic Speaker Verification (ASV) is the Convolutional Neural Network, and there is still much room for research on Conformer-based ASV. In this paper, firstly, we modify the Conformer architecture from ASR to ASV with very minor changes. The Length-Scaled Attention (LSA) method and Sharpness-Aware Minimization (SAM) are adopted to improve model generalization. Experiments conducted on VoxCeleb and CN-Celeb show that our Conformer-based ASV achieves competitive performance compared with the popular ECAPA-TDNN. Secondly, inspired by the transfer learning strategy, ASV…
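As background on Sharpness-Aware Minimization, the following is a minimal sketch of the generic two-step SAM update on a toy quadratic loss. It is not the paper's training code: the loss, learning rate, `rho`, and function names are all assumptions chosen for illustration. Each step first moves the weights a distance `rho` along the normalized gradient, then applies the gradient measured at that perturbed point to the original weights.

```python
import math


def loss(w):
    # Toy quadratic loss standing in for a real training objective (assumption).
    return 0.5 * (3.0 * w[0] ** 2 + w[1] ** 2)


def grad(w):
    # Analytic gradient of the toy loss above.
    return [3.0 * w[0], w[1]]


def sam_step(w, lr=0.1, rho=0.05):
    """One SAM update: perturb towards higher loss, then descend from w."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # Already at a stationary point; nothing to do.
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to the original w.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, -2.0]
initial_loss = loss(w)
for _ in range(50):
    w = sam_step(w)
final_loss = loss(w)
```

In practice the second gradient requires an extra forward and backward pass per step, so SAM roughly doubles the training cost relative to a plain gradient step.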
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
