PCF: ECAPA-TDNN with Progressive Channel Fusion for Speaker Verification
Zhenduo Zhao, Zhuo Li, Wenchao Wang, Pengyuan Zhang

TL;DR
This paper introduces a progressive channel fusion strategy and model enlargement techniques for ECAPA-TDNN speaker verification, reaching an EER of 0.718 and a minDCF(0.01) of 0.0858 on VoxCeleb1-O.
Contribution
The paper proposes a progressive channel fusion approach and enlarges ECAPA-TDNN by extending its depth and adding branches, so that the model better preserves time-frequency relevance and improves speaker verification performance.
Findings
Achieved EER of 0.718 on VoxCeleb1-O.
Reduced minDCF(0.01) to 0.0858.
Improved EER and minDCF(0.01) by a relative 16.1% and 19.5%, respectively, over ECAPA-TDNN-large.
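The relative figures above can be sanity-checked with a little arithmetic. The Python sketch below inverts the relative-improvement formula to recover the ECAPA-TDNN-large baselines implied by the reported numbers. The function names are hypothetical, and the implied baselines are derived values, not figures quoted from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative reduction of an error metric, as a fraction of the baseline."""
    return (baseline - new) / baseline


def implied_baseline(new: float, rel_improvement: float) -> float:
    """Baseline value implied by a new value and a relative improvement."""
    return new / (1.0 - rel_improvement)


# Reported values for the proposed model on VoxCeleb1-O.
eer, min_dcf = 0.718, 0.0858

# Baselines implied by the reported 16.1% and 19.5% relative improvements
# (derived here for illustration, not taken from the paper).
eer_base = implied_baseline(eer, 0.161)
dcf_base = implied_baseline(min_dcf, 0.195)

print(f"implied baseline EER: {eer_base:.3f}")
print(f"implied baseline minDCF(0.01): {dcf_base:.4f}")
```

Under these assumptions the implied baselines come out at roughly 0.856 for EER and 0.107 for minDCF(0.01). Both are rounded, since the reported percentages are themselves rounded.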
Abstract
ECAPA-TDNN is currently the most popular TDNN-series model for speaker verification and refreshed the state-of-the-art (SOTA) performance of TDNN models. However, its one-dimensional convolution has a global receptive field over the feature channel, which destroys the time-frequency relevance of the spectrogram. Moreover, ECAPA-TDNN has only five layers; this structure, much shallower than ResNet, restricts its capability to generate deep representations. To further improve ECAPA-TDNN, we first propose a progressive channel fusion strategy that splits the spectrogram across the feature channel and gradually expands the receptive field through the network. Second, we enlarge the model by extending its depth and adding branches. Our proposed model achieves an EER of 0.718 and a minDCF(0.01) of 0.0858 on VoxCeleb1-O (vox1o), relative improvements of 16.1% and 19.5% over ECAPA-TDNN-large.
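To make the idea of splitting the feature axis and gradually widening the receptive field concrete, here is a minimal Python sketch. It only tracks which frequency bins can influence a given bin at each stage; it is not the authors' network. The 80-bin input, the `[8, 4, 2, 1]` group schedule, and both function names are assumptions chosen for illustration. In a real model this partitioning would more likely be realised with grouped convolutions.

```python
def split_bands(num_bins: int, num_groups: int) -> list[range]:
    """Partition the feature (frequency) axis into equal, contiguous bands."""
    if num_bins % num_groups != 0:
        raise ValueError("num_bins must be divisible by num_groups")
    width = num_bins // num_groups
    return [range(g * width, (g + 1) * width) for g in range(num_groups)]


def receptive_field(bin_index: int, num_bins: int, num_groups: int) -> range:
    """Frequency bins sharing a band with bin_index at this stage.

    Because each stage in the assumed schedule merges adjacent bands of the
    previous stage, the band also equals the cumulative receptive field.
    """
    width = num_bins // num_groups
    start = (bin_index // width) * width
    return range(start, start + width)


NUM_BINS = 80                   # assumed number of filterbank bins
GROUP_SCHEDULE = [8, 4, 2, 1]   # assumed bands per stage, local to global

for stage, groups in enumerate(GROUP_SCHEDULE, start=1):
    rf = receptive_field(bin_index=5, num_bins=NUM_BINS, num_groups=groups)
    print(f"stage {stage}: {groups} band(s), bin 5 sees bins {rf.start}..{rf.stop - 1}")
```

With this schedule, the band containing bin 5 doubles in width at every stage and covers the whole feature axis only at the last one. A plain one-dimensional convolution, by contrast, mixes all bins from the first layer.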
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
