TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding

Ruiteng Zhang; Jianguo Wei; Xugang Lu; Wenhuan Lu; Di Jin; Junhai Xu; Lin Zhang; Yantao Ji; Jianwu Dang

arXiv:2203.09098·cs.SD·March 18, 2022·5 cites

TL;DR

This paper introduces TMS, a temporal multi-scale backbone for speaker embedding that adds multi-scale branches at little extra computational cost, improving speaker verification performance and, through re-parameterization, inference speed.

Contribution

The paper proposes the TMS model, which separates channel modeling from temporal modeling so that multi-scale features can be extracted with few additional parameters, and applies a re-parameterization technique to speed up inference.
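
The idea of re-parameterizing a multi-branch temporal block can be illustrated with a small sketch. This page does not describe how TMS performs the step, so the code below shows only a generic structural re-parameterization under stated assumptions: each branch is a single-channel 1-D temporal convolution with an odd kernel size and "same" zero padding, the branch outputs are summed, and there is no normalization or nonlinearity between them. The function names, kernel sizes, and weights are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: generic structural re-parameterization of parallel
# 1-D temporal convolutions on one channel. All names, kernel sizes, and
# weights are made-up assumptions, not the TMS authors' implementation.

def conv1d_same(signal, kernel):
    """Cross-correlate `signal` with an odd-length `kernel`, zero-padded to keep length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
        for t in range(len(signal))
    ]


def multi_branch(signal, kernels):
    """Training-time form: run every temporal scale separately, then sum the outputs."""
    outputs = [conv1d_same(signal, k) for k in kernels]
    return [sum(values) for values in zip(*outputs)]


def merge_kernels(kernels):
    """Inference-time form: fold all branches into one centre-aligned kernel."""
    largest = max(len(k) for k in kernels)
    merged = [0.0] * largest
    for k in kernels:
        offset = (largest - len(k)) // 2  # centre the smaller kernel
        for j, weight in enumerate(k):
            merged[offset + j] += weight
    return merged


signal = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]
branches = [[0.5], [0.25, 0.5, 0.25], [0.1, 0.2, 0.4, 0.2, 0.1]]  # hypothetical scales 1, 3, 5

train_time = multi_branch(signal, branches)
inference_time = conv1d_same(signal, merge_kernels(branches))
max_gap = max(abs(a - b) for a, b in zip(train_time, inference_time))
```

Because convolution is linear, the single merged kernel gives the same output as the sum of the branches (up to floating-point rounding), so under these assumptions the extra branches cost nothing at inference time beyond the largest kernel.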

Findings

1. TMS outperforms state-of-the-art models in speaker verification accuracy.

2. The re-parameterization technique accelerates inference.

3. The model keeps computational cost low despite its multi-scale branches.

Abstract

Speaker embedding is an important front-end module for extracting discriminative speaker features in the many speech applications that need speaker information. Current SOTA backbone networks for speaker embedding use multi-branch architectures to aggregate multi-scale features from an utterance into a speaker representation. However, naively adding many multi-scale branches with simple fully convolutional operations does not improve performance efficiently, because the number of model parameters and the computational complexity grow rapidly. As a result, most current state-of-the-art architectures can afford only a few branches, covering a limited number of temporal scales. To address this problem, in this paper, we propose an effective temporal multi-scale (TMS) model where multi-scale branches could be…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing