TL;DR
TARNet is a lightweight, multi-scale, temporal-aware neural architecture for closed-set speaker identification that models dependencies across various time scales to improve accuracy.
Contribution
It introduces a multi-stage temporal encoder with stage-specific dilation and an attentive pooling mechanism for enhanced speaker embedding extraction.
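To make the idea of stage-specific dilation concrete, here is a minimal sketch of how the temporal context seen by a stack of stride-1 dilated 1-D convolutions grows with the dilation rates. The kernel size and the per-stage dilation lists are hypothetical values chosen for illustration; they are not taken from the paper or its repository.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated 1-D convolutions.

    Each layer with dilation d widens the context by (kernel_size - 1) * d frames.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Hypothetical stage configurations (assumption, not the authors' settings).
stages = {
    "short": [1, 1, 1],
    "mid": [2, 2, 2],
    "long": [4, 4, 4],
}

for name, dilations in stages.items():
    print(name, receptive_field(3, dilations))
```

With the same kernel size and depth, larger dilation rates let a stage cover a longer span of frames without adding parameters, which is one common way to obtain short-, mid-, and long-term views of the same utterance.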
Findings
TARNet outperforms state-of-the-art methods on VoxCeleb1 and LibriSpeech datasets.
It maintains competitive computational complexity for practical deployment.
The code is publicly available at https://github.com/YassinTERRAF/TARNet.
Abstract
Closed-set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong performance, many existing architectures provide limited mechanisms for modeling temporal dependencies across different time scales, which can restrict the effective use of complementary short-, mid-, and long-term speaker characteristics. In this paper, we propose TARNet, a lightweight Temporal-Aware Representation Network for closed-set speaker identification. TARNet explicitly models temporal information at multiple time scales using a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated via an Attentive Statistics Pooling (ASP)…
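As a rough illustration of what attentive statistics pooling computes in general, the sketch below turns a variable-length sequence of frame features into a fixed-length vector by concatenating an attention-weighted mean and standard deviation. It is a generic pure-Python version, not the TARNet implementation: the function name is hypothetical, and the attention scores are passed in directly as an assumption, whereas in a trained model they would typically come from a small learned scoring network.

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool T frame vectors of size D into one vector of size 2 * D.

    frames: list of T feature vectors, each a list of D floats.
    scores: list of T unnormalized attention scores (assumed given here).
    """
    # Softmax over time, shifted by the max score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    # Attention-weighted mean per feature dimension.
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Attention-weighted variance: E[x^2] - (E[x])^2, floored before the sqrt.
    var = [
        sum(w * f[d] ** 2 for w, f in zip(weights, frames)) - mean[d] ** 2
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 1e-12)) for v in var]
    return mean + std


# Two frames with equal scores reduce to the plain mean and standard deviation.
pooled = attentive_stats_pooling([[1.0, 2.0], [3.0, 6.0]], [0.0, 0.0])
print(pooled)
```

Because the output length depends only on the feature dimension, utterances of different durations map to embeddings of the same size, which a closed-set classifier can then consume.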
