Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification

Jin Sob Kim; Hyun Joon Park; Wooseok Shin; Sung Won Han

arXiv:2512.22148 · cs.SD · December 30, 2025

Open Access

TL;DR

This paper introduces Layer Attentive Pooling (LAP), a time-dynamic method for aggregating multi-layer features from pre-trained speech models that improves speaker verification performance over static weighted averaging.

Contribution

The paper proposes LAP, a dynamic layer aggregation technique, and a lightweight backend model that combines LAP with Attentive Statistical Temporal Pooling (ASTP). The backend achieves state-of-the-art results on the VoxCeleb benchmark while reducing training time.

Findings

01. LAP outperforms static weighted averaging in speaker verification.

02. The proposed architecture achieves state-of-the-art performance on VoxCeleb.

03. Dynamic weighting captures speaker characteristics more effectively than static weighting.
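
To make the contrast between static and dynamic weighting concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the single scoring vector `score_vector`, and the toy tensor sizes are all hypothetical. The static variant learns one weight per layer and applies it to every frame; the dynamic variant derives a separate set of layer weights for each frame from the features themselves.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def static_weighted_average(layers, layer_logits):
    """layers: (L, T, D); layer_logits: (L,).
    One weight per layer, shared by all frames and all utterances."""
    w = softmax(layer_logits, axis=0)             # (L,)
    return np.tensordot(w, layers, axes=(0, 0))   # (T, D)

def dynamic_weighted_average(layers, score_vector):
    """Hypothetical dynamic variant: layer weights are computed per frame
    from the features (score_vector is an assumed learned parameter)."""
    scores = layers @ score_vector                # (L, T)
    w = softmax(scores, axis=0)                   # normalize over layers, per frame
    return (w[..., None] * layers).sum(axis=0)    # (T, D)

rng = np.random.default_rng(0)
layers = rng.standard_normal((13, 50, 8))  # toy sizes: 13 layers, 50 frames, 8 dims
static_out = static_weighted_average(layers, np.zeros(13))
dynamic_out = dynamic_weighted_average(layers, rng.standard_normal(8))
```

With all-zero logits the static variant reduces to a plain mean over layers, which is the simplest instance of the baseline that the findings compare against.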

Abstract

Recent speaker verification studies have achieved notable success by leveraging layer-wise output from pre-trained Transformer models. However, few have explored the advancements in aggregating these multi-level features beyond the static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives time-dynamically, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from pre-trained model output. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing the training time. We…
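
The pipeline described in the abstract (layer aggregation with max pooling, then attentive statistical pooling over time) can be sketched roughly as follows. This is one possible reading, not the authors' formulation: treating "multiple perspectives" as attention heads that each own a channel slice is an assumption, the parameters `head_proj` and `attn_vector` are hypothetical, and the temporal pooling uses a single attention vector for brevity.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_max_pool(layers, head_proj):
    """layers: (L, T, D); head_proj: (D, H), with D divisible by H.
    Assumed design: each head scores every layer at every frame, the scores
    gate that head's channel slice, and layers are merged by max, not sum."""
    L, T, D = layers.shape
    H = head_proj.shape[1]
    scores = layers @ head_proj                 # (L, T, H)
    w = softmax(scores, axis=0)                 # compare layers per frame and head
    chunks = layers.reshape(L, T, H, D // H)    # each head owns D/H channels
    weighted = w[..., None] * chunks            # (L, T, H, D/H)
    return weighted.max(axis=0).reshape(T, D)   # max over the layer axis

def attentive_stats_pool(frames, attn_vector):
    """frames: (T, D) -> (2D,): attention-weighted mean and std over time."""
    a = softmax(frames @ attn_vector, axis=0)   # (T,) frame weights
    mean = (a[:, None] * frames).sum(axis=0)
    var = (a[:, None] * (frames - mean) ** 2).sum(axis=0)
    std = np.sqrt(np.clip(var, 1e-8, None))     # clip to avoid sqrt of ~0
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
layers = rng.standard_normal((13, 50, 8))       # toy sizes: 13 layers, 50 frames, 8 dims
pooled = layer_max_pool(layers, rng.standard_normal((8, 4)))
embedding = attentive_stats_pool(pooled, rng.standard_normal(8))
```

In a real backend the projection and attention parameters would be learned, and a final linear layer would typically map the pooled statistics to the speaker embedding; both are omitted here.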

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing