Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Thomas Thebaud; Yuzhe Wang; Laureano Moro-Velazquez; Jesus Villalba-Lopez; Najim Dehak

arXiv:2603.10827 · cs.SD · March 12, 2026

Open Access

TL;DR

This paper evaluates speech-aware large language models on speaker verification and finds that they discriminate speakers poorly (EERs above 20% on VoxCeleb1). It then proposes a lightweight augmentation that brings EER down to 1.03% on VoxCeleb1-E while keeping the natural language interface.

Contribution

The paper introduces a model-agnostic scoring protocol for speaker verification and a lightweight augmentation method that adds speaker verification capability to an LLM.

Findings

01 Weak speaker discrimination in current speech-aware LLMs (EER > 20%)

02 Proposed augmentation improves EER to 1.03% on VoxCeleb1-E

03 Method preserves natural language interface while adding speaker verification
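
The equal error rate (EER) quoted in these findings is the operating point at which the false-acceptance rate and the false-rejection rate are equal. As a rough illustration of how such a figure can be obtained from a list of trial scores, here is a minimal Python sketch. The function name `equal_error_rate` and the simple threshold sweep are assumptions made for illustration; this is not the authors' evaluation code, and standard toolkits interpolate the ROC curve more carefully.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by using every observed score as a threshold.

    target_scores: scores for same-speaker trials (higher = more similar).
    nontarget_scores: scores for different-speaker trials.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = None
    eer = None
    for t in thresholds:
        # False acceptance: a different-speaker trial scored at or above t.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a same-speaker trial scored below t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer
```

With hypothetical toy scores such as `equal_error_rate([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.75, 0.4])`, one target and one non-target trial fall on the wrong side of the best threshold, so the function returns 0.25 (25% EER).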

Abstract

Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with automatic speaker verification (ASV) capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a…
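
The abstract mentions turning Yes/No token probabilities into a continuous verification score. One plausible reading, sketched below in Python, takes the log-probabilities the model assigns to "Yes" and "No" as the first answer token and uses their difference as a log-likelihood ratio. The function names `yes_no_llr` and `yes_no_confidence` are hypothetical, and the abstract does not give the exact formulation, so this should be read as an assumption and not as the paper's protocol.

```python
import math


def yes_no_llr(logprob_yes, logprob_no):
    """Log-likelihood ratio between the 'Yes' and 'No' answer tokens.

    Inputs are natural-log probabilities, assumed to come from the model's
    next-token distribution after a prompt such as
    'Are these two recordings from the same speaker?' (hypothetical prompt).
    Positive values favour 'same speaker'.
    """
    return logprob_yes - logprob_no


def yes_no_confidence(logprob_yes, logprob_no):
    """Probability of 'Yes' renormalised over the two tokens only.

    This equals the logistic sigmoid of the LLR, computed in a
    numerically stable way for large positive or negative values.
    """
    llr = yes_no_llr(logprob_yes, logprob_no)
    if llr >= 0:
        return 1.0 / (1.0 + math.exp(-llr))
    e = math.exp(llr)
    return e / (1.0 + e)
```

For example, if a model assigned probability 0.8 to "Yes" and 0.2 to "No", `yes_no_llr(math.log(0.8), math.log(0.2))` gives log 4 (about 1.386), and `yes_no_confidence` recovers 0.8. Either value can serve as the per-trial score when sweeping thresholds to estimate an EER.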

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques