Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding

Mingyue Huo; Wei-Cheng Tseng; Yiwen Shao; Hao Zhang; Dong Yu

arXiv:2511.15145·eess.AS·November 20, 2025


TL;DR

Auden-Voice is a versatile voice encoder designed to capture both speaker identity and paralinguistic cues, enhancing speech and language understanding by balancing multiple voice aspects through multi-task training.

Contribution

This work introduces Auden-Voice, a novel general-purpose voice encoder that effectively balances identity and paralinguistic features, outperforming existing methods in comprehensive evaluations.

Findings

01 Multi-task training yields balanced voice representations.

02 Contrastive language-audio pretraining improves retrieval but not paralinguistic understanding.

03 Auden-Voice performs well when integrated with large language models.

Abstract

Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.
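
The abstract contrasts two training strategies, multi-task training and contrastive language-audio pretraining (CLAP). As a rough illustration only, the sketch below shows the generic form each objective commonly takes: a weighted sum of per-task losses, and a symmetric contrastive (InfoNCE-style) loss over paired audio and text embeddings. The abstract does not specify the paper's actual losses, task set, weights, or temperature, so every name here (`multitask_loss`, `symmetric_contrastive_loss`, the task keys `speaker` and `emotion`, the default temperature of 0.07) is a hypothetical choice and not taken from the Auden toolkit.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic CLAP-style loss: audio_emb[i] and text_emb[i] form a matched pair.

    Assumption: a symmetric InfoNCE objective with a fixed temperature;
    the paper's exact formulation may differ.
    """
    audio = [_normalize(v) for v in audio_emb]
    text = [_normalize(v) for v in text_emb]
    # logits[i][j] = cosine similarity of audio i and text j, scaled by temperature
    logits = [
        [sum(a * t for a, t in zip(a_vec, t_vec)) / temperature for t_vec in text]
        for a_vec in audio
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    audio_to_text = _diagonal_cross_entropy(logits)
    text_to_audio = _diagonal_cross_entropy(logits_transposed)
    return 0.5 * (audio_to_text + text_to_audio)


def multitask_loss(task_losses, weights):
    """Weighted sum of per-task losses (task names and weights are hypothetical)."""
    return sum(weights[name] * loss for name, loss in task_losses.items())


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", symmetric_contrastive_loss(audio, matched_text))
    print("swapped pairs:", symmetric_contrastive_loss(audio, swapped_text))
    print("multi-task:", multitask_loss({"speaker": 1.0, "emotion": 2.0},
                                        {"speaker": 0.5, "emotion": 0.25}))
```

In this toy setup the contrastive loss is near zero when each audio embedding lines up with its own caption and large when the captions are swapped, which is the retrieval-oriented behavior the abstract attributes to CLAP. The multi-task sum, by contrast, lets separate supervised heads (for example identity and paralinguistic labels) shape one shared encoder.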

Code & Models

Models

🤗 AudenAI/auden-encoder-voice
model · 2 downloads · ♡ 2

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research