Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding
Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu

TL;DR
Auden-Voice is a versatile voice encoder designed to capture both speaker identity and paralinguistic cues, enhancing speech and language understanding by balancing multiple voice aspects through multi-task training.
Contribution
This work introduces Auden-Voice, a general-purpose voice encoder that balances speaker-identity and paralinguistic features. A comprehensive evaluation of training strategies shows that multi-task training gives the most balanced representations.
Findings
Multi-task training yields balanced voice representations.
Contrastive language-audio pretraining improves retrieval but not paralinguistic understanding.
Auden-Voice performs well when integrated with large language models.
Abstract
Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.
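As an illustration of the contrastive language-audio pretraining (CLAP) mentioned above, here is a minimal sketch of the symmetric contrastive (InfoNCE-style) objective commonly used to align paired audio and text embeddings. It is not the authors' implementation. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration, and the abstract does not specify the actual loss or hyperparameters.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Hypothetical CLAP-style loss: audio_embs[i] is paired with text_embs[i].

    The temperature default is an assumed value, not one taken from the paper.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    # Similarity matrix: rows index audio clips, columns index captions.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


# Toy example: correctly paired embeddings should give a much lower loss
# than the same embeddings with the captions swapped.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
matched_loss = symmetric_contrastive_loss(audio_batch, matched_text)
swapped_loss = symmetric_contrastive_loss(audio_batch, swapped_text)
```

Because this objective rewards only matching each clip to its own caption within a batch, it directly optimizes retrieval. That is consistent with the finding that CLAP improves retrieval without necessarily improving paralinguistic understanding.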
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
