TL;DR
DASB introduces a comprehensive benchmarking framework for discrete audio tokens across multiple domains, revealing their current limitations and guiding future improvements.
Contribution
The paper presents DASB, a standardized benchmark for evaluating discrete audio tokens, addressing inconsistencies and providing insights into their robustness and performance.
Findings
Discrete representations are less robust than continuous ones.
Semantic tokens outperform acoustic tokens but still lag behind continuous features.
Discrete tokens need careful tuning of factors such as model architecture, data size, and learning rate to perform well.
Abstract
Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and…
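To make "discrete audio tokens" concrete, here is a minimal toy sketch of the general idea: each continuous feature frame is replaced by the index of its nearest codebook vector, and some information is lost in the process. This is not DASB's code or any specific tokenizer from the paper. The frame dimension, the codebook size, the random features, and the helper names (`quantize`, `reconstruct`, `mse`) are all assumptions chosen for illustration, and a real tokenizer would learn its codebook (for example with k-means) rather than sampling it.

```python
import random

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def reconstruct(tokens, codebook):
    """Look up the codebook vector for each token (the 'dequantized' frames)."""
    return [codebook[t] for t in tokens]

def mse(a, b):
    """Mean squared error between two equally shaped lists of frames."""
    n = sum(len(x) for x in a)
    return sum((p - q) ** 2 for x, y in zip(a, b) for p, q in zip(x, y)) / n

random.seed(0)
DIM = 4          # hypothetical feature dimension
NUM_FRAMES = 200 # hypothetical number of frames
NUM_CODES = 16   # hypothetical codebook size

# Stand-in for continuous features; real features would come from an audio encoder.
frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]

# Toy codebook: a random subset of the frames (assumption; normally learned).
codebook = random.sample(frames, NUM_CODES)

tokens = quantize(frames, codebook)   # one integer per frame
recon = reconstruct(tokens, codebook) # lossy approximation of the frames
error = mse(frames, recon)            # nonzero: quantization discards detail

print(tokens[:10])
print(f"reconstruction MSE: {error:.3f}")
```

The nonzero reconstruction error is the point of the sketch: a downstream model that sees only `tokens` has less information than one that sees `frames`, which is one intuition for why the choice of tokenizer and its configuration matters.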
Code & Models
- 🤗 speechbrain/hifigan-hubert-l1-3-7-12-18-23-k1000-LibriTTS · model · 70 dl
- 🤗 speechbrain/hifigan-wavlm-l1-3-7-12-18-23-k1000-LibriTTS · model · 7 dl
- 🤗 speechbrain/hifigan-wav2vec-l1-3-7-12-18-23-k1000-LibriTTS · model · 5 dl · ♡ 1
- 🤗 speechbrain/hifigan-wavlm-l1-3-7-12-18-23-continuous-LibriTTS · model · 9 dl
- 🤗 speechbrain/hifigan-wavlm-k1000-LibriTTS · model · 44 dl · ♡ 2
- 🤗 speechbrain/hifigan-wav2vec2-k1000-LibriTTS · model · 2 dl
- 🤗 speechbrain/hifigan-hubert-k1000-LibriTTS · model · 11 dl · ♡ 1
