How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

TL;DR
This paper investigates how best to extract discrete audio tokens from self-supervised models. It proposes a scalable way to train a universal vocoder across multiple SSL layers and uses an attention mechanism to identify the layers that matter most for each task.
Contribution
It introduces a scalable training approach for a universal vocoder and employs attention to identify task-specific SSL layers, enhancing semantic token extraction.
Findings
Universal vocoder enables consistent tokenization across SSL layers
Attention mechanism improves task-specific layer selection
Enhanced performance in diverse audio processing tasks
Abstract
Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and Neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers,…
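The two ideas named in the abstract, quantizing SSL features into tokens and weighting layers with attention, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`quantize`, `softmax`, `fuse_layers`), the hand-picked centroids, and the use of one fixed scalar score per layer (standing in for learned attention parameters) are all assumptions made for illustration, and plain Python lists stand in for real feature tensors.

```python
import math


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid
    (squared Euclidean distance), as in k-means token assignment."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sqdist(frame, centroids[k]))
        for frame in frames
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(token_embeddings, layer_scores):
    """Combine per-layer token embeddings with softmax weights over layers.

    token_embeddings[l][t] is the embedding of the token at frame t from
    layer l. layer_scores holds one hypothetical scalar score per layer.
    Returns the fused per-frame embeddings and the layer weights.
    """
    weights = softmax(layer_scores)
    n_frames = len(token_embeddings[0])
    dim = len(token_embeddings[0][0])
    fused = [
        [
            sum(w * token_embeddings[l][t][d] for l, w in enumerate(weights))
            for d in range(dim)
        ]
        for t in range(n_frames)
    ]
    return fused, weights


# Toy data: three 2-D frames and a three-entry codebook (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.1, 0.8]]
tokens = quantize(frames, centroids)  # tokens == [0, 1, 2]

# Two layers, one frame each, with equal scores: the result is a plain average.
fused, weights = fuse_layers([[[1.0, 0.0]], [[0.0, 1.0]]], [0.0, 0.0])
```

In this simplified view, a larger weight on a layer after training would suggest that layer carries more of the information a given task needs, which is the sense in which attention can be used for layer selection.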
Code & Models
- 🤗 speechbrain/hifigan-hubert-l1-3-7-12-18-23-k1000-LibriTTS (model · 125 downloads)
- 🤗 speechbrain/hifigan-wavlm-l1-3-7-12-18-23-k1000-LibriTTS (model · 2 downloads)
- 🤗 speechbrain/hifigan-wav2vec-l1-3-7-12-18-23-k1000-LibriTTS (model · 2 downloads · 1 like)
- 🤗 speechbrain/hifigan-wavlm-l1-3-7-12-18-23-continuous-LibriTTS (model · 4 downloads)
- 🤗 speechbrain/hifigan-wavlm-k1000-LibriTTS (model · 25 downloads · 2 likes)
- 🤗 speechbrain/hifigan-wav2vec2-k1000-LibriTTS (model · 7 downloads)
- 🤗 speechbrain/hifigan-hubert-k1000-LibriTTS (model · 18 downloads · 1 like)
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
