How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Pooneh Mousavi; Jarod Duret; Salah Zaiem; Luca Della Libera; Artem Ploujnikov; Cem Subakan; Mirco Ravanelli

arXiv:2406.10735 · cs.SD · June 18, 2024

TL;DR

This paper investigates how best to extract discrete audio tokens from self-supervised learning (SSL) models. It proposes a scalable way to train a universal vocoder across multiple SSL layers and uses an attention mechanism to identify the layers that matter most for each task.

Contribution

The paper introduces a scalable approach for training a universal vocoder across multiple SSL layers, and uses an attention mechanism to identify the SSL layers most influential for each task, with the aim of improving semantic token extraction.
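As a rough illustration of attention-based layer selection, the sketch below normalizes one score per SSL layer with a softmax and uses the resulting weights to fuse per-layer embeddings. It is not the authors' implementation: the function names, the toy embeddings, and the hand-set scores are all assumptions, and in a real system the scores would be learned jointly with the downstream task.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_embeddings, layer_scores):
    """Combine per-layer embeddings into one vector using attention weights.

    layer_embeddings: list of L vectors (one per SSL layer), each of length D.
    layer_scores: list of L unnormalized scores (assumed learned in practice;
                  fixed by hand here for illustration).
    Returns (fused_vector, weights).
    """
    weights = softmax(layer_scores)
    dim = len(layer_embeddings[0])
    fused = [
        sum(w * emb[d] for w, emb in zip(weights, layer_embeddings))
        for d in range(dim)
    ]
    return fused, weights

# Hypothetical embeddings of one audio frame taken from three SSL layers.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 2.0]  # assumed values; the third layer gets the most weight

fused, weights = fuse_layers(embeddings, scores)

# The layer with the largest attention weight is the most influential one.
most_influential = max(range(len(weights)), key=weights.__getitem__)
```

Inspecting `weights` after training is what would let one read off which layers a given task relies on; here `most_influential` simply points at the layer with the highest hand-set score.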

Findings

01. Universal vocoder enables consistent tokenization across SSL layers

02. Attention mechanism improves task-specific layer selection

03. Enhanced performance in diverse audio processing tasks

Abstract

Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and Neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers,…
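To make "quantization of SSL models" concrete, here is a minimal sketch that maps continuous frame-level features to discrete token ids by nearest-centroid lookup in a codebook. The abstract as shown does not name the quantizer, so the k-means-style codebook, the function names, and the toy values are assumptions for illustration only.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))

def tokenize(frames, centroids):
    """Map each continuous feature frame to a discrete token id."""
    return [nearest_centroid(frame, centroids) for frame in frames]

# Hypothetical codebook with three 2-D centroids. In practice the centroids
# would be fit (e.g. with k-means) on features from one layer of an SSL model.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Hypothetical frame-level features for a short utterance (4 frames, 2 dims).
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9], [0.8, 0.9]]

tokens = tokenize(features, codebook)
print(tokens)  # → [0, 1, 2, 1]
```

Repeating this per layer, with one codebook per layer, would yield one token stream per SSL layer, which is the kind of multi-layer input a universal vocoder or an attention-based layer selector would consume.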

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies