DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models

Heng-Jui Chang; Hongyu Gong; Changhan Wang; James Glass; Yu-An Chung

arXiv:2410.24177·eess.AS·November 1, 2024

Open Access · 3 Reviews

TL;DR

DC-Spin is a speech tokenization method that produces speaker-invariant, phonetically rich tokens, improving zero-shot spoken language model tasks and enabling streamable processing without retraining.

Contribution

The paper proposes DC-Spin, a clustering-based speech tokenizer that improves speaker invariance and supports streaming for spoken language models. It also compares tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies.

Findings

01

Tokens are robust and speaker-invariant.

02

Streamable chunk-wise approach works without retraining.

03

Tokens align well with phonemes and support strong language modeling.

Abstract

Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach to enable streamable DC-Spin without retraining and degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or aligned with phonemes offer strong performance, providing insights for designing speech…
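The abstract's point that tokens "easily modeled by an n-gram LM" tend to perform well can be illustrated with a small proxy computation. The sketch below is not the authors' evaluation code: it assumes token sequences are plain lists of integer IDs, uses a bigram model with add-one smoothing as a stand-in for whatever n-gram setup the paper uses, and the function name `bigram_perplexity` is hypothetical.

```python
import math
from collections import Counter


def bigram_perplexity(train, test, vocab_size):
    """Perplexity of an add-one-smoothed bigram LM on held-out token sequences.

    train, test: lists of token-ID sequences (lists of ints).
    vocab_size: number of distinct token IDs the tokenizer can emit.
    Illustrative proxy only; the smoothing and n-gram order are assumptions.
    """
    context_counts = Counter()  # how often each token appears as a left context
    bigram_counts = Counter()   # how often each (previous, current) pair appears
    for seq in train:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1

    log_prob = 0.0
    n_transitions = 0
    for seq in test:
        for prev, cur in zip(seq, seq[1:]):
            # Add-one smoothing keeps unseen bigrams at non-zero probability.
            p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
            log_prob += math.log(p)
            n_transitions += 1

    if n_transitions == 0:
        raise ValueError("test set contains no token transitions")
    return math.exp(-log_prob / n_transitions)
```

Under this proxy, a lower perplexity on held-out sequences means the token stream is more predictable from its immediate context. Comparing tokenizers this way is only meaningful when they share a frame rate and vocabulary size, or when the comparison is normalized accordingly.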

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 5

Strengths

1. Introduces the Double-Codebook Spin to better capture fine-grained phonetic units, and details the process for selecting the codebook size.
2. The experimental setup is thorough, covering both zero-shot and supervised tasks, with a clear evaluation of robustness and inference efficiency.
3. The work analyzes multiple proxy tasks on speech tokenizers to reveal their relationship with the performance of the spoken language model.

Weaknesses

1. Although this work states three contributions, they are somewhat incremental. For example, DC-Spin does not show convincing advantages over the Spin method in Table 2, especially considering the much larger codebook size of DC-Spin. The chunk-wise streaming simulation has been used in many similar works in areas such as ASR and TTS. The authors also list it among the contributions but mention it only in a short paragraph in the experiment section.
2. The compared neural audio co

Reviewer 02 · Rating 8 · Confidence 2

Strengths

* **Originality:** The paper’s contribution of SpinHuBERT and DC-Spin tokenizers demonstrates originality.
* **Quality:** The evaluation is comprehensive, evaluating the proposed tokenizers across multiple downstream tasks and metrics to show their effectiveness.
* **Clarity:** The presentation is clear.
* **Significance:** Disentangling factors within discrete speech tokens, as explored in DC-Spin, addresses a growing area of interest (e.g., FACodec). DC-Spin seems to stand out as one of the f

Weaknesses

**Comparison in Speech Resynthesis:** While the results show that DC-Spin performs well at low bitrates (<1.5 kbps), it is worth noting that current state-of-the-art speech synthesis systems (e.g., NaturalSpeech2, VoiceCraft, MaskGCT) use codecs with higher bitrates to achieve high-quality speech reconstruction. A comparison of DC-Spin against existing tokenizers at these higher bitrates would provide valuable insight into its performance for speech resynthesis under more typical conditions.
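For context on the bitrate figures this review refers to, the nominal bitrate of a discrete tokenizer is commonly estimated as the frame rate times the bits needed to index each codebook. The sketch below shows that arithmetic. The function name `token_bitrate_bps` and the example configurations are hypothetical and are not taken from the paper or from any of the systems named above.

```python
import math


def token_bitrate_bps(frame_rate_hz, codebook_sizes):
    """Nominal bitrate in bits per second of a discrete tokenizer.

    Assumes each frame emits one index per codebook and that indices are
    stored with log2(size) bits each (no entropy coding).
    """
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame


# Hypothetical single-codebook tokenizer: 50 frames/s, 512 entries.
low = token_bitrate_bps(50, [512])        # 50 * 9 bits = 450 bits/s

# Hypothetical multi-codebook codec: 75 frames/s, 8 codebooks of 1024 entries.
high = token_bitrate_bps(75, [1024] * 8)  # 75 * 80 bits = 6000 bits/s
```

Under these assumptions, a single-codebook tokenizer sits well below 1.5 kbps, while multi-codebook codecs reach several kbps. That gap is what a higher-bitrate comparison would need to bridge.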

Reviewer 03 · Rating 1 · Confidence 4

Strengths

This paper proposes a method called DC-Spin to improve speech tokenization. The speech tokens generated by the proposed method are speaker-invariant and preserve more content information than open-source tokenizers.

Weaknesses

The motivation of this paper does not sound convincing. A common way of connecting a speech encoder to an LM is directly through the speech representation, which is continuous-valued. Discretizing the representation into tokens will only degrade the quality of the speech input. Check out the papers below:
https://arxiv.org/pdf/2402.08846
https://arxiv.org/abs/2309.00169
https://www.isca-archive.org/interspeech_2024/yang24f_interspeech.pdf
The discretized tokens will eventually be co

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need