Speech Tokenizer is Key to Consistent Representation
Wonjin Jung, Sungil Kang, Dong-Yeon Cho

TL;DR
This paper presents a novel speech tokenizer that encodes both linguistic and acoustic features, significantly improving speech representation fidelity across various applications without extra training.
Contribution
The paper introduces a speech tokenizer that captures both semantic and acoustic information, addressing a limitation of previous RVQ-based methods, which often neglect acoustic features.
Findings
Enhanced speech coding quality
Improved emotion recognition accuracy
Versatile application across speech tasks
Abstract
Speech tokenization is crucial in digital speech processing, converting continuous speech signals into discrete units for various computational tasks. This paper introduces a novel speech tokenizer with broad applicability across downstream tasks. While recent advances in residual vector quantization (RVQ) have incorporated semantic elements, they often neglect critical acoustic features. We propose an advanced approach that simultaneously encodes both linguistic and acoustic information, preserving prosodic and emotional content. Our method significantly enhances speech representation fidelity across diverse applications. Empirical evaluations demonstrate its effectiveness in speech coding, voice conversion, emotion recognition, and multimodal language modeling, without requiring additional training. This versatility underscores its potential as a key tool for advancing AI-driven…
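For readers unfamiliar with the residual vector quantization (RVQ) that the abstract refers to, the following is a minimal sketch of generic RVQ encoding and decoding. It is not the tokenizer proposed in the paper. The function names (`nearest`, `rvq_encode`, `rvq_decode`) and the toy codebooks are assumptions made for illustration; in a real codec the codebooks are learned, and the input vectors are encoder outputs for individual speech frames.

```python
def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage quantizes what the previous stages left over."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen codeword so the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy, hand-written codebooks (hypothetical): a coarse stage and a finer stage.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.5]],
]
toy_tokens = rvq_encode([1.4, 1.6], toy_codebooks)
toy_recon = rvq_decode(toy_tokens, toy_codebooks)
```

Each frame becomes one discrete token per stage. Stacking more stages refines the reconstruction, but nothing in this generic scheme decides which kind of information (linguistic or acoustic) each stage carries, which is the design question the abstract raises.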
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing
