Speech Tokenizer is Key to Consistent Representation

Wonjin Jung; Sungil Kang; Dong-Yeon Cho

arXiv:2507.06802·cs.LG·July 10, 2025

TL;DR

This paper presents a speech tokenizer that encodes both linguistic and acoustic features. It improves speech representation fidelity in speech coding, voice conversion, emotion recognition, and multimodal language modeling, without requiring additional training.

Contribution

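The paper introduces a speech tokenizer that captures both semantic and acoustic information. This addresses a limitation of recent residual vector quantization (RVQ) methods, which incorporate semantic elements but often neglect critical acoustic features.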

Findings

01. Enhanced speech coding quality
02. Improved emotion recognition accuracy
03. Versatile application across speech tasks

Abstract

Speech tokenization is crucial in digital speech processing, converting continuous speech signals into discrete units for various computational tasks. This paper introduces a novel speech tokenizer with broad applicability across downstream tasks. While recent advances in residual vector quantization (RVQ) have incorporated semantic elements, they often neglect critical acoustic features. We propose an advanced approach that simultaneously encodes both linguistic and acoustic information, preserving prosodic and emotional content. Our method significantly enhances speech representation fidelity across diverse applications. Empirical evaluations demonstrate its effectiveness in speech coding, voice conversion, emotion recognition, and multimodal language modeling, without requiring additional training. This versatility underscores its potential as a key tool for advancing AI-driven…
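
As background for the RVQ methods the abstract refers to, the following is a minimal sketch of generic residual vector quantization, not the tokenizer proposed in the paper. It assumes randomly initialized codebooks rather than learned ones, and the names `rvq_encode`, `rvq_decode`, `DIM`, `CODEBOOK_SIZE`, and `NUM_STAGES` are hypothetical, chosen only for illustration. Each stage quantizes whatever the previous stages left unexplained, so one continuous frame becomes a short sequence of discrete token indices.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean)."""
    def sqdist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Encode one frame as a list of indices, one per quantizer stage."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # The next stage only sees what this stage failed to capture.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a frame by summing the selected codeword of each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Assumption: random codebooks stand in for learned ones, with smaller
# codewords at later stages to mimic progressively finer residuals.
random.seed(0)
DIM, CODEBOOK_SIZE, NUM_STAGES = 4, 8, 3
codebooks = [
    [[random.gauss(0.0, 0.5 ** stage) for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for stage in range(NUM_STAGES)
]

frame = [random.gauss(0.0, 1.0) for _ in range(DIM)]  # stand-in feature frame
tokens = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
print(tokens, reconstruction)
```

In a trained system the codebooks are learned jointly with an encoder and decoder; the sketch shows only the encode and decode arithmetic that turns a continuous vector into discrete units and back.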

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Speech and Audio Processing