Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means

Kentaro Onda; Hayato Futami; Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe

arXiv:2601.19781 · cs.SD · January 28, 2026

TL;DR

This paper introduces the Phonological Tokenizer, a method that fine-tunes phonetic tokens to retain linguistic and prosodic information while discarding speaker identity, improving speech representations for prosody-sensitive tasks.

Contribution

The paper proposes a multi-objective fine-tuning approach that uses differentiable k-means to enrich phonetic tokens with prosodic information while keeping them free of speaker identity.

Findings

01. Tokens retain phonological and prosodic information

02. Speaker identity is effectively discarded

03. Performance improves on prosody-sensitive tasks

Abstract

In recent years, there has been growing interest in representing speech with discrete tokens, which serve as pseudo-text for speech language models (speechLMs) and as efficient intermediate representations for downstream tasks. These tokens are typically categorized as acoustic and phonetic tokens: the former holds detailed acoustic information for reconstruction while the latter mainly captures linguistic content. In human speech communication, however, unnecessary acoustic details such as speaker information are abstracted, while both linguistic and prosodic information are utilized for speech comprehension and production. Given this, neither type of token seems an ideal representation for tasks sensitive to prosody, such as speechLMs. In this study, we propose the Phonological Tokenizer, a method that fine-tunes phonetic tokens via differentiable k-means with a multi-task objective…
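As a rough illustration of the general idea behind differentiable k-means, the sketch below replaces the hard nearest-centroid lookup with a softmax over negative squared distances, so that the quantized output becomes a smooth function of both the features and the centroids. This is a generic relaxation written in plain NumPy, not the authors' implementation: the function name `soft_kmeans_assign`, the `temperature` parameter, and the toy data are assumptions made for illustration, and the abstract as shown here does not specify the paper's actual objectives.

```python
import numpy as np

def soft_kmeans_assign(features, centroids, temperature=1.0):
    """Soft (relaxed) k-means assignment; hypothetical helper for illustration.

    features:  (T, D) array of frame-level feature vectors
    centroids: (K, D) array of cluster centers
    Returns (weights, hard_ids, soft_quantized).
    """
    # Squared Euclidean distance between every frame and every centroid: (T, K)
    diffs = features[:, None, :] - centroids[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=-1)

    # Softmax over negative distances; subtract the row max for numerical stability
    logits = -sq_dist / temperature
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Hard token ids (ordinary k-means assignment) and the soft-quantized features
    hard_ids = sq_dist.argmin(axis=1)
    soft_quantized = weights @ centroids
    return weights, hard_ids, soft_quantized

# Toy data (assumed): three 2-D frames and two centroids
features = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
weights, hard_ids, soft_quantized = soft_kmeans_assign(features, centroids)
print(hard_ids)
print(weights.round(3))
```

In an automatic-differentiation framework, the same computation would let gradients from downstream losses reach the centroids and the encoder producing `features`. Lowering `temperature` pushes the soft assignment toward the ordinary hard one.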

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Topic Modeling