ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

Kaizhi Qian; Xulin Fan; Junrui Ni; Slava Shechtman; Mark Hasegawa-Johnson; Chuang Gan; Yang Zhang

arXiv:2507.20091 · cs.CL · August 11, 2025


TL;DR

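ProsodyLM introduces a simple tokenization scheme, in which each utterance is transcribed into text followed by word-level prosody tokens, so that speech language models can learn prosody during pre-training. The resulting models show emerging prosody processing capabilities, such as capturing contrastive focus, understanding emotion and stress, and keeping prosody consistent over long contexts.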

Contribution

The paper proposes a new prosody-aware tokenization method that enhances speech language models' ability to learn prosody during pre-training.

Findings

1. ProsodyLM captures prosody nuances such as contrastive focus.
2. It understands emotion and stress in speech.
3. It maintains prosody consistency over long contexts.

Abstract

Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information -- we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody…
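The sequence layout described in the abstract (a text transcript followed by word-level prosody tokens) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the choice of features (mean pitch and duration per word), the uniform binning, the bin ranges, and the token names `<prosody>`, `<pitch_k>`, and `<dur_k>` are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One transcribed word with two assumed prosody measurements."""
    text: str
    mean_f0_hz: float   # hypothetical feature: average pitch over the word
    duration_s: float   # hypothetical feature: word duration in seconds


def quantize(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Uniformly bin `value` into an integer in [0, n_bins - 1], clamping at the edges."""
    if value <= lo:
        return 0
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)


def build_sequence(words: List[Word]) -> List[str]:
    """Lay out an utterance as: transcript first, then word-level prosody tokens.

    The separator and token names are illustrative assumptions only.
    """
    text_part = [w.text for w in words]
    prosody_part: List[str] = []
    for w in words:
        pitch_bin = quantize(w.mean_f0_hz, 50.0, 400.0, 8)  # assumed pitch range
        dur_bin = quantize(w.duration_s, 0.0, 1.0, 8)       # assumed duration range
        prosody_part += [f"<pitch_{pitch_bin}>", f"<dur_{dur_bin}>"]
    return text_part + ["<prosody>"] + prosody_part


if __name__ == "__main__":
    utterance = [Word("I", 120.0, 0.10), Word("did", 260.0, 0.45)]
    print(build_sequence(utterance))
```

In this layout, each word contributes a fixed number of prosody tokens in the same order as the transcript, so the model can, in principle, align the k-th group of prosody tokens with the k-th word. The real scheme may use different features, resolutions, and ordering.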


Code & Models

Models

🤗 auspicious3000/prosodylm
