A Variational Framework for Improving Naturalness in Generative Spoken Language Models

Li-Wei Chen; Takuya Higuchi; Zakaria Aldeneh; Ahmed Hussen Abdelaziz; Alexander Rudnicky

arXiv:2506.14767 · cs.CL · June 18, 2025

TL;DR

This paper introduces a variational framework that automatically encodes prosodic and paralinguistic speech attributes to improve the naturalness of generative spoken language models, reducing manual feature engineering.

Contribution

It presents an end-to-end variational method that learns to encode continuous speech attributes, enhancing naturalness without manual feature selection.

Findings

01. Generated speech is rated more natural by human evaluators.
02. The method reduces reliance on hand-engineered features.
03. Code and models are publicly available.

Abstract

The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need…
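As a rough illustration of what "variational" usually means in this setting, the sketch below shows the two standard ingredients of a generic Gaussian VAE objective: reparameterized sampling of a continuous latent, and the closed-form KL term against a standard normal prior. This is not the paper's objective or architecture, which this summary does not specify. The function names (`kl_diag_gaussian`, `reparameterize`, `negative_elbo`) and the `beta` weight are hypothetical and chosen only for the example.

```python
import math
import random


def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def negative_elbo(recon_nll, mu, log_var, beta=1.0):
    """Generic VAE loss: reconstruction NLL plus a weighted KL regularizer.

    recon_nll is assumed to be computed elsewhere (e.g. by a decoder);
    beta is a hypothetical weighting knob, not a value from the paper.
    """
    return recon_nll + beta * kl_diag_gaussian(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.2], [0.0, -1.0]
    z = reparameterize(mu, log_var, rng)
    print("latent sample:", z)
    print("loss:", negative_elbo(recon_nll=3.0, mu=mu, log_var=log_var))
```

In a speech language model of the kind the abstract describes, the latent `z` would play the role of the learned continuous attributes that accompany the semantic tokens; how the two are combined is left to the paper itself.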

Code & Models

Repositories

b04901014/vae-gslm (PyTorch, official)

Videos

A Variational Framework for Improving Naturalness in Generative Spoken Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Focus