Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling
Ju-Chieh Chou, Jiawei Zhou, Karen Livescu

TL;DR
Flow-SLM is a textless spoken language model that jointly models linguistic content and acoustic detail, so that generated speech carries better acoustic detail while linguistic performance stays comparable to existing models.
Contribution
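Flow-SLM is the first to jointly model semantic tokens and continuous acoustic features in textless spoken language modeling. It uses a flow-matching objective to predict the continuous acoustic vector conditioned on the semantic tokens, rather than leaving acoustic detail to a separate vocoder.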
Findings
Flow-SLM achieves linguistic likelihood comparable to existing models.
Its generated speech carries better acoustic detail.
Predicting multiple future semantic tokens helps preserve linguistic information.
Abstract
Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic…
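To make the flow-matching idea in the abstract concrete, here is a minimal NumPy sketch of a generic conditional flow-matching loss and an Euler sampler. It assumes the common straight-line path between a Gaussian noise sample and the target frame; the abstract does not state the paper's exact formulation, so this should not be read as the authors' implementation. All names (`flow_matching_loss`, `euler_sample`, `make_oracle_velocity`, `cond`) are hypothetical, and `cond` stands in for whatever embedding of the semantic tokens a real model would condition on.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Monte Carlo estimate of a conditional flow-matching loss.

    x1:   (batch, dim) target continuous acoustic frames
    cond: (batch, cond_dim) conditioning vectors (assumed: semantic-token embeddings)
    """
    x0 = rng.standard_normal(x1.shape)       # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # one time in [0, 1) per example
    xt = (1.0 - t) * x0 + t * x1             # point on the straight path x0 -> x1
    target = x1 - x0                         # velocity of that path
    pred = velocity_fn(xt, t, cond)
    return float(np.mean((pred - target) ** 2))

def euler_sample(velocity_fn, cond, dim, rng, steps=16):
    """Generate frames by integrating the velocity field from t=0 to t=1."""
    x = rng.standard_normal((cond.shape[0], dim))
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((cond.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x

def make_oracle_velocity(W):
    """Toy stand-in for a trained network (hypothetical).

    Assumes the target frame is exactly cond @ W, in which case the
    ideal velocity at (xt, t) is (x1 - xt) / (1 - t).
    """
    def velocity(xt, t, cond):
        return (cond @ W - xt) / (1.0 - t)
    return velocity

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
cond = rng.standard_normal((4, 3))
x1 = cond @ W
oracle = make_oracle_velocity(W)
print("loss with oracle velocity:", flow_matching_loss(oracle, x1, cond, rng))
print("max sampling error:", np.abs(euler_sample(oracle, cond, 5, rng) - x1).max())
```

In a real system the velocity function would be a neural network trained by minimizing this loss, and the toy oracle here only serves to check that the loss and the sampler are consistent with each other.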
