TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Liang-Hsuan Tseng; Hung-yi Lee

arXiv:2603.12350·cs.CL·March 16, 2026

Open Access

TL;DR

TASTE-S is a streamable extension of text-aligned speech tokenization and embedding (TASTE) for real-time spoken language modeling. It matches TASTE's performance while significantly reducing latency.

Contribution

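The paper introduces TASTE-S, a streamable speech tokenization method that integrates a CTC-based ASR module into the encoder and redesigns the unit decoder for on-the-fly decoding. This removes TASTE's dependence on an external ASR system and a non-causal decoder.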

Findings

01. TASTE-S matches TASTE's performance in speech-text alignment.

02. TASTE-S significantly reduces latency in speech processing.

03. TASTE-S is robust to transcription errors and supports long-form encoding.

Abstract

Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are much longer than text token sequences. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in length with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limit streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE's performance while significantly reducing latency. Further investigations…
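To make the CTC idea mentioned in the abstract concrete, here is a minimal sketch of frame-by-frame greedy CTC decoding in Python. It is an illustration of the general technique only, not the TASTE-S encoder. The class name `StreamingCTCGreedyDecoder`, the helper `decode_stream`, and the choice of index 0 for the blank symbol are all assumptions made for this example.

```python
from typing import Iterable, List, Optional

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


class StreamingCTCGreedyDecoder:
    """Greedy CTC decoding that consumes one frame of scores at a time."""

    def __init__(self, blank: int = BLANK) -> None:
        self.blank = blank
        self.prev: Optional[int] = None  # best label of the previous frame

    def push(self, frame_scores: List[float]) -> Optional[int]:
        """Return a newly emitted token id, or None if this frame emits nothing."""
        # Pick the highest-scoring label for this frame.
        best = max(range(len(frame_scores)), key=frame_scores.__getitem__)
        emitted = None
        # CTC collapse rule: skip blanks and skip repeats of the previous label.
        if best != self.blank and best != self.prev:
            emitted = best
        self.prev = best
        return emitted


def decode_stream(frames: Iterable[List[float]]) -> List[int]:
    """Feed frames one by one and collect the emitted token ids."""
    decoder = StreamingCTCGreedyDecoder()
    tokens: List[int] = []
    for scores in frames:
        token = decoder.push(scores)
        if token is not None:
            tokens.append(token)
    return tokens
```

Because `push` depends only on the current frame and the previous best label, a token can be emitted as soon as its first frame arrives, which is the property that makes this style of decoding usable in a streaming setting. A blank between two identical labels separates them, so repeated tokens are still recoverable.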

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing