Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM

Jiatong Shi; Chunlei Zhang; Jinchuan Tian; Junrui Ni; Hao Zhang; Shinji Watanabe; Dong Yu

arXiv:2502.16897 · eess.AS · December 1, 2025

TL;DR

This paper introduces a continual pre-training framework that adapts a textual LLM to codec-discretized speech, yielding a single model for both speech understanding and generation, with strong results across tasks and end-to-end speech-to-speech translation.

Contribution

The paper presents a continual pre-training (CPT) approach for codec-based speech LLMs that unifies understanding and generation, including the first end-to-end speech-to-speech (S2S) translation system that uses only neural codec tokens.

Findings

01 Effective cross-modal alignment with CPT

02 Strong performance across ASR, TTS, and translation tasks

03 First end-to-end neural codec token S2S translation system

Abstract

Recent advances in speech large language models (LLMs) have extended textual LLMs to the speech domain, but balancing speech understanding and generation remains challenging, especially with codec-based representations. We propose a continual pre-training (CPT) framework that adapts a textual LLM to handle codec-discretized speech, mitigating modality mismatch and preserving linguistic reasoning. Our unified model supports both understanding and generation, achieving strong results across automatic speech recognition (ASR), text-to-speech (TTS), speech-to-text translation (S2T-Trans), and speech-to-speech translation (S2S-Trans). Notably, we present the first end-to-end, single-pass S2S-Trans system using only neural codec tokens, without intermediate transcriptions, translations, or semantic tokens. CPT proves essential for cross-modal alignment and task generalization, making it a powerful tool for building robust, unified speech LLMs.
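
To make "codec-discretized speech" concrete, the sketch below shows one common way that multi-codebook codec frames could be mapped into an extended LLM vocabulary, so that speech tokens and text tokens share a single id space. This is an illustrative assumption, not the paper's method: this page does not describe the authors' tokenization or interleaving scheme, and the names `flatten_codec_tokens`, `unflatten_codec_tokens`, `text_vocab_size`, and `codebook_size` are hypothetical.

```python
from typing import List


def flatten_codec_tokens(
    frames: List[List[int]], text_vocab_size: int, codebook_size: int
) -> List[int]:
    """Map codec frames to ids placed after the text vocabulary.

    Assumption (not from the paper): each frame holds one code per
    codebook level, and every level gets its own contiguous id range.
    """
    ids: List[int] = []
    for frame in frames:
        for level, code in enumerate(frame):
            if not 0 <= code < codebook_size:
                raise ValueError(f"code {code} is outside the codebook")
            # Offset by the text vocabulary, then by the codebook level.
            ids.append(text_vocab_size + level * codebook_size + code)
    return ids


def unflatten_codec_tokens(
    ids: List[int], text_vocab_size: int, codebook_size: int, num_codebooks: int
) -> List[List[int]]:
    """Invert flatten_codec_tokens so a codec decoder could consume the frames."""
    if len(ids) % num_codebooks != 0:
        raise ValueError("id sequence does not contain whole frames")
    frames: List[List[int]] = []
    for start in range(0, len(ids), num_codebooks):
        frame = []
        for level, token_id in enumerate(ids[start:start + num_codebooks]):
            # Remove the same two offsets that flattening applied.
            frame.append(token_id - text_vocab_size - level * codebook_size)
        frames.append(frame)
    return frames


# Toy sizes chosen for readability; real vocabularies are far larger.
example_ids = flatten_codec_tokens(
    [[3, 7], [0, 1]], text_vocab_size=100, codebook_size=10
)
```

Under a mapping like this, a text-only LLM would need new embedding rows for the added ids, which is one reason a continual pre-training stage on speech data may help before task-specific training.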

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems