Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM
Jiatong Shi, Chunlei Zhang, Jinchuan Tian, Junrui Ni, Hao Zhang, Shinji Watanabe, Dong Yu

TL;DR
This paper introduces a continual pre-training (CPT) framework that adapts a textual LLM to handle codec-discretized speech, yielding a single model for both speech understanding and generation that performs strongly across tasks and supports end-to-end speech-to-speech translation.
Contribution
The paper presents a continual pre-training (CPT) approach for codec-based speech LLMs that unifies understanding and generation in one model, including the first end-to-end speech-to-speech (S2S) translation system built only on neural codec tokens.
Findings
Effective cross-modal alignment with CPT
Strong performance across ASR, TTS, and translation tasks
First end-to-end neural codec token S2S translation system
Abstract
Recent advances in speech large language models (speech LLMs) have extended textual LLMs to the speech domain, but balancing speech understanding and generation remains challenging, especially with codec-based representations. We propose a continual pre-training (CPT) framework that adapts a textual LLM to handle codec-discretized speech, mitigating modality mismatch and preserving linguistic reasoning. Our unified model supports both understanding and generation, achieving strong results across automatic speech recognition (ASR), text-to-speech (TTS), speech-to-text translation (S2T-Trans), and speech-to-speech translation (S2S-Trans). Notably, we present the first end-to-end, single-pass S2S-Trans system using only neural codec tokens, without intermediate transcriptions, translations, or semantic tokens. CPT proves essential for cross-modal alignment and task generalization, making it a powerful tool for building robust, unified speech LLMs.
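To make "adapting a textual LLM to codec-discretized speech" concrete, here is a minimal sketch of one common way to place codec tokens and text tokens in a single vocabulary so that one autoregressive model can read and emit both. It is an illustration under assumptions, not the authors' implementation: the vocabulary size, codebook size, number of codebooks, frame-wise flattening order, the marker tokens, and the names `codec_to_vocab_id`, `flatten_frames`, and `build_sequence` are all hypothetical.

```python
from typing import Dict, List

# All sizes below are assumptions chosen for illustration only.
TEXT_VOCAB_SIZE = 32000   # assumed size of the original text LLM vocabulary
CODEBOOK_SIZE = 1024      # assumed number of entries per codec codebook
NUM_CODEBOOKS = 2         # assumed number of codec codebooks per frame

# Hypothetical marker tokens, placed after the codec ID range.
SPECIAL_TOKENS = ["<asr>", "<tts>", "<s2t>", "<s2s>", "<speech>", "<text>"]
SPECIAL_BASE = TEXT_VOCAB_SIZE + NUM_CODEBOOKS * CODEBOOK_SIZE
SPECIAL_IDS: Dict[str, int] = {
    tok: SPECIAL_BASE + i for i, tok in enumerate(SPECIAL_TOKENS)
}


def codec_to_vocab_id(codebook: int, code: int) -> int:
    """Map a (codebook, code) pair to an ID in the extended vocabulary."""
    if not 0 <= codebook < NUM_CODEBOOKS:
        raise ValueError(f"codebook index out of range: {codebook}")
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"code out of range: {code}")
    return TEXT_VOCAB_SIZE + codebook * CODEBOOK_SIZE + code


def flatten_frames(frames: List[List[int]]) -> List[int]:
    """Flatten frames of per-codebook codes into one ID sequence, frame by frame."""
    ids: List[int] = []
    for frame in frames:
        for codebook, code in enumerate(frame):
            ids.append(codec_to_vocab_id(codebook, code))
    return ids


def build_sequence(task: str, speech_frames: List[List[int]],
                   text_ids: List[int]) -> List[int]:
    """Build a speech-in, text-out sequence (for example ASR) as one token stream."""
    return (
        [SPECIAL_IDS[task], SPECIAL_IDS["<speech>"]]
        + flatten_frames(speech_frames)
        + [SPECIAL_IDS["<text>"]]
        + list(text_ids)
    )
```

In a setup like this, continual pre-training would train the extended model on such mixed sequences so that the new codec embeddings align with the existing text representations. Generation tasks such as TTS or S2S translation would place codec IDs on the output side instead of text IDs.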
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
