Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling
Kazuki Yano, Jun Suzuki, Shinji Watanabe

TL;DR
This paper introduces Multimodal Depth Upscaling, a method that adapts text LLMs to speech by inserting new transformer layers into a frozen model and training only those layers on speech data. It reaches ASR performance comparable to full fine-tuning while degrading the original text capabilities far less.
Contribution
The paper presents a depth up-scaling approach that inserts new transformer layers into a frozen text LLM and trains only the inserted layers on speech data. It preserves text capabilities better than both full fine-tuning and LoRA.
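The bookkeeping behind this idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the `Layer` record, the `depth_upscale` helper, the evenly spaced insertion positions, and the parameter counts are all assumptions made for the example. In a real framework, "frozen" would correspond to disabling gradient updates for the original layers.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for one transformer layer."""
    name: str
    num_params: int
    trainable: bool


def depth_upscale(base_layers, num_new, params_per_new_layer):
    """Freeze every base layer and interleave `num_new` trainable layers.

    Assumption: new layers are placed at evenly spaced depths. The paper
    summary does not state the insertion positions, so this is illustrative.
    """
    if not 0 < num_new <= len(base_layers):
        raise ValueError("num_new must be between 1 and len(base_layers)")

    # Copy the base layers as frozen so the caller's list is not mutated.
    frozen = [Layer(l.name, l.num_params, False) for l in base_layers]
    stride = len(frozen) // num_new

    upscaled, added = [], 0
    for depth, layer in enumerate(frozen, start=1):
        upscaled.append(layer)
        # After every `stride` base layers, insert one new trainable layer.
        if depth % stride == 0 and added < num_new:
            upscaled.append(Layer(f"new_{added}", params_per_new_layer, True))
            added += 1
    return upscaled


def trainable_fraction(layers):
    """Share of all parameters that would receive updates during training."""
    total = sum(l.num_params for l in layers)
    trainable = sum(l.num_params for l in layers if l.trainable)
    return trainable / total


# Toy example: 8 base layers of 100 parameters each, 2 inserted layers.
base = [Layer(f"base_{i}", 100, True) for i in range(8)]
model = depth_upscale(base, num_new=2, params_per_new_layer=100)
names = [l.name for l in model]
fraction = trainable_fraction(model)
```

In this toy setup the two inserted layers land after the fourth and eighth base layers, and only their parameters count as trainable, which is why the approach trains fewer parameters than updating the whole model.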
Findings
Depth up-scaling achieves ASR performance comparable to full fine-tuning.
Using E-Branchformer, an architecture designed for speech recognition, as the inserted layers improves ASR results and reduces text degradation.
The method requires fewer trainable parameters while maintaining high speech recognition accuracy.
Abstract
Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the…
