Efficient Scaling for LLM-based ASR

Bingshen Mu; Yiwen Shao; Kun Wei; Dong Yu; Lei Xie

arXiv:2508.04096 · cs.SD · August 7, 2025

TL;DR

This paper introduces EFIN, a multi-stage training strategy for LLM-based ASR that improves efficiency and performance by pretraining the speech encoder before LLM integration, supported by a derived scaling law.

Contribution

The paper proposes EFIN, a novel multi-stage training method for LLM-ASR, and derives a scaling law to guide efficient model scaling.

Findings

01

EFIN outperforms joint post-training, achieving a 21.1% relative character error rate reduction (CERR).

02

EFIN roughly halves the computation budget (49.9% FLOPs).

03

A scaling law approximates ASR error as a function of computation.
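
The relative reduction quoted in the first finding can be sketched as a small calculation. This assumes CERR is the usual relative character error rate reduction, 100 × (baseline − new) / baseline; the function name and the CER values below are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Hypothetical CER values (in percent), chosen only to illustrate the arithmetic.
cerr = relative_error_reduction(baseline_cer=10.0, new_cer=8.0)
print(f"{cerr:.1f}% relative reduction")
```

Under this definition, a 21.1% CERR means the new system makes about one fifth fewer character errors than the baseline, whatever the absolute CER of the baseline is.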

Abstract

Large language model (LLM)-based automatic speech recognition (ASR) achieves strong performance but often incurs high computational costs. This work investigates how to obtain the best LLM-ASR performance efficiently. Through comprehensive and controlled experiments, we find that pretraining the speech encoder before integrating it with the LLM leads to significantly better scaling efficiency than the standard practice of joint post-training of LLM-ASR. Based on this insight, we propose a new multi-stage LLM-ASR training strategy, EFIN: Encoder First Integration. Among all training strategies evaluated, EFIN consistently delivers better performance (21.1% relative CERR) with significantly lower computation budgets (49.9% FLOPs). Furthermore, we derive a scaling law that approximates ASR error rates as a function of computation, providing practical guidance for LLM-ASR scaling.
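
The abstract does not state the functional form of the scaling law. As a minimal sketch of how such a law might be fitted, the example below assumes a plain power law, error ≈ a · C^(−b), with C the compute in FLOPs, and fits it by least squares in log-log space. The form, the function names, and the data points are all assumptions made for illustration; none of them come from the paper.

```python
import math


def fit_power_law(compute, error):
    """Fit error ~ a * compute**(-b) by ordinary least squares on logs.

    The power-law form is an assumption for illustration, not the
    paper's stated scaling law.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_error(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)


# Synthetic points generated from a known power law (a=50, b=0.05);
# these are not measurements from the paper.
compute = [1e18, 1e19, 1e20, 1e21]
error = [50.0 * c ** (-0.05) for c in compute]

a, b = fit_power_law(compute, error)
print(f"a={a:.3f}, b={b:.4f}")
print(f"predicted error at 1e22 FLOPs: {predict_error(a, b, 1e22):.3f}")
```

Fitting in log-log space turns the power law into a straight line, so a simple linear regression recovers both parameters. With real measurements, an additive irreducible-error term would likely be needed, which calls for a nonlinear fit instead.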
