Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty

Hongfei Xue; Yufeng Tang; Jun Zhang; Xuelong Geng; Lei Xie

arXiv:2505.16168·cs.SD·May 23, 2025

Open Access

TL;DR

This paper introduces SIMA, a cost-effective selective invocation method for multilingual ASR that decides, for each input, whether to transcribe the speech directly or invoke a SOTA ASR model, reducing invocation costs and lowering word error rate.

Contribution

The paper presents a selective invocation approach built on a spoken large language model (SLLM), improving multilingual ASR efficiency and accuracy over traditional language identification (LID) routing.

Findings

01. Reduces word error rate by 18.7% compared to the SLLM

02. Halves invocation costs relative to LID-based methods

03. Effective across multiple datasets
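
The first finding is stated in terms of word error rate (WER). As a reminder of how that metric is conventionally computed, here is a minimal sketch: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names are illustrative and not taken from the paper, and this summary does not say whether the 18.7% figure is a relative or an absolute reduction, so `relative_reduction` is shown only as one possible reading.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop from a baseline error rate (assumes baseline > 0)."""
    return (baseline - improved) / baseline
```

Whitespace tokenization is a simplification: languages without spaces between words are usually scored by character error rate or after a language-specific segmentation step.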

Abstract

Although multilingual automatic speech recognition (ASR) systems have advanced significantly, enabling a single model to handle multiple languages, inherent linguistic differences and data imbalances make it hard to reach SOTA performance across all languages. While language identification (LID) models can route speech to the appropriate ASR model, they incur high costs from invoking SOTA commercial models and suffer from inaccuracies due to misclassification. To overcome these limitations, we propose SIMA, a selective invocation method for multilingual ASR that adapts to the difficulty level of the input speech. Built on a spoken large language model (SLLM), SIMA evaluates whether the input is simple enough for direct transcription or requires the invocation of a SOTA ASR model. Our approach reduces word error rates by 18.7% compared to the SLLM and halves invocation costs compared to LID-based methods. Tests on…
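
The routing decision described in the abstract can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not say how the SLLM signals its decision, so `SllmOutput`, its `needs_invocation` flag, and the `sllm` and `sota_asr` callables are all hypothetical stand-ins. The sketch also tracks the fraction of inputs sent to the external model, since that is the quantity the cost claim concerns.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class SllmOutput:
    """Hypothetical SLLM result: a difficulty verdict plus its own transcript."""
    needs_invocation: bool  # True if the SLLM judges the input too hard
    transcript: str         # SLLM transcription, used when no invocation is needed


def route(
    audio: str,
    sllm: Callable[[str], SllmOutput],
    sota_asr: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (transcript, invoked) for a single input."""
    out = sllm(audio)
    if out.needs_invocation:
        # Hard input: pay for the external SOTA ASR model.
        return sota_asr(audio), True
    # Easy input: keep the SLLM's direct transcription at no extra cost.
    return out.transcript, False


def transcribe_batch(
    batch: Sequence[str],
    sllm: Callable[[str], SllmOutput],
    sota_asr: Callable[[str], str],
) -> Tuple[List[str], float]:
    """Transcribe a batch and report the fraction of inputs that were invoked."""
    transcripts: List[str] = []
    invoked_count = 0
    for audio in batch:
        text, invoked = route(audio, sllm, sota_asr)
        transcripts.append(text)
        invoked_count += invoked
    rate = invoked_count / len(batch) if batch else 0.0
    return transcripts, rate
```

By contrast, an LID-based pipeline as described in the abstract routes every input to some SOTA ASR model, so its invocation rate under this accounting would be 1.0; a selective router saves cost in proportion to how many inputs it keeps local.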

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis