Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

Yuxiang Mei; Dongxing Xu; Jiaen Liang; Yanhua Long

arXiv:2601.01461 · cs.CL · February 3, 2026

TL;DR

This paper explores and compares speech-language models and end-to-end architectures for multilingual conversational speech recognition, proposing an enhanced LLM-based system that achieves competitive results with less training data.

Contribution

It introduces an improved LLM-based ASR framework combining fine-tuned Whisper and mHuBERT encoders with novel cross-attention fusion mechanisms, and evaluates their performance against end-to-end models.
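As a rough illustration of what cross-attention fusion between two parallel encoders can look like, the sketch below lets each frame of one encoder's output attend over all frames of the other and adds the attended summary back as a residual. This is a generic, simplified sketch and not the paper's design: the names `cross_attention` and `fuse` are hypothetical, learned projections and multiple heads are omitted, and both feature streams are assumed to share the same dimension.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned Q/K/V projections, so both inputs must
    already have the same feature dimension.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query frame to every frame of the other stream.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the other stream's frames.
        out.append([sum(w * v[c] for w, v in zip(weights, keys_values))
                    for c in range(d)])
    return out


def fuse(primary, secondary):
    """Residual fusion: each primary frame plus its attended summary."""
    attended = cross_attention(primary, secondary)
    return [[p + a for p, a in zip(pf, af)]
            for pf, af in zip(primary, attended)]


# Toy stand-ins for encoder outputs (hypothetical values, 2-dim features).
whisper_frames = [[1.0, 0.0], [0.0, 1.0]]
mhubert_frames = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(whisper_frames, mhubert_frames)
```

The fused sequence keeps the length of the query stream, so the two encoders do not need to produce the same number of frames.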

Findings

01. The proposed system achieves 10.69% CER/WER, ranking among top systems.

02. Fine-tuned Whisper models with LoRA perform well on multilingual ASR.

03. LLM-based ASR still lags behind fine-tuned end-to-end Whisper models.
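For readers unfamiliar with the metric in the first finding, CER and WER are both edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in characters or words. The sketch below shows the standard computation. The combined "CER/WER" figure presumably mixes the two depending on language, but the page does not say how, and the challenge's text normalization is not reproduced here; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit):
    """Corpus-level error rate: total edits / total reference units.

    unit="word" gives WER; any other value gives CER (spaces ignored).
    """
    if unit == "word":
        split = lambda s: s.split()
    else:
        split = lambda s: list(s.replace(" ", ""))
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return edits / total


wer = error_rate(["the cat sat"], ["the cat sit"], unit="word")  # 1 edit / 3 words
cer = error_rate(["你好世界"], ["你好世"], unit="char")            # 1 edit / 4 chars
```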

Abstract

The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end (E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation…
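The abstract contrasts LoRA with full fine-tuning. As a reminder of the difference, LoRA keeps a pretrained weight matrix W frozen and trains only a low-rank update, giving an effective weight W + (alpha / r) · B·A, where r is the adapter rank. The sketch below illustrates that arithmetic on plain lists. The rank, scaling factor, and which Whisper layers are adapted in the paper are not stated on this page, so every value here is a placeholder and `lora_weight` is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A.

    Shapes: W is (out_dim, in_dim), A is (r, in_dim), B is (out_dim, r).
    Only A and B would be trained; W stays frozen.
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Placeholder values: a 2x3 frozen weight and a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B_init = [[0.0], [0.0]]      # B starts at zero, so training begins from W
B_trained = [[1.0], [0.0]]   # hypothetical values after some training

at_start = lora_weight(W, A, B_init, alpha=2.0)
after_training = lora_weight(W, A, B_trained, alpha=2.0)
```

With B initialized to zero, the adapted model starts out identical to the pretrained one. The adapter adds r · (in_dim + out_dim) trainable values per matrix, whereas full fine-tuning updates all in_dim · out_dim entries.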

Code & Models

Models

Hugging Face model: YuCeong-May/MLC-SLM

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems