Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR
Yuxiang Mei, Dongxing Xu, Jiaen Liang, Yanhua Long

TL;DR
This paper explores and compares speech-language models and end-to-end architectures for multilingual conversational speech recognition, proposing an enhanced LLM-based system that achieves competitive results with less training data.
Contribution
It introduces an improved LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders through cross-attention-based fusion mechanisms, and compares this framework against end-to-end Whisper models trained with LoRA and with full fine-tuning.
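To make the fusion idea concrete, here is a minimal sketch of single-head cross-attention between two parallel encoder outputs. It is an illustration under stated assumptions, not the authors' implementation: the choice of Whisper features as queries and mHuBERT features as keys/values, the residual connection, the shared feature dimension, and every name (`cross_attention_fuse`, `w_q`, `w_k`, `w_v`) are hypothetical. Random arrays stand in for real encoder features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(primary, auxiliary, w_q, w_k, w_v):
    """Fuse two feature sequences with single-head cross-attention.

    primary:   (T1, d) frames that provide the queries (assumed: Whisper).
    auxiliary: (T2, d) frames that provide keys/values (assumed: mHuBERT).
    Returns the fused (T1, d) sequence and the (T1, T2) attention weights.
    """
    q = primary @ w_q
    k = auxiliary @ w_k
    v = auxiliary @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each query row sums to 1
    fused = primary + attn @ v               # residual keeps the primary stream
    return fused, attn


# Random stand-ins for encoder outputs; the two streams may differ in length.
rng = np.random.default_rng(0)
t_whisper, t_hubert, d = 6, 4, 8
whisper_feats = rng.standard_normal((t_whisper, d))
hubert_feats = rng.standard_normal((t_hubert, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = cross_attention_fuse(whisper_feats, hubert_feats, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Unlike plain concatenation, this form does not require the two encoders to emit the same number of frames, since every query frame attends over the whole auxiliary sequence.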
Findings
The proposed system achieves 10.69% CER/WER, ranking among top systems.
Fine-tuned Whisper models with LoRA perform well on multilingual ASR.
LLM-based ASR still lags behind fine-tuned end-to-end Whisper models.
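The combined CER/WER figure above refers to edit-distance-based error rates, computed over characters or over words. The sketch below shows the standard Levenshtein-based definition of both metrics; the function names are illustrative, and any text normalization or per-language choice between CER and WER used by the challenge is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate, ignoring spaces."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


# One substitution (sat -> sit) and one deletion (the) against 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```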
Abstract
The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end (E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation…
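For readers unfamiliar with the LoRA fine-tuning mentioned in the abstract, the following is a minimal sketch of the general low-rank adaptation idea: a frozen weight matrix plus a trainable low-rank update scaled by `alpha / rank`. It is not the paper's training setup; the class name `LoRALinear`, the layer sizes, the rank, and the scaling value are all assumptions chosen for illustration.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer with a trainable low-rank update.

    Effective weight: W + (alpha / rank) * B @ A, where only A and B
    would be updated during fine-tuning.
    """

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                              # frozen, pretrained
        self.a = rng.standard_normal((rank, d_in)) * 0.01 # trainable
        self.b = np.zeros((d_out, rank))                  # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.a.T) @ self.b.T
        return base + self.scale * update

    def trainable_params(self):
        return self.a.size + self.b.size


rng = np.random.default_rng(0)
weight = rng.standard_normal((16, 32))   # hypothetical layer size
layer = LoRALinear(weight, rank=4, alpha=8, rng=rng)
x = rng.standard_normal((5, 32))

# With B initialized to zero, the adapted layer initially matches the frozen one.
print(np.allclose(layer(x), x @ weight.T))
print(layer.trainable_params(), weight.size)
```

The parameter counts illustrate the trade-off against full fine-tuning: the low-rank factors hold far fewer trainable values than the full weight matrix, while full fine-tuning updates every entry of `weight`.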
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
