Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Tuan Nguyen; Long-Vu Hoang; Huy-Dat Tran

arXiv:2506.13596 · cs.CL · July 8, 2025

TL;DR

This paper compares Gemma3-12B and Qwen2.5-7B as LLM decoders paired with a fine-tuned Whisper-large-v3 encoder for multilingual speech recognition, and reports competitive results in the MLC-SLM Challenge 2025 using a three-stage training approach.

Contribution

The paper presents a system that connects a fine-tuned Whisper-large-v3 encoder to Gemma and Qwen decoders through a projector, and trains the encoder, projector, and LLM progressively in three stages for multilingual speech recognition.

Findings

01. Average WER/CER of 16.63% on the private test set with Gemma3-12B as the decoder

02. Average WER/CER of 18.6% on the private test set with Qwen2.5-7B as the decoder

03. Competitive performance in the MLC-SLM Challenge 2025

Abstract

This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with an average WER/CER on the private test set of 16.63% using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
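The WER/CER figures quoted above are edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, counted over words (WER) or characters (CER). The sketch below shows that computation in plain Python. It is an illustration only: the function names `edit_distance` and `error_rate` are hypothetical, and the challenge's official scoring script, text normalization, and per-language averaging are not described on this page, so none of that is reproduced here.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: ref token missing from hyp
                curr[j - 1] + 1,     # insertion: extra token in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str], unit: str = "word") -> float:
    """Corpus-level error rate in percent.

    unit="word" splits on whitespace (WER); unit="char" drops spaces and
    compares characters (CER). Both choices are assumptions of this sketch.
    """
    if unit not in ("word", "char"):
        raise ValueError("unit must be 'word' or 'char'")
    total_errors = 0
    total_ref_len = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:
            r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
        total_errors += edit_distance(r, h)
        total_ref_len += len(r)
    if total_ref_len == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * total_errors / total_ref_len


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(error_rate(["the cat sat on the mat"], ["the cat sit on mat"], unit="word"))
```

Errors are pooled over the whole corpus before dividing, so long utterances weigh more than short ones; averaging per-utterance rates instead would generally give a different number.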

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multi-Agent Systems and Negotiation