Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models

Bo Li; Chengben Xu; Wufeng Zhang

arXiv:2506.13300·cs.CL·June 19, 2025

TL;DR

Seewo's systems for both tracks of the MLC-SLM challenge (ASR and speaker diarization with ASR) use a multi-stage training pipeline that combines curriculum learning, Chain-of-Thought data augmentation, and Reinforcement Learning with Verifiable Rewards (RLVR), achieving substantial improvements over the official challenge baselines.

Contribution

The paper presents a multi-stage training approach that incorporates curriculum learning, Chain-of-Thought data augmentation, and RLVR to enhance reasoning and self-correction in speech language models for ASR.

Findings

01

On the evaluation set, the best system attains a WER/CER of 11.57% on Track 1 (ASR) and a tcpWER/tcpCER of 17.67% on Track 2 (SD-ASR).

02

Demonstrated the effectiveness of each training component through ablation studies.

03

Achieved substantial improvements over the official challenge baselines.

Abstract

This paper presents Seewo's systems for both tracks of the Multilingual Conversational Speech Language Model Challenge (MLC-SLM), addressing automatic speech recognition (ASR) and speaker diarization with ASR (SD-ASR). We introduce a multi-stage training pipeline that explicitly enhances reasoning and self-correction in speech language models for ASR. Our approach combines curriculum learning for progressive capability acquisition, Chain-of-Thought data augmentation to foster intermediate reflection, and Reinforcement Learning with Verifiable Rewards (RLVR) to further refine self-correction through reward-driven optimization. This approach achieves substantial improvements over the official challenge baselines. On the evaluation set, our best system attains a WER/CER of 11.57% for Track 1 and a tcpWER/tcpCER of 17.67% for Track 2. Comprehensive ablation studies demonstrate the…
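The error rates quoted in the abstract are edit-distance metrics, and an error rate of this kind is one natural candidate for a verifiable reward. The Python sketch below is illustrative only: it computes word and character error rate from Levenshtein distance and turns WER into a bounded reward. The function names (`edit_distance`, `error_rate`, `wer_reward`) and the reward shape `max(0, 1 - WER)` are assumptions made for this example; the excerpt does not state the reward actually used in the paper, and the time-constrained, speaker-attributed tcpWER of Track 2 is not covered here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """WER (unit="word") or CER (unit="char") as a fraction of reference length."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        # Assumption: spaces are ignored when scoring characters.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer_reward(reference, hypothesis):
    """Hypothetical verifiable reward: 1 for a perfect transcript, floored at 0."""
    return max(0.0, 1.0 - error_rate(reference, hypothesis, unit="word"))


if __name__ == "__main__":
    ref = "the cat sat on mat"
    hyp = "the cat sit on mat"
    print(error_rate(ref, hyp))   # one substitution out of five reference words
    print(wer_reward(ref, hyp))
```

Because WER can exceed 1 when the hypothesis contains many insertions, the reward is floored at 0 so that it stays in the range [0, 1].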

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques