The Eloquence team submission for task 1 of MLC-SLM challenge

Lorenzo Concina; Jordi Luque; Alessio Brutti; Marco Matassoni; Yuchen Zhang

arXiv:2507.19308 · cs.SD · July 28, 2025

TL;DR

This paper explores three approaches to multilingual conversational speech recognition: evaluating the official baseline, training a custom multilingual projector, and applying contrastive learning with extended conversational context, with the aim of making spoken dialogue systems more robust.

Contribution

It introduces novel experiments with different projector architectures and contrastive learning techniques within the MLC-SLM challenge context.

Findings

01

Evaluating the official baseline with linear and Q-Former projectors on different foundation models reveals its strengths and limitations.

02

Custom multilingual projector improves recognition accuracy.

03

Contrastive learning enhances robustness in conversational speech recognition.
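
The summary above does not say which contrastive objective the authors use. As an illustration only, the sketch below implements InfoNCE, a common contrastive loss, over paired speech and text embeddings. The function names, the temperature value, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(speech: List[Vector], text: List[Vector], temperature: float = 0.1) -> float:
    """Mean cross-entropy of matching speech[i] to text[i] against all other texts.

    Hypothetical helper: the i-th speech embedding is treated as the positive
    pair of the i-th text embedding, and every other text is a negative.
    """
    losses = []
    for i, s in enumerate(speech):
        logits = [cosine(s, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Toy data: two speech embeddings and their text counterparts.
speech_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]   # correct pairing
swapped_text = [[0.0, 1.0], [1.0, 0.0]]   # deliberately mismatched pairing

loss_aligned = info_nce(speech_emb, aligned_text)
loss_swapped = info_nce(speech_emb, swapped_text)
```

With these toy vectors the loss is close to zero for the aligned pairing and much larger for the swapped one, which is the behaviour a contrastive objective is meant to reward.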

Abstract

In this paper, we present our studies and experiments carried out for Task 1 of the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM), which focuses on advancing multilingual conversational speech recognition through the development of speech language model architectures. Given the increasing relevance of real-world conversational data for building robust Spoken Dialogue Systems, we explore three approaches to multilingual ASR. First, we evaluate the official baseline to better understand its strengths and limitations, training two projectors (linear and Q-Former) with different foundation models. Second, we leverage the SLAM-ASR framework to train a custom multilingual linear projector. Finally, we investigate the role of contrastive learning and extended conversational context in enhancing the robustness of recognition.
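
To make the idea of a linear projector concrete, the following minimal sketch maps speech-encoder frames into a language model's embedding space. It assumes the projector first concatenates k consecutive encoder frames to shorten the sequence and then applies a single affine map. The frame-stacking step, the dimensions, and all names here are illustrative assumptions, not details confirmed by the abstract.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def stack_frames(frames: Matrix, k: int) -> Matrix:
    """Concatenate every k consecutive frames into one longer vector.

    Trailing frames that do not fill a complete group of k are dropped.
    """
    usable = len(frames) - len(frames) % k
    return [
        [x for frame in frames[i:i + k] for x in frame]
        for i in range(0, usable, k)
    ]


def linear_project(inputs: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Apply y = W x + b to each input vector; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in inputs
    ]


# Toy example: 5 encoder frames of dimension 2, stacked in pairs (k=2),
# then projected to a 3-dimensional stand-in for the LLM embedding space.
encoder_frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0], [9.0, 9.0]]
stacked = stack_frames(encoder_frames, k=2)  # 2 vectors of dimension 4

# Hand-picked weights for readability; in training these would be learned.
weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.5, -1.0]
projected = linear_project(stacked, weight, bias)
```

In a setup of this kind, the speech encoder and the language model are typically kept frozen and only the projector's weights are trained, which is what makes a linear projector a lightweight option. Whether the authors follow exactly this recipe is not stated in the abstract.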

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Emotion and Mood Recognition